[ 
https://issues.apache.org/jira/browse/HDFS-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113251#comment-16113251
 ] 

Daryn Sharp commented on HDFS-10899:
------------------------------------

I'm really trying to find cycles for this jira.  Since I keep seeing locking 
being mentioned, I've skimmed the patch.  A few quick observations:
* I see a lot of full path reconstruction.  This is not cheap.  Avoid requiring 
the full path if possible.
* Why can't a file with a re-encrypting EDEK be renamed?
* The batching behavior is a bit worrisome to me.  Reading this: "my gut 
feeling is this will not significantly block NN (or pause)" makes my gut drop. 
I need more proof than a feeling that this very desirable feature will not tank 
a production cluster.   See below.
* The managing of the depth-first scan claims that tracking a list of path 
components is cheaper than tracking inodes.  How/why?  Creating a list of boxed 
byte arrays has a lot more overhead than tracking inodes.  Back and forth 
resolving of path components to inodes and vice-versa is not cheap at all.

Not clear to me that proper locking is always being used.   I'll spiel what 
might already be known but it's what I'll look for:
* Holding the fsdir lock technically requires holding the fsn lock.
* If the fsn lock is released & reacquired, checkOperation must always be 
called to ensure the NN hasn't dropped into standby.  Logging an edit as a 
standby would be very very bad.
* IIPs cannot be resolved w/o the lock and are null and void if the lock is 
released.
* Edit logs are surprisingly not thread safe.  logEdit must hold the fsn write 
lock or rolling can corrupt the logs.
* logSync must never be called with the write lock.

Regarding edits, you don't always have to sync immediately.  Syncing is 
technically only required before sending a client a response to ensure 
durability.  If you just log edits they will be batched and eventually synced 
by the next write op.  It looks like if the NN was crash during reencryption, 
and buffered edits are lost, that it will correctly resume.

Didn't check out how big the batches are.  Lock time in general definitely 
becomes a concern, read lock or not.  Another consideration is an edit flood.  
Sending more edits than can be buffered without a sync will cause a sync while 
holding the write lock.  Not good.

As for performance.  I'm concerned you are focusing only on raw re-encrypt 
performance.  That means little to me.  Throughput during the process matters.  
Here's some rudimentary ways to measure performance on a quiescent NN.
* Blast the NN with read ops like getFileInfo.  Try to peg it.  Should be easy 
to achieve a steady state of 100-300k ops/sec.  Measure ops/sec during a 
lengthy reencrypt.
* Blast the NN with write ops.  Rename files back and forth in a directory.  
Try to peg it and depending on hw, maybe a steady 5-10k ops/sec.  Measure 
ops/sec during a lengthy reencrypt.
This will give us an idea of how costly the implementation is.  I'll make up 
some numbers.  During an emergency, a 10% hit might be acceptable.  10% is 
unacceptable for a proactive roll.

I'll comment further after I review more.

> Add functionality to re-encrypt EDEKs
> -------------------------------------
>
>                 Key: HDFS-10899
>                 URL: https://issues.apache.org/jira/browse/HDFS-10899
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: encryption, kms
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>         Attachments: editsStored, HDFS-10899.01.patch, HDFS-10899.02.patch, 
> HDFS-10899.03.patch, HDFS-10899.04.patch, HDFS-10899.05.patch, 
> HDFS-10899.06.patch, HDFS-10899.07.patch, HDFS-10899.08.patch, 
> HDFS-10899.09.patch, HDFS-10899.10.patch, HDFS-10899.10.wip.patch, 
> HDFS-10899.11.patch, HDFS-10899.12.patch, HDFS-10899.wip.2.patch, 
> HDFS-10899.wip.patch, Re-encrypt edek design doc.pdf, Re-encrypt edek design 
> doc V2.pdf
>
>
> Currently when an encryption zone (EZ) key is rotated, it only takes effect 
> on new EDEKs. We should provide a way to re-encrypt EDEKs after the EZ key 
> rotation, for improved security.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to