[ 
https://issues.apache.org/jira/browse/HDFS-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858880#comment-15858880
 ] 

Andrew Wang commented on HDFS-10899:
------------------------------------

Hi Xiao, I soaked in this patch for a while, had a bunch of review comments. I 
didn't get through everything, but I think this is worth posting now, and I'll 
give a more detailed review of the next rev:

ReencryptionHandler:
* Rather than having get/setProvider, can we set it in the EZManager 
constructor by getting it from {{dir.getFSNamesystem().getProvider()}} ? Then 
we don't need the retry loop in {{run}} either.
* In FSEditLog, normally we log one edit per call to a {{log}} function. This 
is broken by {{logReencryptZoneStatus}}, could you change it so 
ReencryptionHandler does the for loop over the batch calling {{logSetXAttrs}} 
then a final call to {{logReencryptZoneStatus}} ?
* We shouldn't call logSync inside the write lock in reencryptDir, since it's a 
time consuming operation. In the RPC context, we call this outside the lock, 
just before responding to the caller. This means there's a little gap where 
readers could see uncommitted data. For reencrypt, this is actually very safe 
since the old and new EDEK are functionally equivalent, and we aren't providing 
any guarantees while the zone is still reencrypting.
* We need the write lock held when logging to the FSEditLog.
* I'd prefer we gather all the file EDEK information in one shot with just the 
read lock held, do all the KMS communication, then take the writelock at the 
end to log the batch. This also makes it easier to multi-thread the KMS 
communication stage later, and better performance since we're doing more work 
under the lock.
* There are lock/unlocks across functions which makes this harder to 
understand. For instance, unlock then relock in reencryptINode because it was 
locked in the caller. Consider refactoring? It'd be great if all locking was 
handled with simple lock/unlock in try/finally.
* restoreReencryptFromProgress, javadoc mentions {{lastFileProcessed}} but 
abbreviations here and below are {{lpf}}.
* Is it possible to rework reencryptEncryptionZoneInt so we only call 
reencryptDir once after setting up the iteration state? I think this would also 
make restoreReencryptFromProgress clearer, since it could have fewer arguments.
* Would also consider renaming restoreReencryptFromProgress, like 
{{restoreFromLastFileProcessed}}
* Is it possible to encapsulate the file iteration as an iterator? We pass 
{{subdirs}} around a lot right now.

ReencryptionStatus:
* Add Precondition check to add, to make sure that it's not added if already 
present

EZManager:
* typo: zondId
* What does the "Id" in {{getIdRootEncryptionZone}} mean?
* getIdRootEncryptionZone checks for read in resolvePath, but aren't reencrypt 
and cancel write operations? We don't want to be running this on snapshot 
paths. I think we should push the resolvePath up to the caller of this 
function, since it also lets us reuse the IIP if necessary (resolvePath is 
expensive).
* In FSNamesystem, looks like we call {{resolvePath}} in 
reencryptEncryptionZoneInt but not the IIP version. The normal flow is to 
resolvePath to an IIP in the corresponding FSDir file, then pass it down to 
avoid resolving again for performance reasons.

ReencryptionHandler#checkNNReadyForWrite and catching exceptions
* {{checkNNReadyForWrite}} checks that the NN is active. The handler should 
only be live when the NN is active though, stopActiveServices will interrupt 
the thread. If we're worried about this, it'd be better to periodically check 
for the interrupt and throw InterruptedException.
* With this, we can change {{checkNNReadyForWrite}} into a simple safemode 
check. Since exceptions are expensive, I'd prefer a boolean return value and if 
statement instead.
* Note also that safemode checks need to be done with the lock held, so we 
should do it where we're grabbing the lock anyway. For correctness, I think we 
only need to check this just before we do a write (i.e. apply the new EDEKs to 
our in-memory state). We can still check elsewhere to save doing extra work.
* Related to the above, in two places in {{run}} we catch InterruptedException 
and just log and return. This is a code smell; although it's the top-level 
function, it's still better to re-interrupt the thread before returning.
* I like this article I found on InterruptedException: 
https://www.ibm.com/developerworks/library/j-jtp05236/
* Catching Exception in {{run}} is a code smell. What is the intent? It looks 
like we already catch the checked exceptions, so this will catch 
RuntimeExceptions (which are normally unrecoverable).

Metrics:
* I'd prefer that we track the per zone metrics in ReencryptionStatus, since 
it's what also tracks the currentZone. This avoids having to reset the metrics 
on removeZone.
* Follow-on idea: it'd be nice for admins to be able to query the status of 
queued and running reencrypt commands. Progress indicators like submission 
time, start time, # skipped, # reencrypted, total # (if this is cheap to get) 
would be helpful.

Others:
* Looks like we aren't using the op cache in FSEditLog SetXAttrOp / 
RemoveXAttrOp. I think this is accidental, could you do some research? 
Particularly since we'll be doing a lot of SetXAttrOps, avoiding all that 
object allocation would be nice. This could be a separate JIRA.
* DFSTestUtil, the new comment says "Pause re-encrypt to allow the cancel 
command can go through", but IIUC this isn't about authorization but instead 
pausing reencryption so there's an active zone reencryption to cancel. Could 
you reword?
* fsimage.proto and loader code, since the lastFile belongs to a specific zone, 
I'd prefer we define the on-disk format like this rather than needing to know 
about zone ordering:
* Rather than making FSNamesystem#getFileInfo public for tests, can we dig it 
out from the RPC interface?

{code}
message EncryptionManagerSection {
  message ReencryptCommand {
    required uint64 zone     = 1;
    optional string lastFile = 2;
  }
  repeated ReencryptCommand reencryptCommands = 1;
}
{code}

With the current on-disk format, we can only reencrypt one zone at a time. 
Later, we might want to reencrypt in parallel.

Follow-on:
* This could be difficult with all the lock/unlocks and stages, but I'd prefer 
a goal-pause-time configuration for the {{run}} loop. This is easier for admins 
to reason about. We would still use the batch size for determining when to log 
a batch.

> Add functionality to re-encrypt EDEKs.
> --------------------------------------
>
>                 Key: HDFS-10899
>                 URL: https://issues.apache.org/jira/browse/HDFS-10899
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: encryption, kms
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>         Attachments: editsStored, HDFS-10899.01.patch, HDFS-10899.02.patch, 
> HDFS-10899.03.patch, HDFS-10899.04.patch, HDFS-10899.05.patch, 
> HDFS-10899.06.patch, HDFS-10899.07.patch, HDFS-10899.wip.2.patch, 
> HDFS-10899.wip.patch, Re-encrypt edek design doc.pdf
>
>
> Currently when an encryption zone (EZ) key is rotated, it only takes effect 
> on new EDEKs. We should provide a way to re-encrypt EDEKs after the EZ key 
> rotation, for improved security.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to