[ 
https://issues.apache.org/jira/browse/HDFS-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104471#comment-16104471
 ] 

Xiao Chen commented on HDFS-10899:
----------------------------------

Had a productive offline review session with [~andrew.wang], where we discussed 
several things. Thanks Andrew!

- By design snapshots are immutable, so even after re-encryption, the snapshots 
of the EZ will still have old edeks. If there was a security breach, admins need 
to remove old snapshots, or take manual measures (e.g. cp, or mv to a 
non-snapshot dir then re-encrypt). Should add this to docs. This jira should 
not attempt to touch snapshots.
- Perf - latency
I have been running this in a test cluster (with kerberos + SSL), with 1 sample 
EZ and 1M files. Cluster is on 
[GCE|https://cloud.google.com/compute/docs/machine-types], NN is n1-highmem-4, 
2 KMS instances on n1-standard-2.
Some perf numbers (when all 1M files have old edeks):
-- 1 edek thread: 40~50 mins.
-- 10 edek threads (5000 edeks per task): 13 mins
-- 30 edek threads (1000 edeks per task): 12 mins
Time to generate all the tasks for 1M files is ~10 seconds.
The bottleneck of the entire operation is contacting the KMS: from the NN side 
the HTTPS connection to the KMS took single-digit milliseconds per request on 
average, whereas inside the KMS the actual re-encryption took only tens of 
microseconds. The default keep-alive of 5 connections is used, and the first 5 
connections (clean setup) took even longer.
This led me to prototype a batched re-encryption interface on the KMS, and 
the perf of that is:
-- 20 threads (1000 edeks per task): 1.5 min.
This fits well within our goal of 200M files in 8 hours.
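
To make the "fits within 8 hours" claim concrete, here is a rough 
back-of-envelope extrapolation from the batched number above. It assumes 
throughput scales linearly from 1M to 200M files, which the test runs have not 
verified; the class and method names are made up for illustration.

```java
/** Back-of-envelope extrapolation of the batched run; assumes linear scaling. */
final class ReencryptEstimate {
    /** Edeks re-encrypted per second in a measured run. */
    static double edeksPerSecond(long edeks, double seconds) {
        return edeks / seconds;
    }

    /** Hours needed to process targetEdeks at the given rate. */
    static double hoursFor(long targetEdeks, double edeksPerSecond) {
        return targetEdeks / edeksPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // Measured: 1M edeks in 1.5 min (90 s) with 20 threads, 1000 edeks per task.
        double rate = edeksPerSecond(1_000_000L, 90.0);
        System.out.printf("rate: %.0f edeks/s%n", rate);
        System.out.printf("200M files: %.1f hours%n", hoursFor(200_000_000L, rate));
    }
}
```

Under that assumption 200M files take 200 x 1.5 min = 300 min, i.e. about 5 
hours, which leaves some headroom against the 8 hour goal.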

Discussing with Andrew, we felt the batched API is the way to go. I will file 
another jira to add a batched re-encryption API to the KMS, and update this 
patch to use that.
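
A minimal sketch of the batching idea, to show why it helps: the per-request 
HTTPS cost is paid once per batch instead of once per edek. This is not the 
actual KMS API (that will be defined in the follow-up jira); {{BatchSketch}} 
and the {{batchCall}} function are hypothetical stand-ins for one KMS round 
trip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of batching; not the KMS client API. */
final class BatchSketch {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Re-encrypts all edeks with one call per batch. batchCall stands in for
     * a single round trip to the KMS that handles the whole batch.
     */
    static <T> List<T> reencryptAll(List<T> edeks, int batchSize,
                                    Function<List<T>, List<T>> batchCall) {
        List<T> out = new ArrayList<>(edeks.size());
        for (List<T> batch : partition(edeks, batchSize)) {
            out.addAll(batchCall.apply(batch));
        }
        return out;
    }
}
```

With 1000 edeks per batch, 1M edeks need 1000 round trips rather than 1M.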

- Perf - memory
The above test was done without any throttling. We should throttle the 
{{ReencryptionHandler}} when instantiating Callables, to keep NN memory sane. 
The plan is to use a static calculation, so we only keep a configurable # of 
Callables in memory - the handler simply waits until a Callable is done and 
released before creating a new one. Will have a default number calculated from 
the # of cores of the NN. We should also consider how many edeks there are per 
Callable. Will implement this soon.
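
One possible way to get this "wait until a Callable is released" behaviour is 
a counting semaphore around task submission. The sketch below is only an 
illustration under that assumption, not what the patch does; 
{{ThrottledSubmitter}} and {{defaultMaxOutstanding}} are hypothetical names, 
and using the raw core count as the default is a placeholder.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: caps how many Callables exist at once. */
final class ThrottledSubmitter {
    private final ExecutorService pool;
    private final Semaphore permits;

    ThrottledSubmitter(ExecutorService pool, int maxOutstanding) {
        this.pool = pool;
        this.permits = new Semaphore(maxOutstanding);
    }

    /** Placeholder default derived from the core count (an assumption). */
    static int defaultMaxOutstanding() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    /** Blocks until a permit is free, then submits; the permit is returned when the task ends. */
    <T> Future<T> submit(Callable<T> task) throws InterruptedException {
        permits.acquire();
        Callable<T> wrapped = () -> {
            try {
                return task.call();
            } finally {
                permits.release();
            }
        };
        try {
            return pool.submit(wrapped);
        } catch (RuntimeException e) {
            permits.release(); // submission was rejected, so the task never runs
            throw e;
        }
    }
}
```

The caller would build each Callable only after {{submit}} has room for it, so 
at most {{maxOutstanding}} batches of edeks are held in memory at a time.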

- Perf - lock throttling
Ideally we'd also throttle the {{ReencryptionHandler}} to control what % of 
time it can hold the readlock, and similarly {{ReencryptionUpdater}} for 
writelock. But since we already need to wait for the Callables, this kinda 
comes naturally. I.e. we won't be holding a writelock continuously for a long 
time. So we may not implement this in v1, pending confirmation from further 
perf runs.

- Failure handling:
Now that we use batched re-encryption, it makes sense to simply retry the 
entire Callable (hence the entire batch, since it fails as a whole in 1 call). 
It also sounds more admin-friendly to simply retry forever, with backoffs. If 
the admin finds this annoying they can always cancel. This is better than the 
current way of failing the re-encryption after a few attempts, telling the 
admin, and forcing them to rerun the command.
Also should add fault injectors to unit test some failure scenarios.
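
A rough sketch of what "retry forever, with backoffs, until cancelled" could 
look like. It assumes exponential backoff with a cap and a shared cancel flag; 
the name {{RetryForever}} and its parameters are hypothetical, not taken from 
the patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: retries a whole batch until it succeeds or is cancelled. */
final class RetryForever {
    static <T> T call(Callable<T> batch, long initialBackoffMs, long maxBackoffMs,
                      AtomicBoolean cancelled) throws InterruptedException {
        long backoff = initialBackoffMs;
        while (true) {
            // The admin's cancel is checked before every attempt.
            if (cancelled.get()) {
                throw new CancellationException("re-encryption cancelled");
            }
            try {
                return batch.call(); // the entire batch succeeds or fails together
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                Thread.sleep(backoff);
                backoff = Math.min(maxBackoffMs, backoff * 2); // capped exponential backoff
            }
        }
    }
}
```

A fault injector in a unit test can then be as simple as a Callable that 
throws for its first few invocations and succeeds afterwards.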

> Add functionality to re-encrypt EDEKs
> -------------------------------------
>
>                 Key: HDFS-10899
>                 URL: https://issues.apache.org/jira/browse/HDFS-10899
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: encryption, kms
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>         Attachments: editsStored, HDFS-10899.01.patch, HDFS-10899.02.patch, 
> HDFS-10899.03.patch, HDFS-10899.04.patch, HDFS-10899.05.patch, 
> HDFS-10899.06.patch, HDFS-10899.07.patch, HDFS-10899.08.patch, 
> HDFS-10899.09.patch, HDFS-10899.10.patch, HDFS-10899.10.wip.patch, 
> HDFS-10899.11.patch, HDFS-10899.wip.2.patch, HDFS-10899.wip.patch, Re-encrypt 
> edek design doc.pdf, Re-encrypt edek design doc V2.pdf
>
>
> Currently when an encryption zone (EZ) key is rotated, it only takes effect 
> on new EDEKs. We should provide a way to re-encrypt EDEKs after the EZ key 
> rotation, for improved security.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

