[ https://issues.apache.org/jira/browse/HDFS-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590030#comment-16590030 ]
Zsolt Venczel commented on HDFS-13731:
--------------------------------------

While investigating the above timeouts I found the following concurrency issue:
* while the ReencryptionUpdater.processCheckpoints method is executing and removing tasks from the task list,
* a different thread can add a new re-encryption task to the same task list by calling ReencryptionHandler.submitCurrentBatch, which calls ZoneSubmissionTracker.addTask.

My latest patch contains a proposal to prevent this. I've attached the full log produced for the issue. The important section, where the *processCheckpoints* iteration is still running but a new ZoneSubmissionTracker task is being added:
{code:java}
2018-08-22 17:16:01,535 INFO FSTreeTraverser - Submitted batch (start:/zones/zone/0, size:5) of zone 16387 to re-encrypt.
2018-08-22 17:16:01,535 INFO ReencryptionHandler - Processing batched re-encryption for zone 16387, batch size 5, start:/zones/zone/0
2018-08-22 17:16:01,536 INFO ReencryptionHandler - Completed re-encrypting one batch of 5 edeks from KMS, time consumed: 922873, start: /zones/zone/0.
2018-08-22 17:16:01,536 INFO ReencryptionUpdater - Processing returned re-encryption task for zone /zones/zone(16387), batch size 5, start:/zones/zone/0
2018-08-22 17:16:01,536 DEBUG ReencryptionUpdater - Updating file xattrs for re-encrypting zone /zones/zone, starting at /zones/zone/0
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16388 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16389 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16390 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16391 for re-encryption.
2018-08-22 17:16:01,536 TRACE ReencryptionUpdater - Updating 16392 for re-encryption.
2018-08-22 17:16:01,536 INFO ReencryptionUpdater - Updated xattrs on 5(5) files in zone /zones/zone for re-encryption, starting:/zones/zone/0.
2018-08-22 17:16:01,536 DEBUG ReencryptionUpdater - Updating re-encryption checkpoint with completed task. last: /zones/zone/4 size:5.
2018-08-22 17:16:01,536 INFO FSTreeTraverser - Submitted batch (start:/zones/zone/5, size:5) of zone 16387 to re-encrypt.
2018-08-22 17:16:01,536 INFO ReencryptionHandler - Processing batched re-encryption for zone 16387, batch size 5, start:/zones/zone/5
2018-08-22 17:16:01,537 ERROR ReencryptionUpdater - Re-encryption updater thread exiting.
java.util.ConcurrentModificationException
	at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966)
	at java.util.LinkedList$ListItr.remove(LinkedList.java:921)
	at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.processCheckpoints(ReencryptionUpdater.java:411)
	at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.processTask(ReencryptionUpdater.java:488)
	at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.takeAndProcessTasks(ReencryptionUpdater.java:437)
	at org.apache.hadoop.hdfs.server.namenode.ReencryptionUpdater.run(ReencryptionUpdater.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-08-22 17:16:01,537 INFO ReencryptionHandler - Submission completed of zone 16387 for re-encryption.
{code}

Which results in cancelling the re-encryption tasks:
{code:java}
2018-08-22 17:16:51,612 INFO ReencryptionUpdater - Cancelling 2 re-encryption tasks
...
2018-08-22 17:16:51,621 INFO ReencryptionUpdater - Cancelling 2 re-encryption tasks
{code}

My uploaded patch also fixes two other test-related issues:
* in testRestartAfterReencryptAndCheckpoint the fs.saveNamespace() call was sometimes slow, so the test should wait for the operation to finish;
* the cancelFutureDuringReencryption method introduced a race condition: at
{code:java}
callableRunning.set(true);
Thread.sleep(Long.MAX_VALUE);
{code}
between setting callableRunning to true and putting the thread to sleep, a concurrent modification can happen in rare cases.

> ReencryptionUpdater fails with ConcurrentModificationException during
> processCheckpoints
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-13731
>                 URL: https://issues.apache.org/jira/browse/HDFS-13731
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: encryption, test
>    Affects Versions: 3.0.0
>            Reporter: Xiao Chen
>            Assignee: Zsolt Venczel
>            Priority: Major
>         Attachments: HDFS-13731-failure.log
>
>
> HDFS-12837 fixed some flakiness of Reencryption related tests. But per
> [~zvenczel]'s comment, there are still a few timeouts. We should investigate
> that.
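
For readers who want to see the failure mode from the comment in isolation: the sketch below replays the interleaving on a plain java.util.LinkedList (an add landing between an iterator's next() and remove()), which fails in the same ListItr.remove frame as the stack trace in the log. It then shows one possible way to rule the interleaving out, by guarding both operations with the same monitor. The class and method names (TaskListRaceSketch, Tracker, removeCompleted) are hypothetical stand-ins, not the HDFS classes, and the synchronized guard is an assumption for illustration, not necessarily what the attached patch does.

```java
import java.util.ConcurrentModificationException;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

public class TaskListRaceSketch {

  /** Hypothetical stand-in for a per-zone task list; not the HDFS class. */
  static final class Tracker {
    private final List<String> tasks = new LinkedList<>();

    // Both mutators take the same monitor, so an add can never land
    // between an iterator's next() and remove().
    synchronized void addTask(String task) {
      tasks.add(task);
    }

    synchronized int removeCompleted() {
      int removed = 0;
      ListIterator<String> it = tasks.listIterator();
      while (it.hasNext()) {
        it.next();
        it.remove();
        removed++;
      }
      return removed;
    }
  }

  /** Replays the problematic interleaving deterministically on one thread. */
  static boolean unguardedInterleavingFails() {
    List<String> tasks = new LinkedList<>();
    tasks.add("batch-0");
    ListIterator<String> it = tasks.listIterator();
    it.next();
    tasks.add("batch-5"); // what a submitting thread could do mid-iteration
    try {
      it.remove();        // fail-fast check sees the structural change
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  /** Runs a submitter and an updater concurrently against the guarded list. */
  static int guardedRun(int n) throws InterruptedException {
    Tracker tracker = new Tracker();
    int[] removed = new int[1];
    Thread submitter = new Thread(() -> {
      for (int i = 0; i < n; i++) {
        tracker.addTask("batch-" + i);
      }
    });
    Thread updater = new Thread(() -> {
      for (int i = 0; i < n; i++) {
        removed[0] += tracker.removeCompleted();
      }
    });
    submitter.start();
    updater.start();
    submitter.join();
    updater.join();
    // Drain whatever the updater did not get to; every task is removed once.
    return removed[0] + tracker.removeCompleted();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("unguarded interleaving fails: " + unguardedInterleavingFails());
    System.out.println("guarded total removed: " + guardedRun(1000));
  }
}
```

A coarse monitor is only the simplest option; it serializes submission and checkpoint processing, which may or may not be acceptable depending on how long the real iteration holds the lock.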