[ https://issues.apache.org/jira/browse/HDFS-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395828#comment-16395828 ]
Chen Liang edited comment on HDFS-11142 at 3/12/18 8:33 PM:
------------------------------------------------------------

Hi [~linyiqun],

Thanks for reporting this! One question: is there any analysis of how the GC pause caused the NPE? I was not able to reproduce the error, and there is no NPE in the pasted log, so it is not quite clear how the NPE happened, and it is hard for me to tell whether the NPE will be gone with the patch. Did the patch fix the error in your environment? Is the stack trace of the NPE still available?

In fact, it is interesting to me that a GC pause could cause an NPE here. I don't think this is supposed to happen... I think we'd better look into this more carefully, as the bug might even be somewhere other than the unit tests.

Also, I'm not sure catching all the exceptions here is the best approach. We want the tests to report errors when they should, not swallow all of them; there may be exceptions we do want to throw.

Apart from that, one minor comment: you may use a lambda for {{waitFor()}}:

{code:java}
GenericTestUtils.waitFor(() -> { // <-- use lambda
  boolean result = true;
  try {
    nnProxy.blockReport(bpRegistration, bpId, reports,
        new BlockReportContext(1, 0, reportId, fullBrLeaseId, sorted));
  } catch (Exception e) {
    result = false;
  }
  return result;
}, 3000, 120000);
{code}

> TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
> --------------------------------------------------------------------------------
>
>             Key: HDFS-11142
>             URL: https://issues.apache.org/jira/browse/HDFS-11142
>         Project: Hadoop HDFS
>      Issue Type: Bug
>        Reporter: Yiqun Lin
>        Assignee: Yiqun Lin
>        Priority: Major
>     Attachments: HDFS-11142.001.patch, test-fails-log.txt
>
> The test {{TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit}} fails in trunk. I looked into this; it seems a long GC pause caused the datanode to be shut down unexpectedly while it was doing the large block report, and then the NPE was thrown in the test. The related output log:
> {code}
> 2016-11-15 11:31:18,889 [DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data1, [DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data2]] heartbeating to localhost/127.0.0.1:51450] INFO datanode.DataNode (BPServiceActor.java:blockReport(415)) - Successfully sent block report 0x2ae5dd91bec02273, containing 2 storage report(s), of which we sent 2. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 49 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
> 2016-11-15 11:31:18,890 [DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data1, [DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data2]] heartbeating to localhost/127.0.0.1:51450] INFO datanode.DataNode (BPOfferService.java:processCommandFromActive(696)) - Got finalize command for block pool BP-814229154-172.17.0.3-1479209475497
> 2016-11-15 11:31:24,026 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@97e93f1] INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 4936ms
> GC pool 'PS MarkSweep' had collection(s): count=1 time=4194ms
> GC pool 'PS Scavenge' had collection(s): count=1 time=765ms
> 2016-11-15 11:31:24,026 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@5a4bef8] INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 4898ms
> GC pool 'PS MarkSweep' had collection(s): count=1 time=4194ms
> GC pool 'PS Scavenge' had collection(s): count=1 time=765ms
> 2016-11-15 11:31:24,114 [main] INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1943)) - Shutting down the Mini HDFS Cluster
> 2016-11-15 11:31:24,114 [main] INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdownDataNodes(1983)) - Shutting down DataNode 0
> {code}
> The stack infos:
> {code}
> java.lang.NullPointerException: null
>     at org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit(TestLargeBlockReport.java:97)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
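To illustrate the review point above about not swallowing every exception: instead of a blanket {{catch (Exception e)}}, the check inside the polling lambda can catch only the exception type expected to be transient, so anything unexpected (such as an NPE) still propagates and fails the test. The sketch below is a minimal, self-contained illustration using only the JDK; the {{waitFor}} helper and {{blockReport}} method are hypothetical stand-ins for {{GenericTestUtils.waitFor}} and the real RPC call, not the actual Hadoop code.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class SelectiveRetry {

    // Hypothetical stand-in for GenericTestUtils.waitFor: polls the check
    // every intervalMs until it returns true or timeoutMs elapses.
    static void waitFor(Callable<Boolean> check, long intervalMs, long timeoutMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.call()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("timed out waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }

    static int attempts = 0;

    // Simulated block report: fails transiently on the first two calls,
    // then succeeds.
    static void blockReport() throws IOException {
        attempts++;
        if (attempts < 3) {
            throw new IOException("transient RPC failure");
        }
    }

    public static void main(String[] args) throws Exception {
        waitFor(() -> {
            try {
                blockReport();
                return true;
            } catch (IOException e) {
                // Retry only the exception type we expect to be transient.
                // Anything else (e.g. a NullPointerException) propagates out
                // of the lambda and fails the test immediately instead of
                // being silently swallowed.
                return false;
            }
        }, 10, 5000);
        System.out.println("block report succeeded after " + attempts + " attempts");
    }
}
```

With this shape, the test still tolerates the transient RPC failures it was written to tolerate, while a genuine bug surfaces as a failure rather than a timeout.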