[ https://issues.apache.org/jira/browse/HDFS-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395828#comment-16395828 ]
Chen Liang edited comment on HDFS-11142 at 3/12/18 8:33 PM:
------------------------------------------------------------

Hi [~linyiqun],

Thanks for reporting this! One question: is there any analysis of how the GC pause caused the NPE? I was not able to reproduce the error, and there is no NPE in the pasted log, so it is not quite clear how the NPE happened, and it is hard for me to tell whether the NPE will be gone with the patch. Did the patch fix the error in your environment? Is the stack trace of the NPE still available?

In fact, it is interesting to me that a GC pause could cause an NPE here. I don't think this is supposed to happen... I think we'd better look into this more carefully, as the bug might even be somewhere other than the unit tests.

Also, I'm not sure catching all the exceptions here is the best approach. We want the tests to report errors when they should, not swallow all of them; there may be exceptions we do want to throw.

Apart from that, one minor comment: you may use a lambda for {{waitFor()}}:

{code:java}
GenericTestUtils.waitFor(() -> { // <-- use lambda
  boolean result = true;
  try {
    nnProxy.blockReport(bpRegistration, bpId, reports,
        new BlockReportContext(1, 0, reportId, fullBrLeaseId, sorted));
  } catch (Exception e) {
    result = false;
  }
  return result;
}, 3000, 120000);
{code}

> TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit fails in trunk
> --------------------------------------------------------------------------------
>
>             Key: HDFS-11142
>             URL: https://issues.apache.org/jira/browse/HDFS-11142
>         Project: Hadoop HDFS
>      Issue Type: Bug
>        Reporter: Yiqun Lin
>        Assignee: Yiqun Lin
>        Priority: Major
>     Attachments: HDFS-11142.001.patch, test-fails-log.txt
>
> The test {{TestLargeBlockReport#testBlockReportSucceedsWithLargerLengthLimit}} fails in trunk. I looked into this; it seems a long GC pause caused the datanode to be shut down unexpectedly while it was doing the large block report, and then the NPE was thrown in the test. The related output log:
> {code}
> 2016-11-15 11:31:18,889 [DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data1, [DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data2]] heartbeating to localhost/127.0.0.1:51450] INFO datanode.DataNode (BPServiceActor.java:blockReport(415)) - Successfully sent block report 0x2ae5dd91bec02273, containing 2 storage report(s), of which we sent 2. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 49 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
> 2016-11-15 11:31:18,890 [DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data1, [DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/2/dfs/data/data2]] heartbeating to localhost/127.0.0.1:51450] INFO datanode.DataNode (BPOfferService.java:processCommandFromActive(696)) - Got finalize command for block pool BP-814229154-172.17.0.3-1479209475497
> 2016-11-15 11:31:24,026 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@97e93f1] INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 4936ms
> GC pool 'PS MarkSweep' had collection(s): count=1 time=4194ms
> GC pool 'PS Scavenge' had collection(s): count=1 time=765ms
> 2016-11-15 11:31:24,026 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@5a4bef8] INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(205)) - Detected pause in JVM or host machine (eg GC): pause of approximately 4898ms
> GC pool 'PS MarkSweep' had collection(s): count=1 time=4194ms
> GC pool 'PS Scavenge' had collection(s): count=1 time=765ms
> 2016-11-15 11:31:24,114 [main] INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1943)) - Shutting down the Mini HDFS Cluster
> 2016-11-15 11:31:24,114 [main] INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdownDataNodes(1983)) - Shutting down DataNode 0
> {code}
> The stack infos:
> {code}
> java.lang.NullPointerException: null
>     at org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit(TestLargeBlockReport.java:97)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
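To illustrate the review point above about not swallowing every exception: instead of a blanket {{catch (Exception e)}}, the check inside the polling lambda can catch only the exception type expected to be transient, so anything unexpected (such as an NPE) still propagates and fails the test. The sketch below is a minimal, self-contained illustration using only the JDK; the {{waitFor}} helper and {{blockReport}} method are hypothetical stand-ins for {{GenericTestUtils.waitFor}} and the real RPC call, not the actual Hadoop code.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class SelectiveRetry {

    // Hypothetical stand-in for GenericTestUtils.waitFor: polls the check
    // every intervalMs until it returns true or timeoutMs elapses.
    static void waitFor(Callable<Boolean> check, long intervalMs, long timeoutMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!check.call()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("timed out waiting for condition");
            }
            Thread.sleep(intervalMs);
        }
    }

    static int attempts = 0;

    // Simulated block report: fails transiently on the first two calls,
    // then succeeds.
    static void blockReport() throws IOException {
        attempts++;
        if (attempts < 3) {
            throw new IOException("transient RPC failure");
        }
    }

    public static void main(String[] args) throws Exception {
        waitFor(() -> {
            try {
                blockReport();
                return true;
            } catch (IOException e) {
                // Retry only the exception type we expect to be transient.
                // Anything else (e.g. a NullPointerException) propagates out
                // of the lambda and fails the test immediately instead of
                // being silently swallowed.
                return false;
            }
        }, 10, 5000);
        System.out.println("block report succeeded after " + attempts + " attempts");
    }
}
```

With this shape, the test still tolerates the transient RPC failures it was written to tolerate, while a genuine bug surfaces as a failure rather than a timeout.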