[ 
https://issues.apache.org/jira/browse/HDFS-9590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070031#comment-15070031
 ] 

Xiao Chen commented on HDFS-9590:
---------------------------------

I think this is how the NPE happened; it looks to be a test-specific issue.

In {{TestQJMWithFaults#testRecoverAfterDoubleFailures}}, we inject JN call 
failures in all possible permutations. In the test run where I saw the NPE 
(which I will paste in a later comment), the following happened:
{noformat}
2015-12-20 18:51:46,820 WARN  namenode.FileJournalManager 
(FileJournalManager.java:startLogSegment(127)) - Unable to start log segment 7 
at 
/data/jenkins/workspace/CDH5-Hadoop-HDFS-2.6.0-Clover/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/journalnode-0/test-journal/current/edits_inprogress_0000000000000000007:
 null
2015-12-20 18:51:46,821 FATAL server.JournalNode 
(JournalNode.java:reportErrorOnFile(299)) - Error reported on file 
/data/jenkins/workspace/CDH5-Hadoop-HDFS-2.6.0-Clover/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/journalnode-0/test-journal/current/edits_inprogress_0000000000000000007...
 exiting
java.lang.Exception
        at 
org.apache.hadoop.hdfs.qjournal.server.JournalNode$ErrorReporter.reportErrorOnFile(JournalNode.java:299)
        at 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager.startLogSegment(FileJournalManager.java:130)
        at 
org.apache.hadoop.hdfs.qjournal.server.Journal.startLogSegment(Journal.java:559)
        at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:162)
        at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:198)
        at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
{noformat}
Note that {{reportErrorOnFile}} ends up calling 
{{Storage$StorageDirectory#unlock}} inside the rpc call.
Meanwhile, since we also injected a failure at segment 7, the majority of the 
quorum failed, so {{AsyncLoggerSet#waitForWriteQuorum}} throws an exception, 
and {{TestQJMWithFaults}} shuts down the cluster in its {{finally}} block, 
which also ends up calling {{Storage$StorageDirectory#unlock}}.
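
To illustrate the race: the original {{unlock}} does an unsynchronized null check before {{release()}}, so both callers can pass the check and the slower one then dereferences the nulled-out field. A minimal sketch of the guarded variant (the class and field names here are illustrative stand-ins, not the real {{Storage.java}}):

```java
// Hypothetical sketch, not the actual Storage$StorageDirectory code.
// Two threads racing through the unsynchronized unlock() pattern can
// both pass the null check; the second then NPEs on the nulled field.
// Making unlock() synchronized turns the second call into a no-op.
import java.io.IOException;

public class UnlockRaceSketch {
  // Stand-in for the FileLock held by the storage directory.
  private Object lock = new Object();

  // Guarded version: the monitor serializes callers, and the null
  // check under it makes a double-unlock harmless.
  public synchronized void unlock() throws IOException {
    if (lock == null) {
      return; // already unlocked by the other caller
    }
    // lock.release() and lock.channel().close() would go here
    lock = null;
  }

  public static void main(String[] args) throws Exception {
    UnlockRaceSketch dir = new UnlockRaceSketch();
    // Simulate the JN error-reporter thread and the test's cluster
    // shutdown both unlocking the same storage directory.
    Runnable unlockOnce = () -> {
      try {
        dir.unlock();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    };
    Thread t1 = new Thread(unlockOnce);
    Thread t2 = new Thread(unlockOnce);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    System.out.println("double unlock ok");
  }
}
```

(As noted below, I'd rather not change {{unlock}} itself for a test-only issue; this just shows why the two call paths above collide.)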


It looks to me like this can only happen in the tests, so the impact is 
trivial. I want to retract my initial thought of changing the {{unlock}} 
method; I think we'd better enhance {{MiniJournalCluster#shutdown}} to handle 
this, if we decide to handle it at all. The main concern is that the NPE 
causes the test to terminate early and hides the real exception. Attached 
patch 1 gives a rough idea of this; please review and let me know if it is on 
the right track. Thanks!
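
The shape of the idea is roughly the following (a sketch only; the interface and method names here are illustrative, not the real {{MiniJournalCluster}} API): stop every node even if one throws, remember the first failure, and rethrow it at the end so a shutdown-time NPE cannot mask the exception that actually failed the test.

```java
// Hypothetical sketch of the patch idea; NodeLike and shutdownAll are
// illustrative names, not the real Hadoop API.
import java.util.ArrayList;
import java.util.List;

public class ShutdownSketch {
  interface NodeLike {
    void stop() throws Exception;
  }

  // Stop every node, remembering failures instead of aborting on the
  // first one; rethrow the earliest failure once all nodes stopped.
  public static void shutdownAll(List<NodeLike> nodes) throws Exception {
    Exception first = null;
    for (NodeLike n : nodes) {
      try {
        n.stop();
      } catch (Exception e) {
        if (first == null) {
          first = e; // keep the earliest failure for the caller
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }

  public static void main(String[] args) throws Exception {
    List<NodeLike> nodes = new ArrayList<>();
    // First node fails the way the unlock race would.
    nodes.add(() -> { throw new NullPointerException("unlock race"); });
    final boolean[] secondStopped = { false };
    nodes.add(() -> secondStopped[0] = true);
    try {
      shutdownAll(nodes);
    } catch (NullPointerException expected) {
      // The NPE is still reported, but only after every node stopped.
    }
    System.out.println("second node stopped: " + secondStopped[0]);
  }
}
```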

> NPE in Storage$StorageDirectory#unlock()
> ----------------------------------------
>
>                 Key: HDFS-9590
>                 URL: https://issues.apache.org/jira/browse/HDFS-9590
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>         Attachments: HDFS-9590.01.patch
>
>
> The code looks susceptible to race conditions in multi-threaded 
> runs.
> {code}
>     public void unlock() throws IOException {
>       if (this.lock == null)
>         return;
>       this.lock.release();
>       lock.channel().close();
>       lock = null;
>     }
> {code}
> This is called in a handful of places, and I don't see any protection. Shall 
> we add some synchronization mechanism? Not sure if I missed any design 
> assumptions here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
