[ 
https://issues.apache.org/jira/browse/HDFS-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960391#comment-14960391
 ] 

Steve Loughran commented on HDFS-9249:
--------------------------------------

Wei, in preparing for the next CDH release, you don't something wonderful
1. Found a new way for the NN to fail, and got a stack trace off it.
2. Delved into that stack trace to come up with some reasonable hypothesis as 
to the cause

Those are achievements to exploit

bq.  It's supportability improvement therefore I deem there is no need for a 
test case. 

If it was just the logging I'd agree, though looking at the destination code 
I'd ask for what I'm asking for first: use the java7 combined exception handler 
to also catch and log the {{HadoopIllegalArgumentException}}.

But It's a fix for an NPE on shutdown, which is something we can verify.

Given the code path you highlit does appear to trigger an IOE if you enable 
hadoop auth without defining a NN principal, the IOE should be repeatedly 
triggered. This means we can have a test which sets up an NN appropriately and 
then triggers the failure. 

*This can then verify that the changes to the {{stop()}} call are the ones 
needed for  a robust shutdown.*

That's why I want a test: to verify that all the problems have been identified 
and fixed. 

You've done the hard work: identify a possible cause, come up with a probable 
fix. All it needs is a little test to round it off -one which will even 
validate your hypothesis about probable cause.




> NPE thrown if an IOException is thrown in NameNode.<init>
> ---------------------------------------------------------
>
>                 Key: HDFS-9249
>                 URL: https://issues.apache.org/jira/browse/HDFS-9249
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Minor
>              Labels: supportability
>         Attachments: HDFS-9249.001.patch
>
>
> This issue was found when running test case 
> TestBackupNode.testCheckpointNode, but upon closer look, the problem is not 
> due to the test case.
> Looks like an IOException was thrown in
>     try {
>       initializeGenericKeys(conf, nsId, namenodeId);
>       initialize(conf);
>       try {
>         haContext.writeLock();
>         state.prepareToEnterState(haContext);
>         state.enterState(haContext);
>       } finally {
>         haContext.writeUnlock();
>       }
> causing the namenode to stop, but the namesystem was not yet properly 
> instantiated, causing NPE.
> I tried to reproduce locally, but to no avail.
> Because I could not reproduce the bug, and the log does not indicate what 
> caused the IOException, I suggest make this a supportability JIRA to log the 
> exception for future improvement.
> Stacktrace
> java.lang.NullPointerException: null
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.getFSImage(NameNode.java:906)
> at org.apache.hadoop.hdfs.server.namenode.BackupNode.stop(BackupNode.java:210)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:827)
> at 
> org.apache.hadoop.hdfs.server.namenode.BackupNode.<init>(BackupNode.java:89)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1474)
> at 
> org.apache.hadoop.hdfs.server.namenode.TestBackupNode.startBackupNode(TestBackupNode.java:102)
> at 
> org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint(TestBackupNode.java:298)
> at 
> org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpointNode(TestBackupNode.java:130)
> The last few lines of log:
> 2015-10-14 19:45:07,807 INFO namenode.NameNode 
> (NameNode.java:createNameNode(1422)) - createNameNode [-checkpoint]
> 2015-10-14 19:45:07,807 INFO impl.MetricsSystemImpl 
> (MetricsSystemImpl.java:init(158)) - CheckpointNode metrics system started 
> (again)
> 2015-10-14 19:45:07,808 INFO namenode.NameNode 
> (NameNode.java:setClientNamenodeAddress(402)) - fs.defaultFS is 
> hdfs://localhost:37835
> 2015-10-14 19:45:07,808 INFO namenode.NameNode 
> (NameNode.java:setClientNamenodeAddress(422)) - Clients are to use 
> localhost:37835 to access this namenode/service.
> 2015-10-14 19:45:07,810 INFO hdfs.MiniDFSCluster 
> (MiniDFSCluster.java:shutdown(1708)) - Shutting down the Mini HDFS Cluster
> 2015-10-14 19:45:07,810 INFO namenode.FSNamesystem 
> (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for 
> active state
> 2015-10-14 19:45:07,811 INFO namenode.FSEditLog 
> (FSEditLog.java:endCurrentLogSegment(1228)) - Ending log segment 1
> 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem 
> (FSNamesystem.java:run(5306)) - NameNodeEditLogRoller was interrupted, exiting
> 2015-10-14 19:45:07,811 INFO namenode.FSEditLog 
> (FSEditLog.java:printStatistics(703)) - Number of transactions: 3 Total time 
> for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of 
> syncs: 4 SyncTimes(ms): 2 1 
> 2015-10-14 19:45:07,811 INFO namenode.FSNamesystem 
> (FSNamesystem.java:run(5373)) - LazyPersistFileScrubber was interrupted, 
> exiting
> 2015-10-14 19:45:07,822 INFO namenode.FileJournalManager 
> (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file 
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_inprogress_0000000000000000001
>  -> 
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/edits_0000000000000000001-0000000000000000003
> 2015-10-14 19:45:07,835 INFO namenode.FileJournalManager 
> (FileJournalManager.java:finalizeLogSegment(142)) - Finalizing edits file 
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_inprogress_0000000000000000001
>  -> 
> /data/jenkins/workspace/CDH5.5.0-Hadoop-HDFS-2.6.0/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name2/current/edits_0000000000000000001-0000000000000000003
> 2015-10-14 19:45:07,836 INFO blockmanagement.CacheReplicationMonitor 
> (CacheReplicationMonitor.java:run(169)) - Shutting down 
> CacheReplicationMonitor
> 2015-10-14 19:45:07,836 INFO ipc.Server (Server.java:stop(2485)) - Stopping 
> server on 37835
> 2015-10-14 19:45:07,837 INFO ipc.Server (Server.java:run(718)) - Stopping IPC 
> Server listener on 37835
> 2015-10-14 19:45:07,837 INFO ipc.Server (Server.java:run(844)) - Stopping IPC 
> Server Responder
> 2015-10-14 19:45:07,837 INFO blockmanagement.BlockManager 
> (BlockManager.java:run(3781)) - Stopping ReplicationMonitor.
> 2015-10-14 19:45:07,838 WARN blockmanagement.DecommissionManager 
> (DecommissionManager.java:run(78)) - Monitor interrupted: 
> java.lang.InterruptedException: sleep interrupted
> 2015-10-14 19:45:07,844 INFO namenode.FSNamesystem 
> (FSNamesystem.java:stopActiveServices(1298)) - Stopping services started for 
> active state
> 2015-10-14 19:45:07,845 INFO namenode.FSNamesystem 
> (FSNamesystem.java:stopStandbyServices(1386)) - Stopping services started for 
> standby state
> 2015-10-14 19:45:07,848 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped 
> HttpServer2$SelectChannelConnectorWithSafeStartup@localhost:0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to