[jira] [Created] (HDFS-14045) Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN
Jiandan Yang created HDFS-14045: Summary: Use different metrics in DataNode to better measure latency of heartbeat/blockReports/incrementalBlockReports of Active/Standby NN Key: HDFS-14045 URL: https://issues.apache.org/jira/browse/HDFS-14045 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Reporter: Jiandan Yang

Currently the DataNode uses the same metrics to measure the RPC latency of both NameNodes, but the Active and Standby usually perform differently at any given time, especially in a large cluster. For example, the Standby's RPC latency becomes very long while it is catching up on the edit log, which can make us misread the state of HDFS. Using separate metrics for the Active and Standby would give us more precise metric data.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
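The proposal amounts to keying each latency metric by the NameNode's HA state instead of sharing one counter. A minimal sketch of that idea (the plain map and the method names are illustrative, not the actual Hadoop metrics2 API):

```java
import java.util.HashMap;
import java.util.Map;

public class PerNnMetricsSketch {
    // One {count, totalMs} pair per HA state, so Active and Standby
    // latencies are tracked separately instead of being mixed together.
    static final Map<String, long[]> LATENCY = new HashMap<>();

    static void addHeartbeatLatency(String haState, long millis) {
        long[] s = LATENCY.computeIfAbsent(haState, k -> new long[2]);
        s[0]++;          // sample count
        s[1] += millis;  // total latency
    }

    static double avgLatency(String haState) {
        long[] s = LATENCY.get(haState);
        return s == null || s[0] == 0 ? 0.0 : (double) s[1] / s[0];
    }
}
```

With separate keys, a Standby catching up on the edit log no longer inflates the Active's numbers.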
[jira] [Created] (HDFS-13984) getFileInfo of libhdfs call NameNode#getFileStatus twice
Jiandan Yang created HDFS-13984: Summary: getFileInfo of libhdfs call NameNode#getFileStatus twice Key: HDFS-13984 URL: https://issues.apache.org/jira/browse/HDFS-13984 Project: Hadoop HDFS Issue Type: Improvement Components: libhdfs Reporter: Jiandan Yang Assignee: Jiandan Yang

getFileInfo in hdfs.c calls *FileSystem#exists* first and then calls *FileSystem#getFileStatus*. *FileSystem#exists* also calls *FileSystem#getFileStatus*, as follows:

{code:java}
public boolean exists(Path f) throws IOException {
  try {
    return getFileStatus(f) != null;
  } catch (FileNotFoundException e) {
    return false;
  }
}
{code}

so this ends up calling NameNodeRpcServer#getFileInfo twice. We could implement it with a single call.
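The change amounts to collapsing exists() + getFileStatus() into one lookup: call getFileStatus() once and map the "not found" exception to null. A minimal Java sketch of that single-call shape (the stand-in types and paths are illustrative; the real fix lives in hdfs.c's JNI code):

```java
import java.io.FileNotFoundException;

public class GetFileInfoSketch {
    // Illustrative stand-in for FileStatus.
    static final Object STATUS = new Object();

    // Illustrative stand-in for FileSystem#getFileStatus: one "RPC".
    static Object getFileStatus(String path) throws FileNotFoundException {
        if (!"/exists".equals(path)) {
            throw new FileNotFoundException(path);
        }
        return STATUS;
    }

    // Single-lookup variant: instead of exists() followed by
    // getFileStatus() (two RPCs), do one call and translate
    // FileNotFoundException into a null result.
    static Object getFileInfo(String path) {
        try {
            return getFileStatus(path);
        } catch (FileNotFoundException e) {
            return null;
        }
    }
}
```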
[jira] [Created] (HDFS-13915) replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo
Jiandan Yang created HDFS-13915: Summary: replace datanode failed because of NameNodeRpcServer#getAdditionalDatanode returning excessive datanodeInfo Key: HDFS-13915 URL: https://issues.apache.org/jira/browse/HDFS-13915 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Jiandan Yang Assignee: Jiandan Yang

Consider the following situation:
1. Create a file with the ALLSSD policy.
2. [SSD,SSD,DISK] is returned due to lack of SSD space.
3. The client calls NameNodeRpcServer#getAdditionalDatanode when recovering the write pipeline and replacing a bad datanode.
4. BlockPlacementPolicyDefault#chooseTarget calls StoragePolicy#chooseStorageTypes(3, [SSD,DISK], none, false), but chooseStorageTypes returns [SSD,SSD].
5. numOfReplicas = requiredStorageTypes.size() sets numOfReplicas to 2, so two additional datanodes are chosen.
6. BlockPlacementPolicyDefault#chooseTarget returns four datanodes to the client.
7. DataStreamer#findNewDatanode finds nodes.length != original.length + 1, throws an IOException, and the write finally fails.

The client warning log is:

{code:java}
WARN [DataStreamer for file /home/yarn/opensearch/in/data/120141286/0_65535/table/ucs_process/MANIFEST-093545 block BP-1742758844-11.138.8.184-1483707043031:blk_7086344902_6012765313] org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.5.9:50010,DS-f6d8eb8b-2550-474b-a692-c991d7a6f6b3,SSD], DatanodeInfoWithStorage[11.138.5.153:50010,DS-f5d77ca0-6fe3-4523-8ca8-5af975f845b6,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]], original=[DatanodeInfoWithStorage[11.138.5.4:50010,DS-04826cfc-1885-4213-a58b-8606845c5c42,SSD], DatanodeInfoWithStorage[11.138.9.156:50010,DS-0d15ea12-1bad--84f7-1a4917a1e194,DISK]]).
The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
{code}
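The failing check in step 7 can be sketched as follows (the method name is illustrative; DataStreamer#findNewDatanode performs an equivalent length comparison). With original=2 pipeline nodes, the NameNode returned 4 instead of the expected 3, so the check fails:

```java
public class PipelineRecoveryCheck {
    // Pipeline recovery expects exactly ONE datanode to have been added:
    // the NameNode should return original.length + 1 nodes. Returning
    // more (as in this bug, where two extra nodes were chosen) trips
    // this check and the write fails with an IOException.
    static boolean isValidReplacement(int originalCount, int returnedCount) {
        return returnedCount == originalCount + 1;
    }
}
```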
[jira] [Created] (HDFS-12814) Add blockId when warning slow mirror/disk in BlockReceiver
Jiandan Yang created HDFS-12814: Summary: Add blockId when warning slow mirror/disk in BlockReceiver Key: HDFS-12814 URL: https://issues.apache.org/jira/browse/HDFS-12814 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Jiandan Yang Assignee: Jiandan Yang Priority: Minor

HDFS-11603 added the downstream DataNode IDs and the volume path to these messages. To make debugging easier, those warning logs should also include the blockId.
[jira] [Created] (HDFS-12757) DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException
Jiandan Yang created HDFS-12757: Summary: DeadLock Happened Between DFSOutputStream and LeaseRenewer when LeaseRenewer#renew SocketTimeException Key: HDFS-12757 URL: https://issues.apache.org/jira/browse/HDFS-12757 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Reporter: Jiandan Yang Priority: Major

The Java stack is:

{code}
Found one Java-level deadlock:
=============================
"Topology-2 (735/2000)":
  waiting to lock monitor 0x7fff4523e6e8 (object 0x0005d3521078, a org.apache.hadoop.hdfs.client.impl.LeaseRenewer),
  which is held by "LeaseRenewer:admin@na61storage"
"LeaseRenewer:admin@na61storage":
  waiting to lock monitor 0x7fff5d41e838 (object 0x0005ec0dfa88, a org.apache.hadoop.hdfs.DFSOutputStream),
  which is held by "Topology-2 (735/2000)"

Java stack information for the threads listed above:
===================================================
"Topology-2 (735/2000)":
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.addClient(LeaseRenewer.java:227)
  - waiting to lock <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.getInstance(LeaseRenewer.java:86)
  at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:467)
  at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:479)
  at org.apache.hadoop.hdfs.DFSOutputStream.setClosed(DFSOutputStream.java:776)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeThreads(DFSOutputStream.java:791)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:848)
  - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
  - locked <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  ..
"LeaseRenewer:admin@na61storage":
  at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:750)
  - waiting to lock <0x0005ec0dfa88> (a org.apache.hadoop.hdfs.DFSOutputStream)
  at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:586)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.run(LeaseRenewer.java:453)
  - locked <0x0005d3521078> (a org.apache.hadoop.hdfs.client.impl.LeaseRenewer)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer.access$700(LeaseRenewer.java:76)
  at org.apache.hadoop.hdfs.client.impl.LeaseRenewer$1.run(LeaseRenewer.java:310)
  at java.lang.Thread.run(Thread.java:834)

Found 1 deadlock.
{code}
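The dump shows the two threads acquiring the same pair of monitors in opposite orders, which is the classic lock-ordering deadlock. A minimal sketch of the two acquisition orders (names are illustrative; the test only exercises the paths sequentially, because running both concurrently is exactly what can block forever):

```java
public class LockOrderSketch {
    static final Object STREAM_LOCK = new Object();   // the DFSOutputStream monitor
    static final Object RENEWER_LOCK = new Object();  // the LeaseRenewer monitor

    // close path: takes the stream lock (closeImpl), then needs the
    // renewer lock (endFileLease -> LeaseRenewer.addClient).
    static void closePath(Runnable endLease) {
        synchronized (STREAM_LOCK) {
            synchronized (RENEWER_LOCK) {
                endLease.run();
            }
        }
    }

    // renewer path: takes the renewer lock (LeaseRenewer.run), then needs
    // the stream lock (DFSOutputStream.abort) -- the OPPOSITE order, so
    // two threads on these paths can each hold one lock and wait forever
    // for the other.
    static void renewPath(Runnable abortStream) {
        synchronized (RENEWER_LOCK) {
            synchronized (STREAM_LOCK) {
                abortStream.run();
            }
        }
    }
}
```

The usual fix for this pattern is to make both paths acquire the two locks in the same order, or to drop one lock before taking the other.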
[jira] [Created] (HDFS-12748) Standby NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY
Jiandan Yang created HDFS-12748: Summary: Standby NameNode memory leak when accessing webhdfs GETHOMEDIRECTORY Key: HDFS-12748 URL: https://issues.apache.org/jira/browse/HDFS-12748 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.8.2 Reporter: Jiandan Yang

In our production environment, the standby NN often does full GC. Through MAT we found that the largest object is FileSystem$Cache, which contains 7,844,890 DistributedFileSystem instances. By viewing the call hierarchy of FileSystem.get(), I found that only NamenodeWebHdfsMethods#get calls FileSystem.get(). I don't know why a different DistributedFileSystem is created every time instead of getting a FileSystem from the cache.

{code:java}
case GETHOMEDIRECTORY: {
  final String js = JsonUtil.toJsonString("Path",
      FileSystem.get(conf != null ? conf : new Configuration())
          .getHomeDirectory().toUri().getPath());
  return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
}
{code}

When we close the FileSystem in the GETHOMEDIRECTORY case, the NN no longer does full GC:

{code:java}
case GETHOMEDIRECTORY: {
  FileSystem fs = null;
  try {
    fs = FileSystem.get(conf != null ? conf : new Configuration());
    final String js = JsonUtil.toJsonString("Path",
        fs.getHomeDirectory().toUri().getPath());
    return Response.ok(js).type(MediaType.APPLICATION_JSON).build();
  } finally {
    if (fs != null) {
      fs.close();
    }
  }
}
{code}
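One plausible reason the cache never hits here: FileSystem$Cache keys include the caller's UserGroupInformation, and if every request supplies a key instance that never compares equal to an earlier one, get() always misses and the cache grows without bound until the entries are explicitly closed. An illustrative sketch of that failure mode (not the actual Hadoop cache code):

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLeakSketch {
    // Stand-in for a cache key with identity semantics -- like a fresh
    // UGI object per request. No equals()/hashCode() on purpose.
    static final class Key {
    }

    static final Map<Key, Object> CACHE = new HashMap<>();

    // Every distinct Key instance misses, so each call adds an entry:
    // the "cache" is really an ever-growing leak.
    static Object get(Key key) {
        return CACHE.computeIfAbsent(key, k -> new Object());
    }

    static int size() {
        return CACHE.size();
    }
}
```

Closing the FileSystem after use, as in the fix above, removes its entry from the real cache, which is why the full GCs stop.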
[jira] [Created] (HDFS-12638) NameNode exit due to NPE
Jiandan Yang created HDFS-12638: Summary: NameNode exit due to NPE Key: HDFS-12638 URL: https://issues.apache.org/jira/browse/HDFS-12638 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.8.2 Reporter: Jiandan Yang

The active NameNode exited due to an NPE. By elimination I think the BlockCollection 'bc' is null, but I do not know why. Looking through the history, I guess this issue may have been introduced by [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754]. The NN logs are as follows:

{code:java}
2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
java.lang.NullPointerException
  at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
  at java.lang.Thread.run(Thread.java:834)
{code}
[jira] [Created] (HDFS-12446) FSNamesystem#internalReleaseLease throw IllegalStateException
Jiandan Yang created HDFS-12446: Summary: FSNamesystem#internalReleaseLease throw IllegalStateException Key: HDFS-12446 URL: https://issues.apache.org/jira/browse/HDFS-12446 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.8.1 Reporter: Jiandan Yang

{code}
2017-09-14 10:21:32,042 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_NONMAPREDUCE_-275421369_84, pending creates: 7] has expired hard limit
2017-09-14 10:21:32,042 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-275421369_84, pending creates: 7], src=/user/ads/af_base_n_adf_p4p_pv/data/55f57d72-1542-4acf-b2d4-08af65b0e859
2017-09-14 10:21:32,042 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Unexpected throwable:
java.lang.IllegalStateException: Unexpected block state: blk_1265519060_203004758 is COMMITTED but not COMPLETE, file=55f57d72-1542-4acf-b2d4-08af65b0e859 (INodeFile), blocks=[blk_1265519060_203004758] (i=0)
  at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
  at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:218)
  at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:207)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:3312)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3184)
  at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:383)
  at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:329)
  at java.lang.Thread.run(Thread.java:834)
{code}
[jira] [Created] (HDFS-12390) Supporting DNS to switch mapping
Jiandan Yang created HDFS-12390: Summary: Supporting DNS to switch mapping Key: HDFS-12390 URL: https://issues.apache.org/jira/browse/HDFS-12390 Project: Hadoop HDFS Issue Type: New Feature Components: hdfs, hdfs-client Reporter: Jiandan Yang Assignee: Jiandan Yang

As described in [HDFS-12200|https://issues.apache.org/jira/browse/HDFS-12200], ScriptBasedMapping may drive NN CPU to 100%. ScriptBasedMapping runs a subprocess to get the rack info of a DN/client, so we consider it heavyweight. We planned to use TableMapping instead, but TableMapping does not support refresh and cannot reload the rack info of newly added DataNodes, so we implemented that.
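For reference, a stock TableMapping setup looks roughly like this (the configuration keys are the standard Hadoop ones; the file path and hostnames are illustrative):

```
net.topology.node.switch.mapping.impl = org.apache.hadoop.net.TableMapping
net.topology.table.file.name = /etc/hadoop/conf/topology.table
```

where each line of topology.table is a whitespace-separated "host rack" pair, e.g. `dn001.example.com /rack-01`. Since the table is read once, adding DataNodes means the file must be re-read, which is the refresh capability this issue adds.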
[jira] [Created] (HDFS-12364) Compile Error:TestClientProtocolForPipelineRecovery#testUpdatePipeLineAfterDNReg
Jiandan Yang created HDFS-12364: Summary: Compile Error:TestClientProtocolForPipelineRecovery#testUpdatePipeLineAfterDNReg Key: HDFS-12364 URL: https://issues.apache.org/jira/browse/HDFS-12364 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 2.8.2 Reporter: Jiandan Yang Assignee: Jiandan Yang

The failing line is: dn1.setHeartbeatsDisabledForTests(true)
[jira] [Created] (HDFS-12348) disable removing block to trash while rolling upgrade
Jiandan Yang created HDFS-12348: Summary: disable removing block to trash while rolling upgrade Key: HDFS-12348 URL: https://issues.apache.org/jira/browse/HDFS-12348 Project: Hadoop HDFS Issue Type: New Feature Components: datanode Reporter: Jiandan Yang Assignee: Jiandan Yang

During a rolling upgrade the DataNode moves block files and meta files to trash, and only deletes them when finalize is executed. Workloads that frequently create and delete files (e.g. HBase compaction) can then fill the disk, and in production we will not roll back the namespace when a rolling upgrade fails. Disabling the DataNode trash may be a good way to avoid filling the disk.
[jira] [Created] (HDFS-12200) Optimize CachedDNSToSwitchMapping to avoid cpu utilization is too high
Jiandan Yang created HDFS-12200: Summary: Optimize CachedDNSToSwitchMapping to avoid cpu utilization is too high Key: HDFS-12200 URL: https://issues.apache.org/jira/browse/HDFS-12200 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Jiandan Yang

1. Background:
Our Hadoop cluster separates storage and compute. HDFS is deployed on 600+ machines, and YARN is deployed on another machine pool where offline jobs and online services both run. YARN's offline jobs access HDFS, but the machines used for offline jobs change dynamically because the online services have higher priority: when an online service is idle, its machines are assigned to offline tasks, and when it is busy, it preempts the offline jobs' resources.

We found that NameNode CPU utilization sometimes reaches 90% or even 100%. The most serious case is 100% CPU for a long time, which makes JournalNode writes time out and eventually causes the NameNode to hang. The reason is that offline tasks running on a few hundred servers access HDFS at the same time; the NameNode resolves the rack of each client machine, starting several hundred subprocesses.

{code:java}
"process reaper" #10864 daemon prio=10 os_prio=0 tid=0x7fe270a31800 nid=0x38d93 runnable [0x7fcdc36fc000]
  java.lang.Thread.State: RUNNABLE
  at java.lang.UNIXProcess.waitForProcessExit(Native Method)
  at java.lang.UNIXProcess.lambda$initStreams$4(UNIXProcess.java:301)
  at java.lang.UNIXProcess$$Lambda$7/1447689627.run(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
  at java.lang.Thread.run(Thread.java:834)
{code}

Our configuration is as follows:

{code:java}
net.topology.node.switch.mapping.impl = ScriptBasedMapping,
net.topology.script.file.name = 'a python script'
{code}

2. Optimization:
To solve these problems, we optimized CachedDNSToSwitchMapping:
(1) Add the DataNode IP list to the file configured by dfs.hosts. When the NameNode starts, it preloads the DataNodes' rack information into the cache, resolving a batch of hosts per script run (the corresponding configuration is net.topology.script.number, default 100).
(2) Step (1) ensures the cache holds all the DataNodes' racks, so on a cache miss the host must be a client machine, and we directly return /default-rack.
(3) Each time new DataNodes are added, add their IP addresses to the file specified by dfs.hosts and then run bin/hdfs dfsadmin -refreshNodes; this puts the newly added DataNodes' racks into the cache.
(4) Add a new configuration item, dfs.namenode.topology.resolve-non-cache-host: the value false enables the behavior above, true turns it off; the default is true to keep compatibility.
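Steps (1) and (2) can be sketched as a preloaded map with a default-rack fallback (illustrative only, not the actual CachedDNSToSwitchMapping code):

```java
import java.util.HashMap;
import java.util.Map;

public class RackCacheSketch {
    // Preloaded at NameNode startup from the dfs.hosts file:
    // every DataNode's rack, resolved in batches by the topology script.
    static final Map<String, String> RACK_CACHE = new HashMap<>();

    // With resolve-non-cache-host disabled, a cache miss means the host
    // is a client (not a DataNode), so we return the default rack
    // immediately instead of forking the topology script per lookup.
    static String resolve(String host) {
        return RACK_CACHE.getOrDefault(host, "/default-rack");
    }
}
```

This is what eliminates the per-client subprocess storm: only DataNode hosts ever need real resolution, and those are resolved once up front.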
[jira] [Reopened] (HDFS-12177) NameNode exits due to setting BlockPlacementPolicy loglevel to Debug
[ https://issues.apache.org/jira/browse/HDFS-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang reopened HDFS-12177:
--

> NameNode exits due to setting BlockPlacementPolicy loglevel to Debug
> -
>
> Key: HDFS-12177
> URL: https://issues.apache.org/jira/browse/HDFS-12177
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: block placement
> Affects Versions: 2.8.1
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Attachments: HDFS_9668_1.patch
[jira] [Created] (HDFS-12177) NameNode exits due to setting BlockPlacementPolicy loglevel to Debug
Jiandan Yang created HDFS-12177: Summary: NameNode exits due to setting BlockPlacementPolicy loglevel to Debug Key: HDFS-12177 URL: https://issues.apache.org/jira/browse/HDFS-12177 Project: Hadoop HDFS Issue Type: Bug Components: block placement Affects Versions: 2.8.1 Reporter: Jiandan Yang

The NameNode exits because the ReplicationMonitor thread internally throws an NPE. The NPE is thrown because the builder field is not initialized when logging. Solution: before appending, check whether the builder is null.

{code:java}
if (LOG.isDebugEnabled()) {
  builder = debugLoggingBuilder.get();
  builder.setLength(0);
  builder.append("[");
}
some other codes ...
if (LOG.isDebugEnabled()) {
  builder.append("\nNode ").append(NodeBase.getPath(chosenNode))
      .append(" [");
}
some other codes ...
if (LOG.isDebugEnabled()) {
  builder.append("\n]");
}
{code}

The NN exception log is:

{code:java}
java.lang.NullPointerException
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:722)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:689)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:640)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:608)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:483)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:390)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:419)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:266)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:119)
  at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3768)
  at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3720)
  at java.lang.Thread.run(Thread.java:834)
{code}
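The proposed null check can be sketched as follows (names are illustrative; the real code uses a thread-local StringBuilder in BlockPlacementPolicyDefault). The hazard is that the builder is only initialized when debug logging was enabled at the start of the operation, so if the log level is switched to DEBUG mid-operation, later appends see a null builder and NPE:

```java
public class DebugBuilderSketch {
    static final ThreadLocal<StringBuilder> DEBUG_BUILDER = new ThreadLocal<>();
    // Stand-in for LOG.isDebugEnabled(), which can change at runtime.
    static volatile boolean debugEnabled;

    // Initialized only if debug was enabled at this point.
    static StringBuilder init() {
        if (debugEnabled) {
            StringBuilder b = new StringBuilder("[");
            DEBUG_BUILDER.set(b);
            return b;
        }
        return null;
    }

    // Guarded append: even if debug was switched on after init(),
    // the null check prevents the NPE that crashed ReplicationMonitor.
    static void appendNode(StringBuilder builder, String node) {
        if (debugEnabled && builder != null) {
            builder.append("\nNode ").append(node).append(" [");
        }
    }
}
```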