HDFS reporting varying number of under-replicated blocks; MISSING blocks
Hello all,

I am getting wildly varying reports of under-replicated blocks from dfsadmin and fsck, and I am wondering what is causing this. hadoop dfsadmin -metasave reports about 1,000 actual blocks awaiting replication and about 404,000 MISSING blocks awaiting replication. How do I fix this? Why are there so many MISSING blocks? All datanodes have checked in and none are dead. The number of missing blocks has been increasing by about 50,000 to 100,000 per day for the past few days. Jobs run just fine and there does not appear to be any data missing. Output from hadoop dfsadmin -report, hadoop fsck /, hadoop dfsadmin -metasave, and the namenode web GUI is below.

DFSadmin report:

Configured Capacity: 59070545264640 (53.72 TB)
Present Capacity: 56104851291863 (51.03 TB)
DFS Remaining: 35451062198272 (32.24 TB)
DFS Used: 20653789093591 (18.78 TB)
DFS Used%: 36.81%
Under replicated blocks: 405571
Blocks with corrupt replicas: 110
Missing blocks: 0

Hadoop fsck /:

Total size: 6793151991960 B (Total open files size: 46894626752 B)
Total dirs: 1748
Total files: 229033 (Files currently being written: 486)
Total blocks (validated): 244648 (avg. block size 27767044 B) (Total open file blocks (not validated): 675)
Minimally replicated blocks: 244648 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 931 (0.38054675 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9945922
Corrupt blocks: 0
Missing replicas: 1844 (0.25169903 %)
Number of data-nodes: 20
Number of racks: 1
FSCK ended at Fri Nov 04 08:26:25 CDT 2011 in 7222 milliseconds
The filesystem under path '/' is HEALTHY

Hadoop dfsadmin -metasave excerpt:

231268 files and directories, 245326 blocks = 476594 total
Live Datanodes: 20
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 404829
/file/logs/prod/A/20111031/kj9384-91-prod-oidu.20111028.213210: blk_-9102337703676304543_9404421 (replicas: l: 1 d: 0 c: 0 e: 0) 33.112.130.196:50010 :
..
: blk_9223167009736432600_10568409 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10574235 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10579879 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10582788 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10588269 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10595646 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10598074 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10600859 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10606703 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10612156 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10614776 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10624115 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10630240 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10634047 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10637486 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10644655 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10653784 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10657780 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10661247 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
: blk_9223167009736432600_10664704 MISSING (replicas: l: 0 d: 0 c: 0 e: 0)
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 20

Web GUI (Cluster Summary):

231267 files and directories, 245316 blocks = 476583 total.
Heap Size is 626.94 MB / 8.68 GB (7%)
Configured Capacity: 53.72 TB
DFS Used: 19.05 TB
Non DFS Used: 2.43 TB
DFS Remaining: 32.25 TB
DFS Used%: 35.46 %
DFS Remaining%: 60.02 %
Live Nodes: 20
Dead Nodes: 0
Decommissioning Nodes: 0
Number of Under-Replicated Blocks: 405943

Shawn Higgins
Thomson Reuters
shawn.higg...@thomsonreuters.com
thomsonreuters.com
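For readers comparing the two views above, here is a minimal sketch (not from the original post) that counts under-replicated blocks the way fsck does for closed files, using only the public FileSystem API; the class name and starting path are illustrative assumptions. Like fsck, it skips files that are still open for write (the 486 "Files currently being written" above), which may be one reason the counts reported by the different tools diverge.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UnderReplicationCheck {

  // counts[0] = total blocks seen, counts[1] = blocks with fewer live replicas
  // than the file's replication factor (roughly what fsck reports for closed files).
  private static void walk(FileSystem fs, Path path, long[] counts) throws Exception {
    for (FileStatus status : fs.listStatus(path)) {
      if (status.isDir()) {
        walk(fs, status.getPath(), counts);
      } else {
        // One BlockLocation per block; getHosts() lists datanodes holding a live replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          counts[0]++;
          if (block.getHosts().length < status.getReplication()) {
            counts[1]++;
          }
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // picks up the cluster config from the classpath
    long[] counts = new long[2];
    walk(fs, new Path(args.length > 0 ? args[0] : "/"), counts);
    System.out.println("blocks=" + counts[0] + ", under-replicated=" + counts[1]);
  }
}

Running hadoop fsck / -openforwrite (supported on most releases of this vintage) includes open files in the fsck report and can help narrow down whether the extra blocks belong to files that are still being written.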
Re: HDFS error : Could not Complete file
Hi,

Thanks for the reply. There is no delete command issued from the client code. For your reference, I have attached the program that is used to reproduce this bug. The input is a simple CSV file with 2 million entries.

Thanks,
Sudhan S

On Fri, Nov 4, 2011 at 4:42 PM, Uma Maheswara Rao G 72686 <mahesw...@huawei.com> wrote:
> Looks like the folder was deleted before the file was completed. In HDFS,
> files can be deleted at any time, so the application needs to take care of
> file completeness for its own use.
> Do you have any DFSClient-side logs from the MapReduce job showing exactly
> when the delete command was issued?

Attachment: TestMultipleOutputs.java
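The attachment itself is not reproduced in the archive. For readers without it, a rough sketch of the kind of map-only MultipleOutputs job described in this thread might look like the following; the class name, the number of named outputs, and the key/value types are illustrative assumptions, not the actual attached code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TestMultipleOutputsSketch {

  static final int NUM_OUTPUTS = 5; // assumption: the post only says "n outputs"

  public static class CopyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Copy every input line to each named output; files come out as Output<i>-m-<part>.
      for (int i = 0; i < NUM_OUTPUTS; i++) {
        mos.write("Output" + i, NullWritable.get(), value);
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      mos.close(); // closes the underlying HDFS output streams
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "TestMultipleOutputsSketch");
    job.setJarByClass(TestMultipleOutputsSketch.class);
    job.setMapperClass(CopyMapper.class);
    job.setNumReduceTasks(0); // map-only, as described in the post
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the 2-million-row CSV
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    for (int i = 0; i < NUM_OUTPUTS; i++) {
      MultipleOutputs.addNamedOutput(job, "Output" + i, TextOutputFormat.class,
          NullWritable.class, Text.class);
    }
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether a sketch like this reproduces the reported error will depend on the cluster setup; it is only meant to show the shape of the job whose Output0-m-0 files appear in the quoted logs.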
Re: HDFS error : Could not Complete file
Looks like the folder was deleted before the file was completed. In HDFS, files can be deleted at any time, so the application needs to take care of file completeness for its own use.

Do you have any DFSClient-side logs from the MapReduce job showing exactly when the delete command was issued?

- Original Message -
From: Sudharsan Sampath
Date: Friday, November 4, 2011 2:48 pm
Subject: HDFS error : Could not Complete file
To: hdfs-user@hadoop.apache.org

> Hi,
>
> I have a simple map-reduce program (map only :)) that reads the input and
> emits the same to n outputs, on a single-node cluster with max map tasks
> set to 10 on a 16-core machine.
>
> After a while the tasks begin to fail with the following exception log:
>
> 2011-01-01 03:17:52,149 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=temp,temp ip=/x.x.x.x cmd=delete src=/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2 dst=null perm=null
> 2011-01-01 03:17:52,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_7046642930904717718_23143 on x.x.x.x: size 66148 But it does not belong to any file.
> 2011-01-01 03:17:52,156 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: failed to complete /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 because dir.getFileBlocks() is null and pendingFile is null
> 2011-01-01 03:17:52,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 12 on 9000, call complete(/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0, DFSClient_attempt_201101010256_0006_m_00_2) from x.x.x.x: error: java.io.IOException: Could not complete write to file /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 by DFSClient_attempt_201101010256_0006_m_00_2
> java.io.IOException: Could not complete write to file /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 by DFSClient_attempt_201101010256_0006_m_00_2
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497)
>     at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>
> It looks like a delete command is recorded by FSNamesystem.audit just
> before it errors out saying it could not complete the write to a file
> inside that directory.
>
> Any clue on what could have gone wrong?
>
> Thanks
> Sudharsan S
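To make the failure mode described in this reply concrete, here is a small standalone sketch (not taken from the job in question) that deletes the parent directory of a file while it is still open for write; closing the stream then asks the NameNode to complete a file that no longer exists, which is expected to surface as an IOException along the lines of the "Could not complete write to file ..." error in the quoted log. The paths are made up, and it assumes an HDFS configuration on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteBeforeComplete {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // expects an HDFS config on the classpath
    Path dir = new Path("/tmp/deleteBeforeComplete");    // made-up path for illustration
    Path file = new Path(dir, "Output0-m-0");

    FSDataOutputStream out = fs.create(file);            // file is now open (under construction) on HDFS
    out.write("some data".getBytes());

    // Simulates another client or task cleaning up the directory while the file is still open.
    fs.delete(dir, true);

    // close() asks the NameNode to complete the file; since the file was removed,
    // this is expected to fail with an IOException similar to
    // "Could not complete write to file ... by DFSClient_...".
    out.close();
  }
}

In a MapReduce job, the same situation can arise if something removes the task's _temporary attempt directory while the task still has output streams open under it.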
HDFS error : Could not Complete file
Hi,

I have a simple map-reduce program (map only :)) that reads the input and emits the same to n outputs, on a single-node cluster with max map tasks set to 10 on a 16-core machine.

After a while the tasks begin to fail with the following exception log:

2011-01-01 03:17:52,149 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=temp,temp ip=/x.x.x.x cmd=delete src=/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2 dst=null perm=null
2011-01-01 03:17:52,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_7046642930904717718_23143 on x.x.x.x: size 66148 But it does not belong to any file.
2011-01-01 03:17:52,156 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: failed to complete /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 because dir.getFileBlocks() is null and pendingFile is null
2011-01-01 03:17:52,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 12 on 9000, call complete(/TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0, DFSClient_attempt_201101010256_0006_m_00_2) from x.x.x.x: error: java.io.IOException: Could not complete write to file /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 by DFSClient_attempt_201101010256_0006_m_00_2
java.io.IOException: Could not complete write to file /TestMultipleOuputs1320394241986/_temporary/_attempt_201101010256_0006_m_00_2/Output0-m-0 by DFSClient_attempt_201101010256_0006_m_00_2
    at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:497)
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)

It looks like a delete command is recorded by FSNamesystem.audit just before it errors out saying it could not complete the write to a file inside that directory.

Any clue on what could have gone wrong?

Thanks
Sudharsan S