I'm using Hadoop 0.19.1 on a 60-node cluster; each node has 8 GB of RAM and 4 cores. I have several jobs that run every day, and last night one of them triggered an infinite loop that rendered the cluster inoperable. As the job finished, the following was logged to the JobTracker logs:
2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200908220740_0126_r_000001_0' has completed task_200908220740_0126_r_000001 successfully.
2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200908220740_0126 has completed successfully.
2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...

That last line, "Could not complete file...", then repeats forever, at which point the JobTracker UI stops responding and no more tasks will run. The only way to free things up is to restart the JobTracker.

Both prior to and during the infinite loop, I see the following in the NameNode logs. Because it starts long before the infinite loop I can't tell for sure whether it's related, and it is still happening now, even after the restart and with jobs finishing without issue:

2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

And finally, this warning appears in the NameNode logs just prior as well:

2009-08-25 22:07:22,580 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1458477261945758787_4416123 reported from 172.21.30.4:50010 current size is 5396992 reported size is 67108864

Can anyone point me in a direction to determine what's going on here?

Thanks
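P.S. In case it helps, here's roughly how I've been poking at the affected file through the FileSystem API to see what the NameNode thinks its length and block placement are. This is just a sketch: the class name is mine, and the path is the job history file from the DFSClient loop above.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: dumps the NameNode's view of one file's length
// and per-block locations, to compare against the sizes in the
// "Inconsistent size for block" warning.
public class BlockReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The history file that "Could not complete file" keeps retrying on.
    Path p = new Path("/proc/statpump/incremental/200908260200/_logs/history/"
        + "dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental");

    FileStatus stat = fs.getFileStatus(p);
    System.out.println("length=" + stat.getLen()
        + " blockSize=" + stat.getBlockSize()
        + " replication=" + stat.getReplication());

    // One line per block: offset, length, and the datanodes holding it.
    for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
      System.out.println(loc.getOffset() + "+" + loc.getLength()
          + " on " + Arrays.toString(loc.getHosts()));
    }
  }
}

My thinking is that if 172.21.30.4 (the datanode from the warning) shows up holding a block whose length here disagrees with what it reported, that would at least tell me whether the size mismatch and the stuck file are the same problem.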