Hey Jeremy,
Glad someone else has run into this!
I always thought this specific infinite loop was in my code. I had an issue open for it earlier, but I ultimately was not sure if it was in my code or HDFS, so we closed it:
https://issues.apache.org/jira/browse/HADOOP-4866
We [and others] get these daily. It would be nice to figure out a way to replicate this.
Brian

On Aug 26, 2009, at 8:27 AM, Jeremy Pinkham wrote:
I'm using hadoop 0.19.1 on a 60 node cluster; each node has 8GB of RAM and 4 cores. I have several jobs that run every day, and last night one of them triggered an infinite loop that rendered the cluster inoperable.

As the job finishes, the following is logged to the job tracker logs:

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200908220740_0126_r_000001_0' has completed task_200908220740_0126_r_000001 successfully.
2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200908220740_0126 has completed successfully.
2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...

That last line, "Could not complete file...", then repeats forever, at which point the job tracker UI stops responding and no more tasks will run. The only way to free things up is to restart the jobtracker.

Both prior to and during the infinite loop, I see this in the namenode logs. Because it starts long before the infinite loop I can't tell for sure if it's related, and it is still happening now even after the restart, with jobs finishing without issue:

2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

And finally, this warning appears in the namenode logs just prior as well:

2009-08-25 22:07:22,580 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1458477261945758787_4416123 reported from 172.21.30.4:50010 current size is 5396992 reported size is 67108864

Can anyone point me in a direction to determine what's going on here?

Thanks
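
For reference, here is a minimal sketch of the retry pattern those "Could not complete file ... retrying..." messages suggest. This is illustrative only, not the actual 0.19 DFSClient code; the Completer interface and the names used here are made up. The point is simply that a retry loop with no attempt limit will spin forever if the namenode can never mark the file complete, which matches what the job tracker logs show.

import java.util.concurrent.atomic.AtomicInteger;

public class CompleteRetrySketch {

    // Stand-in for the namenode's "complete file" call; purely illustrative.
    interface Completer {
        boolean complete() throws java.io.IOException;
    }

    // Keeps asking until complete() returns true; there is no retry limit,
    // so if the file can never be completed this loop never exits.
    static void waitForComplete(Completer namenode) throws Exception {
        boolean done = false;
        while (!done) {
            done = namenode.complete();
            if (!done) {
                System.err.println("Could not complete file, retrying...");
                Thread.sleep(400);  // brief back-off, then retry forever
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Succeeds on the 4th attempt here; if it never succeeded, the loop
        // above would spin indefinitely, as described in the report.
        waitForComplete(() -> calls.incrementAndGet() >= 4);
    }
}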