Hey Jeremy,
Glad someone else has run into this!
I always thought this specific infinite loop was in my code. I had an issue open for it earlier, but I ultimately was not sure if it was in my code or HDFS, so we closed it:
https://issues.apache.org/jira/browse/HADOOP-4866
We [and others] get these daily. It would be nice to figure out a way to replicate this.
Brian

On Aug 26, 2009, at 8:27 AM, Jeremy Pinkham wrote:
I'm using hadoop 0.19.1 on a 60 node cluster; each node has 8GB of RAM and 4 cores. I have several jobs that run every day, and last night one of them triggered an infinite loop that rendered the cluster inoperable.

As the job finishes, the following is logged to the job tracker logs:

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200908220740_0126_r_000001_0' has completed task_200908220740_0126_r_000001 successfully.
2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200908220740_0126 has completed successfully.
2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...

That last line, "Could not complete file...", then repeats forever, at which point the job tracker UI stops responding and no more tasks will run. The only way to free things up is to restart the jobtracker.

Both prior to and during the infinite loop, I see this in the namenode logs. Because it starts long before the infinite loop I can't tell for sure if it's related, and it is still happening now even after the restart, with jobs finishing without issue:

2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

And finally, this warning appears in the namenode logs just prior as well:

2009-08-25 22:07:22,580 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1458477261945758787_4416123 reported from 172.21.30.4:50010 current size is 5396992 reported size is 67108864

Can anyone point me in a direction to determine what's going on here?

Thanks
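
For reference, here is a minimal sketch of the retry pattern those "Could not complete file ... retrying..." messages suggest. This is illustrative only, not the actual 0.19 DFSClient code; the Completer interface and the names used here are made up. The point is simply that a retry loop with no attempt limit will spin forever if the namenode can never mark the file complete, which matches what the job tracker logs show.

import java.util.concurrent.atomic.AtomicInteger;

public class CompleteRetrySketch {

    // Stand-in for the namenode's "complete file" call; purely illustrative.
    interface Completer {
        boolean complete() throws java.io.IOException;
    }

    // Keeps asking until complete() returns true; there is no retry limit,
    // so if the file can never be completed this loop never exits.
    static void waitForComplete(Completer namenode) throws Exception {
        boolean done = false;
        while (!done) {
            done = namenode.complete();
            if (!done) {
                System.err.println("Could not complete file, retrying...");
                Thread.sleep(400);  // brief back-off, then retry forever
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Succeeds on the 4th attempt here; if it never succeeded, the loop
        // above would spin indefinitely, as described in the report.
        waitForComplete(() -> calls.incrementAndGet() >= 4);
    }
}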