I'm using Hadoop 0.19.1 on a 60-node cluster; each node has 8 GB of RAM
and 4 cores.  I have several jobs that run every day, and last night one
of them triggered an infinite loop that rendered the cluster inoperable.
As the job finished, the following was logged to the JobTracker logs:

 

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200908220740_0126_r_000001_0' has completed task_200908220740_0126_r_000001 successfully.

2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200908220740_0126 has completed successfully.

2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...

 

That last line, "Could not complete file...", then repeats forever, at
which point the JobTracker UI stops responding and no more tasks will
run.  The only way to free things up is to restart the JobTracker.
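
For reference, here is a rough sketch of the client-side retry that I believe prints that line. It is paraphrased from my reading of the 0.19 DFSClient close path, not copied from it; the class name, the NameNodeStub interface, and the use of System.out in place of the client log are my own simplifications:

import java.io.IOException;

// Rough, self-contained sketch (not the real DFSClient code) of the loop
// that logs "Could not complete file ... retrying...". The key point is
// that there is no retry cap: the client keeps asking the NameNode to
// finalize the file until it says yes.
public class CompleteFileRetrySketch {

    // Stand-in for the RPC proxy the real client holds to the NameNode.
    interface NameNodeStub {
        // Returns true once the NameNode considers every block of the
        // file complete; false means "not yet, ask again".
        boolean complete(String src, String clientName) throws IOException;
    }

    private final NameNodeStub namenode;

    public CompleteFileRetrySketch(NameNodeStub namenode) {
        this.namenode = namenode;
    }

    void completeFile(String src, String clientName) throws IOException {
        long start = System.currentTimeMillis();
        boolean fileComplete = false;
        while (!fileComplete) {
            fileComplete = namenode.complete(src, clientName);
            if (!fileComplete) {
                try {
                    Thread.sleep(400);
                } catch (InterruptedException ie) {
                    // ignored; the loop simply tries again
                }
                if (System.currentTimeMillis() - start > 5000) {
                    System.out.println("Could not complete file " + src + " retrying...");
                }
            }
            // No retry limit or timeout: if the NameNode never reports the
            // last block as complete (e.g. the "already commited,
            // storedBlock == null" state in the logs below), this spins
            // forever, which matches what I'm seeing on the JobTracker.
        }
    }
}

If that reading is right, the endless retry is just a symptom; the real question is why the NameNode never considers the history file's last block complete.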

 

Both prior to and during the infinite loop, I see the following in the
NameNode logs.  Because it starts long before the infinite loop, I can't
tell for sure whether it's related, and it is still happening now, even
after the restart, with jobs finishing without issue:

 

2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

 

And finally, this warning appears in the NameNode logs just prior as
well:

 

2009-08-25 22:07:22,580 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1458477261945758787_4416123 reported from 172.21.30.4:50010 current size is 5396992 reported size is 67108864

 

Can anyone point me in a direction to determine what's going on here?

 

Thanks


