Thanks Brian.  I'm trying to find a way to reliably replicate it, and
will certainly update this list if I manage to do so.  It is happening
with more frequency in our QA environment, which is a much smaller
cluster (only 2 nodes), but still not deterministically.  Hopefully we
can home in on something.

-----Original Message-----
From: Brian Bockelman [mailto:bbock...@cse.unl.edu] 
Sent: Wednesday, August 26, 2009 9:54 AM
To: common-user@hadoop.apache.org
Subject: Re: 0.19.1 infinite loop

Hey Jeremy,

Glad someone else has run into this!

I always thought this specific infinite loop was in my code.  I had an  
issue open for it earlier, but I ultimately was not sure if it was in  
my code or HDFS, so we closed it:

https://issues.apache.org/jira/browse/HADOOP-4866

We [and others] get these daily.  It would be nice to figure out a way  
to replicate this.

Brian

On Aug 26, 2009, at 8:27 AM, Jeremy Pinkham wrote:

> I'm using Hadoop 0.19.1 on a 60-node cluster; each node has 8GB of RAM
> and 4 cores.  I have several jobs that run every day, and last night one
> of them triggered an infinite loop that rendered the cluster inoperable.
> As the job finishes, the following is logged to the job tracker logs:
>
>
>
> 2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200908220740_0126_r_000001_0' has completed task_200908220740_0126_r_000001 successfully.
>
> 2009-08-25 22:08:04,633 INFO org.apache.hadoop.mapred.JobInProgress: Job job_200908220740_0126 has completed successfully.
>
> 2009-08-25 22:08:09,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /proc/statpump/incremental/200908260200/_logs/history/dup-jt_1250941231725_job_200908220740_0126_hadoop_statpump-incremental retrying...
>
>
>
> That last line, "Could not complete file..." then repeats forever, at
> which point the job tracker UI stops responding and no more tasks will
> run.  The only way to free things up is to restart the jobtracker.
>
>
>
> Both prior to and during the infinite loop, I see this in the namenode
> logs.  Because it starts long before the infinite loop, I can't tell for
> sure if it's related, and it is still happening now, even after the
> restart and with jobs finishing without issue:
>
>
>
> 2009-08-25 22:08:05,760 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310, call nextGenerationStamp(blk_2796235715791117970_4385127) from 172.21.30.2:48164: error: java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
> java.io.IOException: blk_2796235715791117970_4385127 is already commited, storedBlock == null.
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4552)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
>        at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>
>
>
> And finally, this warning appears in the namenode logs just prior as
> well:
>
>
>
> 2009-08-25 22:07:22,580 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1458477261945758787_4416123 reported from 172.21.30.4:50010 current size is 5396992 reported size is 67108864
>
>
>
> Can anyone point me in a direction to determine what's going on here?
>
>
>
> Thanks
>
>
>
> The information transmitted in this email is intended only for the  
> person(s) or entity to which it is addressed and may contain  
> confidential and/or privileged material. Any review, retransmission,  
> dissemination or other use of, or taking of any action in reliance  
> upon, this information by persons or entities other than the  
> intended recipient is prohibited. If you received this email in  
> error, please contact the sender and permanently delete the email  
> from any computer.
>



