I had a problem like that with a custom record writer - SOLR-1301

On Mon, Sep 28, 2009 at 11:18 PM, Chandraprakash Bhagtani <
[email protected]> wrote:

> I faced the org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException
> exception once. I was overriding FileOutputFormat in a class in which I
> had opened a file stream, because I needed only a single file as output.
> It worked fine with one reducer, but when I increased the number of
> reducers, every reducer tried to create/use a file with the same name,
> and I got AlreadyBeingCreatedException.
>
> Your case may be different, but I thought I'd share mine.
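(The usual way out of that collision is to give each reduce attempt its own output name, for example by suffixing the base name with the task attempt ID so concurrent attempts never race to create the same HDFS path. A minimal sketch of just the naming scheme, in plain Java with an illustrative attempt-ID string - not the poster's actual code:

```java
// Sketch: build a per-attempt output file name so that multiple reducers
// (or multiple attempts of the same reducer) never try to create the same
// HDFS file. The attempt-ID string below is illustrative.
public class UniqueOutputName {
    // Suffix the base name with the task attempt ID to make it unique.
    static String uniqueName(String base, String taskAttemptId) {
        return base + "-" + taskAttemptId;
    }

    public static void main(String[] args) {
        System.out.println(
            uniqueName("part", "attempt_200909231347_0694_r_000002_0"));
    }
}
```

Each attempt then writes to its own file, and a commit step can rename the winner into place afterwards.)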
>
> On Tue, Sep 29, 2009 at 11:03 AM, Jason Venner <[email protected]
> >wrote:
>
> > How long does it take to create a file on one of your datanodes, in the
> > dfs block storage area, while your job is running? It could simply be
> > that the OS-level file creation is taking longer than the RPC timeout.
> >
> > On Mon, Sep 28, 2009 at 5:30 PM, dave bayer <[email protected]>
> > wrote:
> >
> > > On a cluster running 0.19.2:
> > >
> > > We have some production ETL jobs that open files in HDFS during the
> > > reduce task (with speculative execution in the reduce stage
> > > programmatically turned off). Since upgrading the cluster from 0.19.1,
> > > we've been seeing some odd behavior: timeouts during block/file
> > > creation, long enough that the reduce attempt gets killed. Subsequent
> > > reduce attempts then fail because the first, killed attempt is still
> > > noted (by the namenode, I assume) as creating the block/file,
> > > according to the exception that bubbles up. I didn't see anything like
> > > this in JIRA, and I'm trying to grab a few jstacks from the namenode
> > > when these errors pop up (usually correlated with a somewhat busy
> > > cluster) in an effort to get some idea of what is going on here.
> > >
> > > Currently the cluster is small, with about 5 datanodes and tens of
> > > TBs of data; even 2x the namespace files easily fit in memory. I
> > > don't see any process eating more than a couple percent of CPU on the
> > > namenode box (which also hosts the secondary namenode). iostat shows
> > > 100-200 blocks read/written every other second on this host, leaving
> > > plenty of headroom there. The cluster is scheduled to grow in the
> > > near future, which may worsen this hang/blocking if it's due to a
> > > bottleneck.
> > >
> > > Before I start tracing through the code, I thought I might ask
> > > whether anyone has seen anything like the excerpts from the
> > > jobtracker logs below. Is there a way to guarantee that all
> > > in-process tasks for a given reduce task will be terminated (and any
> > > associated network connections sent a reset or something) before a
> > > new reduce attempt is started?
> > >
> > > On a kind of side thought: is the task attempt name in the jobconf
> > > that is handed to the reducer in configure(), and if so, what setting
> > > name would get at it? Or does one need to go through a more
> > > circuitous route to obtain the TaskAttemptID associated with the
> > > attempt?
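(On 0.19-era clusters the attempt ID is, to my recollection, exposed to tasks through the JobConf property "mapred.task.id" - treat that property name as an assumption and verify it against your version. Once you have the string, the pieces can be parsed out; a hand-rolled sketch with an attempt ID taken from the logs in this thread, using plain string splitting rather than Hadoop's own TaskAttemptID.forName():

```java
// Sketch: pull the partition and attempt number out of an attempt-ID string.
// In the old (mapred) API, configure(JobConf conf) would obtain the string
// via conf.get("mapred.task.id") (assumed property name for 0.19 - verify).
public class AttemptIdInfo {
    // Format: attempt_<jobtracker start>_<job#>_<m|r>_<partition>_<attempt#>
    static int partition(String attemptId) {
        String[] parts = attemptId.split("_");
        return Integer.parseInt(parts[4]);   // zero-padded partition number
    }

    static int attemptNumber(String attemptId) {
        String[] parts = attemptId.split("_");
        return Integer.parseInt(parts[5]);   // retry count for this partition
    }

    public static void main(String[] args) {
        String id = "attempt_200909231347_0694_r_000002_3";
        System.out.println("partition=" + partition(id)
                + " attempt=" + attemptNumber(id));
    }
}
```

That avoids the more circuitous route of threading the ID through job setup by hand.)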
> > >
> > > Back to the point at hand, from the jobtracker logs:
> > >
> > > Failing initial reduce:
> > > ----------------------------
> > > 2009-09-27 22:24:25,056 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> > > from attempt_200909231347_0694_r_000002_0: java.net.SocketTimeoutException:
> > > 69000 millis timeout while waiting for channel to be ready for read. ch :
> > > java.nio.channels.SocketChannel[connected local=/X.X.X.2:47440
> > > remote=/X.X.X.2:50010]
> > >       at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
> > >       at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
> > >       at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
> > >       at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116)
> > >       at java.io.DataInputStream.readByte(DataInputStream.java:248)
> > >       at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
> > >       at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
> > >       at org.apache.hadoop.io.Text.readString(Text.java:400)
> > >       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2787)
> > >       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2712)
> > >       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> > >       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)
> > >
> > > Failing second reduce:
> > > -------------------------------
> > > 2009-09-27 22:53:22,048 INFO org.apache.hadoop.mapred.TaskInProgress: Error
> > > from attempt_200909231347_0694_r_000002_3: org.apache.hadoop.ipc.RemoteException:
> > > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
> > > create file >blah<
> > > for DFSClient_attempt_200909231347_0694_r_000002_3 on client X.X.X.7,
> > > because this file is already being created by
> > > DFSClient_attempt_200909231347_0694_r_000002_0 on X.X.X.2
> > >       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1085)
> > >       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:998)
> > >       at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:301)
> > >       at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
> > >       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >       at java.lang.reflect.Method.invoke(Method.java:597)
> > >       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> > >       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> > >
> > >       at org.apache.hadoop.ipc.Client.call(Client.java:697)
> > >       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> > >       at $Proxy1.create(Unknown Source)
> > >       at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> > >       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >       at java.lang.reflect.Method.invoke(Method.java:597)
> > >       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> > >       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> > >       at $Proxy1.create(Unknown Source)
> > >       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2594)
> > >       at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
> > >       at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:188)
> > >       at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
> > >       at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
> > >       at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
> > >
> > >
> > > Many thanks...
> > >
> > > dave bayer
> > >
> >
> >
> >
> > --
> > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> > http://www.amazon.com/dp/1430219424?tag=jewlerymall
> > www.prohadoopbook.com a community for Hadoop Professionals
> >
>
>
>
> --
> Thanks & Regards,
> Chandra Prakash Bhagtani,
>



