On Mon, Jul 5, 2010 at 5:08 AM, elton sky <eltonsky9...@gmail.com> wrote:

> Segel, Jay
> Thanks for the reply!
>
> > Your parallelism comes from multiple tasks running on different nodes
> > within the cloud. By default you get one map task per block. You can
> > write your own splitter to increase this and then get more parallelism.
> Sounds like an elegant solution. We could modify 'distcp', which is already
> a simple MR job, so that it copies on a per-block basis rather than per
> file.
>

There's actually an open ticket somewhere to make distcp do this using the
new concat() API in the NameNode. concat() allows several files to be
combined into one file at the metadata level, so long as a number of
restrictions are met. The work hasn't been done yet, but the concat() call
is there and waiting for a user.
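
For anyone curious, here is a rough sketch of how per-block copies could be
stitched back together with concat() once that distcp work happens. The paths
and the idea that each map task copied one piece are made up for illustration;
only DistributedFileSystem#concat is the real call, and it is subject to HDFS's
restrictions on block sizes and boundaries:

    // Hypothetical post-copy step: stitch per-block temp files into one file.
    // concat() is a metadata-only operation at the NameNode; no data moves.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ConcatSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;  // HDFS only

        Path target = new Path("/copied/file.part-00000");       // first piece
        Path[] rest = new Path[] {
            new Path("/copied/file.part-00001"),                 // hypothetical
            new Path("/copied/file.part-00002"),                 // temp pieces
        };
        dfs.concat(target, rest);   // target now contains all three pieces
      }
    }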

-Todd


>
> > in practice, you very rarely know how big your output is going to be
> > before it's produced, so this doesn't really work
> I think you've hit on why Yahoo made this design decision. Multithreading
> is only applicable when you already know the size of the file, as when
> copying existing files, so you can split the work and feed the pieces to
> different threads.
>
> On Sat, Jul 3, 2010 at 1:24 AM, Jay Booth <jaybo...@gmail.com> wrote:
>
> > Yeah, a good way to think of it is that parallelism is achieved at the
> > application level.
> >
> > On the input side, you can process multiple files in parallel or one
> > file in parallel by logically splitting and opening multiple readers
> > of the same file at multiple points.  Each of these readers is single
> > threaded, because, well, you're returning a stream of bytes in order.
> > It's inherently serial.
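
(A small sketch of what "multiple readers of the same file at multiple points"
looks like at the API level. The file name, split count and buffer size are
made up; FileSystem#open, getFileStatus and FSDataInputStream#seek are the
real calls.)

    // Open N independent readers of one HDFS file, each positioned at its own
    // offset -- roughly what the framework does when it hands each map task a
    // FileSplit. Each reader is serial, but the readers run in parallel.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitReaders {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/big.log");          // hypothetical input
        long fileLen = fs.getFileStatus(file).getLen();
        int numSplits = 4;
        long splitLen = fileLen / numSplits;

        Thread[] readers = new Thread[numSplits];
        for (int i = 0; i < numSplits; i++) {
          final long start = i * splitLen;
          final long end = (i == numSplits - 1) ? fileLen : start + splitLen;
          readers[i] = new Thread(() -> {
            try (FSDataInputStream in = fs.open(file)) {
              in.seek(start);                           // jump to this split
              byte[] buf = new byte[64 * 1024];
              long pos = start;
              while (pos < end) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, end - pos));
                if (n < 0) break;
                pos += n;
              }
            } catch (IOException e) {
              throw new RuntimeException(e);
            }
          });
          readers[i].start();
        }
        for (Thread t : readers) t.join();
      }
    }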
> >
> > On the reduce side, multiple reduces run, writing to multiple files in
> > the same directory.  Again, you can't really write to a single file in
> > parallel effectively -- you can't write byte 26 before byte 25,
> > because the file's not that long yet.
> >
> > Theoretically, maybe you could have all reduces write to the same file
> > by allocating some amount of space ahead of time and writing to the
> > blocks in parallel - in practice, you very rarely know how big your
> > output is going to be before it's produced, so this doesn't really
> > work.  Multiple files in the same directory achieves the same goal
> > much more elegantly, without exposing a bunch of internal details of
> > the filesystem to user space.
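
(If a single file really is needed at the end, the usual answer is a separate
serial merge step after the job rather than a parallel write. A hedged sketch
with made-up paths; FileUtil.copyMerge was the stock helper in the Hadoop
releases of that era, and 'hadoop fs -getmerge' does the same thing from the
shell, merging to the local filesystem.)

    // Merge the per-reducer part files (part-00000, part-00001, ...) into a
    // single file. Note this is a plain serial copy, not a parallel write.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeParts {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path jobOutput = new Path("/user/elton/job-output");   // hypothetical
        Path merged = new Path("/user/elton/merged.out");      // hypothetical
        FileUtil.copyMerge(fs, jobOutput, fs, merged,
            false /* keep the source dir */, conf, null /* no separator */);
      }
    }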
> >
> > Does that make sense?
> >
> >
> >
> > On Fri, Jul 2, 2010 at 9:26 AM, Segel, Mike <mse...@navteq.com> wrote:
> > > Actually they also listen here and this is a basic question...
> > >
> > > I'm not an expert, but how does having multiple threads really help
> > > this problem?
> > >
> > > I'm assuming you're talking about a map/reduce job and not some specific
> > > client code which is being run on a client outside of the cloud/cluster....
> > >
> > > I wasn't aware that you could easily synchronize threads running on
> > > different JVMs. ;-)
> > >
> > > Your parallelism comes from multiple tasks running on different nodes
> > > within the cloud. By default you get one map task per block. You can
> > > write your own splitter to increase this and then get more parallelism.
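
(To make the "write your own splitter" option concrete: for splittable input
you can often get several map tasks per block just by capping the split size,
with no custom InputFormat at all. A hedged sketch with made-up sizes and
class name; setMaxInputSplitSize on the new-API FileInputFormat is the real
knob.)

    // Ask for splits smaller than a block so that more map tasks (and hence
    // more parallel readers) are created per file. Sizes are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallSplitsJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "more-splits");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // With 64 MB blocks, a 16 MB cap yields roughly 4 map tasks per block.
        FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
        // ... configure mapper/reducer/output as usual, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }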
> > >
> > > HTH
> > >
> > > -Mike
> > >
> > >
> > > -----Original Message-----
> > > From: Hemanth Yamijala [mailto:yhema...@gmail.com]
> > > Sent: Friday, July 02, 2010 2:56 AM
> > > To: general@hadoop.apache.org
> > > Subject: Re: Why single thread for HDFS?
> > >
> > > Hi,
> > >
> > > Can you please post this on hdfs-...@hadoop.apache.org ? I suspect the
> > > most qualified people to answer this question would all be on that
> > > list.
> > >
> > > Hemanth
> > >
> > > On Fri, Jul 2, 2010 at 11:43 AM, elton sky <eltonsky9...@gmail.com>
> > wrote:
> > >> I guess this question was ignored, so I am posting it again.
> > >>
> > >> From my understanding, HDFS uses a single thread to do reads and writes.
> > >> Since a file is composed of many blocks and each block is stored as a
> > >> file in the underlying FS, we could get some parallelism at the block
> > >> level. When a read spans multiple blocks, threads could be used to read
> > >> the blocks in parallel. When writing, we could calculate the offset of
> > >> each block and write to all of them simultaneously.
> > >>
> > >> Is this right?
> > >>
> > >
> > >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
