On Jul 2, 2010, at 8:24 AM, Jay Booth <jaybo...@gmail.com> wrote:

Yeah, a good way to think of it is that parallelism is achieved at the
application level.

On the input side, you can process multiple files in parallel or one
file in parallel by logically splitting and opening multiple readers
of the same file at multiple points.  Each of these readers is single
threaded, because, well, you're returning a stream of bytes in order.
It's inherently serial.
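
For example, here's a rough sketch of that pattern (not the framework's actual code; the path, split size and thread handling are made up for illustration). Each thread opens its own reader on the same file and seeks to the start of its split, and within a split the read is still a plain serial stream of bytes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelSplitRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/input.txt");   // hypothetical input file
    long fileLen = fs.getFileStatus(file).getLen();
    long splitSize = 64L * 1024 * 1024;        // e.g. one 64 MB block per split

    for (long start = 0; start < fileLen; start += splitSize) {
      final long offset = start;
      final long length = Math.min(splitSize, fileLen - start);
      new Thread(() -> {
        // One reader per split; each reader is single-threaded and serial.
        try (FSDataInputStream in = fs.open(file)) {
          in.seek(offset);                     // jump to this split's start
          byte[] buf = new byte[8192];
          long remaining = length;
          while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) break;
            remaining -= n;                    // process buf[0..n) here
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }).start();
    }
  }
}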

On the reduce side, multiple reduce tasks run, writing to multiple files in
the same directory.  Again, you can't really write to a single file in
parallel effectively -- you can't write byte 26 before byte 25,
because the file's not that long yet.
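
Concretely, the pattern is one output file per reduce task, something like this sketch (the directory and part-file names below are just illustrative, not what the framework generates internally):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerReducerOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path outDir = new Path("/user/demo/output");   // hypothetical output directory
    for (int task = 0; task < 2; task++) {
      final int id = task;
      new Thread(() -> {
        // Each "reducer" owns its own file, so the writes never interleave.
        Path part = new Path(outDir, String.format("part-r-%05d", id));
        try (FSDataOutputStream out = fs.create(part)) {
          out.writeBytes("output of reduce task " + id + "\n");
        } catch (Exception e) {
          e.printStackTrace();
        }
      }).start();
    }
  }
}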

Theoretically, maybe you could have all reduces write to the same file
by allocating some amount of space ahead of time and writing to the
blocks in parallel - in practice, you very rarely know how big your
output is going to be before it's produced, so this doesn't really
work.  Multiple files in the same directory achieves the same goal
much more elegantly, without exposing a bunch of internal details of
the filesystem to user space.

Does that make sense?



On Fri, Jul 2, 2010 at 9:26 AM, Segel, Mike <mse...@navteq.com> wrote:
Actually they also listen here and this is a basic question...

I'm not an expert, but how does having multiple threads really help this problem?

I'm assuming you're talking about a map/reduce job and not some specific client code which is being run on a client outside of the cloud/cluster....

I wasn't aware that you could easily synchronize threads running on different JVMs. ;-)

Your parallelism comes from multiple tasks running on different nodes within the cloud. By default you get one map task per block (one task per input split, and splits default to block boundaries). You can write your own InputFormat/splitter, or shrink the split size, to increase this and get more parallelism.
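
For example, something along these lines (a rough sketch against the newer mapreduce API; the input path is made up, and on older releases you'd use new Job(conf, name) instead of Job.getInstance):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MoreSplits {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "more-splits");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/input.txt"));  // hypothetical path
    // Cap each split at 32 MB, so e.g. a 128 MB block becomes ~4 map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
    // Mapper/reducer classes, output path etc. omitted; this only shows the split knob.
  }
}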

HTH

-Mike


-----Original Message-----
From: Hemanth Yamijala [mailto:yhema...@gmail.com]
Sent: Friday, July 02, 2010 2:56 AM
To: general@hadoop.apache.org
Subject: Re: Why single thread for HDFS?

Hi,

Can you please post this on hdfs-...@hadoop.apache.org ? I suspect the
most qualified people to answer this question would all be on that
list.

Hemanth

On Fri, Jul 2, 2010 at 11:43 AM, elton sky <eltonsky9...@gmail.com> wrote:
I guess this question was ignored, so I'm just posting it again.

From my understanding, HDFS uses a single thread to do reads and writes. Since a file is composed of many blocks, and each block is stored as a file
in the underlying FS, we could get some parallelism at the block level.
When reading across multiple blocks, threads could be used to read all of the blocks at once. When writing, we could calculate the offset of each block and write to all of them
simultaneously.
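
Roughly the kind of thing I have in mind for the read side (just a sketch to illustrate the question; the path is made up, and I'm using positioned reads so the threads don't have to fight over one file pointer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerBlockReads {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/input.txt");           // hypothetical file
    long blockSize = fs.getFileStatus(file).getBlockSize();
    long fileLen = fs.getFileStatus(file).getLen();
    FSDataInputStream in = fs.open(file);               // shared stream; left open for brevity

    for (long pos = 0; pos < fileLen; pos += blockSize) {
      final long p = pos;
      final int len = (int) Math.min(blockSize, fileLen - pos);
      new Thread(() -> {
        byte[] buf = new byte[len];
        try {
          // Positioned read: supplies its own offset, so it doesn't move
          // the stream's main file pointer.
          in.readFully(p, buf, 0, len);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }).start();
    }
  }
}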

Is this right?


