On Nov 24, 2009, at 12:36 PM, Todd Lipcon wrote:

> On Tue, Nov 24, 2009 at 10:33 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> 
>> 
>> On Nov 24, 2009, at 12:06 PM, Todd Lipcon wrote:
>> 
>>> Also, keep in mind that, when you open a block for reading, the DN
>>> immediately starts writing the entire block (assuming it's requested via
>>> the xceiver protocol) - it's TCP backpressure on the send window that
>>> does flow control there.
>> 
>> Ok, that's a pretty freakin' cool idea.  Is it well-documented how this
>> technique works?  How does this affect folks (me) who use the pread
>> interface?
>> 
> 
> AFAIK using pread sends the actual length with the OP_READ_BLOCK command, so
> it doesn't read ahead past what you ask for. The awful thing about pread is
> that it actually makes a new datanode connection for every read - including
> the TCP handshake round trip, thread setup/teardown, etc.
> 

I'm not going to argue with the fact that we can do better here, but it's not 
as bad as you think for our particular workload.  Our random reads are "truly 
random", i.e., there are approximately zero repeated requests for the same data.  
Hence, the ~1ms of per-call overhead is pretty negligible compared to the hard 
drive seek itself (about 10ms when the cluster is idle, 30ms when we're pounding it).
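
For context, our random reads go through the positioned-read ("pread") call on 
FSDataInputStream.  Here's a minimal sketch of the two client-side paths being 
compared -- the class name, path, and offsets below are made up for illustration, 
not our actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/brian/events.dat");  // placeholder path
        byte[] buf = new byte[64 * 1024];

        FSDataInputStream in = fs.open(file);
        try {
            // Positioned read ("pread"): requests exactly this byte range,
            // but sets up a fresh datanode connection per call -- the ~1ms
            // of overhead discussed above, versus the 10-30ms seek itself.
            in.readFully(1234567L, buf, 0, buf.length);

            // Streaming read: seek + read on the stream reuses the existing
            // datanode connection; the datanode keeps pushing the rest of
            // the block until TCP backpressure on the send window stops it.
            in.seek(9876543L);
            in.readFully(buf, 0, buf.length);
        } finally {
            in.close();
        }
    }
}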

In future versions of our software, we've made things at least "monotonically 
increasing": with a few exceptions, every read position is strictly greater 
than the position of the last read.  (That doesn't mean we can read the file 
out sequentially; our reads can be quite sparse, touching only about 10% of the 
file.  If we read sequentially, we'd overread by a factor of 10, and that 
starts to hit network limitations.)
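
Roughly, that access pattern looks like the following.  This is illustrative 
only -- the offsets come from our own indices (not shown), and the class here 
is a sketch rather than our actual code:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

/** Sparse, (mostly) monotonically increasing positioned reads. */
public class SparseForwardReader {
    private final FSDataInputStream in;
    private long lastPos = -1;

    public SparseForwardReader(FSDataInputStream in) {
        this.in = in;
    }

    // Read one chunk.  Offsets are (with a few exceptions) strictly
    // increasing, but consecutive chunks can be far apart: we only touch
    // ~10% of the file, so a plain sequential scan would pull ~10x the
    // bytes we actually need over the network.
    public void readChunk(long offset, byte[] buf, int len) throws IOException {
        assert offset > lastPos : "reads should be monotonically increasing";
        in.readFully(offset, buf, 0, len);  // pread: fetch only [offset, offset+len)
        lastPos = offset;
    }
}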

At some point, I need to do a talk or write-up on the column-oriented 
techniques that HEP folks use; after all, they've been doing column-oriented 
stores for the past 20 years or so.  They have some tricks up their sleeves, 
and it would be interesting to compare notes.

Brian

> 
>> 
>>> So, although it's not explicitly reading ahead, most of the
>>> reads on DFSInputStream should be coming from the TCP receive buffer, not
>>> making round trips.
>>> 
>>> At one point a few weeks ago I did hack explicit readahead around
>>> DFSInputStream and didn't see an appreciable difference. I didn't spend
>>> much time on it, though, so I may have screwed something up - it wasn't
>>> a scientific test.
>>> 
>> 
>> Speaking as someone who's worked with storage systems that do explicit
>> readahead, this can turn out to be a big giant disaster if it's combined
>> with random reads.
>> 
>> Big disaster as far as application-level throughput goes - but does make
>> for impressive ganglia graphs!
>> 
>> Brian
>> 
>>> -Todd
>>> 
>>> On Tue, Nov 24, 2009 at 10:02 AM, Eli Collins <e...@cloudera.com> wrote:
>>> 
>>>> Hey Martin,
>>>> 
>>>> It would be an interesting experiment, but I'm not sure it would
>>>> improve things as the host (and, to some extent, the hardware) is
>>>> already reading ahead. A useful exercise would be to evaluate whether
>>>> the new default host parameters for on-demand readahead are suitable
>>>> for Hadoop.
>>>> 
>>>> http://lwn.net/Articles/235164
>>>> http://lwn.net/Articles/235181
>>>> 
>>>> Thanks,
>>>> Eli
>>>> 
>>>> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xietao1...@hotmail.com> wrote:
>>>>> 
>>>>> I read the code and find that the call
>>>>> DFSInputStream.read(buf, off, len)
>>>>> will cause the DataNode to read len bytes (or less if it encounters the
>>>>> end of the block).  Why doesn't HDFS read ahead to improve performance
>>>>> for sequential reads?
>>>>> --
>>>>> View this message in context:
>>>>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>>> 
>>>>> 
>>>> 
>> 
>> 
