On Mon, Mar 16, 2009 at 9:36 AM, Steve Loughran <[email protected]> wrote:

> Owen O'Malley wrote:
>
>> On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:
>>
>>> I spoke with someone from the local university about their High Energy
>>> Physics problems last week - their single event files are about 2GB, so
>>> that's the only sensible block size to use when scheduling work. He'll be
>>> at ApacheCon next week to make his use cases known.
>>>
>>
>> I don't follow. Not all files need to be 1 block long. If your files are
>> 2GB, 1GB blocks should be fine and I've personally tested those when I've
>> wanted to have longer maps. (The block size of a dataset is the natural size
>> of the input for each map.)
>>
>
> Within a single 2GB event, data access is very random; you'd need all 2GB
> on a single machine and efficient random access within it. The natural size
> for each map - and hence block - really is 2GB.
>

To me, this suggests that the map function should copy the file to a known
location on local storage in order to get good random-access performance.  In
fact, it suggests that the single event should be in RAM.
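
Something along the lines of the rough sketch below is what I have in mind
(old 0.19-style API, not tested; it assumes the job's input records are simply
HDFS paths to event files, one per line, and EventMapper plus the decoding
step are made up for illustration): pull the whole event to the task's local
working directory and do the random access there, letting the OS page cache
keep the hot regions in memory.

// Rough sketch, untested - assumes one HDFS path per input record;
// the event decoding here is a placeholder.
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class EventMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    Path src = new Path(value.toString());      // HDFS path of one 2GB event file
    File local = new File(src.getName());       // lands in the task's working dir
    fs.copyToLocalFile(src, new Path(local.getAbsolutePath()));

    RandomAccessFile raf = new RandomAccessFile(local, "r");
    try {
      // random access within the event; after the copy the OS page
      // cache keeps the hot regions in RAM
      raf.seek(0);
      long header = raf.readLong();             // stand-in for the real decoding
      out.collect(value, new LongWritable(header));
    } finally {
      raf.close();
    }
  }
}

At these sizes a single byte[] or MappedByteBuffer is right at the 2GB limit
anyway, which is another argument for leaving the copy on local disk and
leaning on the page cache rather than trying to pull the whole event into the
heap.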

If the whole event gets pulled to local disk (or into memory) anyway, the
block size becomes almost irrelevant, especially if a somewhat smaller block
size allows better average locality.
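
For what it's worth, if the dataset does end up wanting a per-file block size
(the 1GB Owen has tested, or something smaller for locality), it can be set
when each file is written rather than cluster-wide. A rough, untested sketch -
the sizes here are just examples:

// Rough sketch: write each event file with an explicit block size so the
// scheduler sees one (or a few) blocks per event.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteEvent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 1024L * 1024 * 1024;            // 1GB blocks for a ~2GB event
    short replication = fs.getDefaultReplication();
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    FSDataOutputStream out = fs.create(new Path(args[0]),
        true /* overwrite */, bufferSize, replication, blockSize);
    // ... stream the event bytes here ...
    out.close();
  }
}

The cluster-wide default (dfs.block.size, if I remember the property name
right) would do the same job, but setting it per file keeps the rest of the
cluster on the default.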
