I went ahead and filed JIRA HADOOP-5552, along with a unit test that
demonstrates this bug and a first version of a patch. I suspect the patch
needs some more work; if somebody wants to extend it so that the unit test
passes, that would be awesome.

thanks,
dhruba

http://issues.apache.org/jira/browse/HADOOP-5552

On Mon, Mar 16, 2009 at 10:55 AM, Brian Bockelman <[email protected]> wrote:

>
> On Mar 16, 2009, at 11:03 AM, Owen O'Malley wrote:
>
>> On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:
>>
>>> I spoke with someone from the local university about their High Energy
>>> Physics problems last week; their single event files are about 2GB, so
>>> that's the only sensible block size to use when scheduling work. He'll be
>>> at ApacheCon next week to make his use cases known.
>>>
>>
>> I don't follow. Not all files need to be one block long. If your files
>> are 2GB, 1GB blocks should be fine; I've personally tested those when I
>> wanted longer maps. (The block size of a dataset is the natural size of
>> the input for each map.)
>>
>>
> Hm ... I work on the same project and I'm not sure I agree with this
> statement.
>
> The problem is that the files contain independent event data from a
> particle detector (about 1-2MB per event).  However, the file organization
> is such that it's not possible to split the file at this point (not to
> mention that there is quite some overhead to start up the process).
>
> Turning the block size way up would mean that any job could keep data
> access completely node-local.  OTOH, this probably defeats one of the best
> advantages of using HDFS: block decomposition mostly solves the "hot spot"
> issue.  Ever seen what happens to a file system when a user submits 1000
> jobs to analyze a single 2GB file?  Without block decomposition to spread
> the reads over 20 or so servers, a file with only one block means every
> read hits the same 1-3 servers holding that block's replicas.  Big
> difference.
>
> Brian
>
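
To make Owen's point above concrete: the block size is a per-file setting,
chosen when the file is written, not a cluster-wide constant. A rough sketch
against the FileSystem API (the path and sizes below are made up for
illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BigBlockWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Block size is passed at create time, so each dataset can pick its own.
      long blockSize = 1024L * 1024 * 1024;   // 1GB blocks for this file only
      short replication = 3;                  // HDFS default replication
      int bufferSize = conf.getInt("io.file.buffer.size", 4096);

      FSDataOutputStream out = fs.create(
          new Path("/user/hep/events/run42.evt"),  // hypothetical path
          true, bufferSize, replication, blockSize);
      try {
        // ... write the 2GB event file here ...
      } finally {
        out.close();
      }
    }
  }

With 1GB blocks a 2GB file still has two blocks, so at replication 3 its
reads can be served by up to six datanodes, versus at most three for a
single 2GB block.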
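
And on Brian's point that the event files can't be split: the usual way to
tell MapReduce that is an InputFormat whose isSplitable() returns false, so
each file goes to a single map regardless of how many blocks it spans. A
sketch only, using the old org.apache.hadoop.mapred API (the class name is
made up, and a real HEP job would plug in its own RecordReader; only the
splitting knob matters here):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Hands each file to one map task intact. With very large blocks that map
  // can stay node-local, at the cost of the hot-spot effect described above.
  public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;   // never split, no matter the block size
    }
  }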
