I went ahead and created a JIRA, HADOOP-5552, with a unit test that demonstrates this bug and a first version of a patch. I suspect the patch needs some more work; if somebody wants to extend it to make the unit test pass, that would be awesome.
thanks,
dhruba

http://issues.apache.org/jira/browse/HADOOP-5552

On Mon, Mar 16, 2009 at 10:55 AM, Brian Bockelman <[email protected]> wrote:

> On Mar 16, 2009, at 11:03 AM, Owen O'Malley wrote:
>
>> On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:
>>
>>> I spoke with someone from the local university about their High Energy
>>> Physics problems last week; their single event files are about 2GB, so
>>> that's the only sensible block size to use when scheduling work. He'll
>>> be at ApacheCon next week to make his use cases known.
>>
>> I don't follow. Not all files need to be one block long. If your files
>> are 2GB, 1GB blocks should be fine, and I've personally tested those when
>> I've wanted to have longer maps. (The block size of a dataset is the
>> natural size of the input for each map.)
>
> Hmm... I work on the same project, and I'm not sure I agree with this
> statement.
>
> The problem is that the files contain independent event data from a
> particle detector (about 1-2 MB per event). However, the file organization
> is such that it is not possible to split the file at this point (not to
> mention the considerable overhead of starting up the process).
>
> Turning the block size way up would mean that any job could keep data
> access completely node-local. On the other hand, this probably defeats one
> of the biggest advantages of using HDFS: block decomposition mostly solves
> the "hot spot" issue. Ever seen what happens to a file system when a user
> submits 1000 jobs to analyze a single 2GB file? With block decomposition
> the reads are spread over 20 or so servers; with only one block per file,
> they hit 1-3 servers. Big difference.
>
> Brian
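
For reference, the block size does not have to be a cluster-wide setting: it can be set
per job via dfs.block.size, or per file through the FileSystem.create() overload that
takes an explicit blockSize. A minimal sketch along those lines (the path, sizes, and
class name below are placeholders for illustration, not part of the patch):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // A job-wide default can be set via dfs.block.size, e.g. 1 GB blocks.
        conf.setLong("dfs.block.size", 1024L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // ...or the block size can be overridden per file at create() time.
        // Path and sizes are placeholders; 1 GB blocks for a 2 GB event file
        // matches Owen's suggestion above.
        Path out = new Path("/user/demo/events.dat");
        long blockSize = 1024L * 1024 * 1024;  // 1 GB
        int bufferSize = 4096;
        short replication = 3;
        FSDataOutputStream stream =
            fs.create(out, true, bufferSize, replication, blockSize);
        try {
          // ... write the event data ...
        } finally {
          stream.close();
        }
      }
    }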
