On Mar 16, 2009, at 11:03 AM, Owen O'Malley wrote:

On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:

I spoke with someone from the local university about their High Energy Physics problems last week; their single event files are about 2GB, so that's the only sensible block size to use when scheduling work. He'll be at ApacheCon next week to make his use cases known.

I don't follow. Not all files need to be one block long. If your files are 2GB, 1GB blocks should be fine; I've personally tested those when I wanted longer maps. (The block size of a dataset is the natural size of the input for each map.)
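
(For reference, the block size can be chosen per file when the file is written, independent of the cluster-wide default. A rough sketch using the FileSystem.create overload that takes a block size; the destination path, buffer size, and replication factor below are just placeholders:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithOneGigBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 1024L * 1024L * 1024L;   // 1GB blocks for this file only
        FSDataOutputStream out = fs.create(
                new Path(args[0]),                         // destination file (placeholder)
                true,                                      // overwrite
                conf.getInt("io.file.buffer.size", 4096),  // buffer size
                (short) 3,                                 // replication (placeholder)
                blockSize);
        // ... stream the ~2GB event file into 'out' ...
        out.close();
    }
}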


Hm ... I work on the same project and I'm not sure I agree with this statement.

The problem is that the files contain independent event data from a particle detector (about 1-2MB per event). However, the file organization is such that it's not possible to split the file at this point (not to mention that there is quite some overhead to start up the process).
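
For what it's worth, one way to keep each file in a single map regardless of the block size is to mark the input as non-splittable. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API; it extends TextInputFormat only to keep the example self-contained, and the class name is just for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Refuse to split input files, so each map task reads a whole event file
// from the beginning, no matter how many HDFS blocks it spans.
public class NonSplittableEventInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}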

Turning the block size way up would mean that jobs could keep data access completely node-local. OTOH, this probably defeats one of the best advantages of using HDFS: block-decomposition mostly solves the "hot spot" issue. Ever seen what happens to a file system when a user submits 1000 jobs to analyze a single 2GB file? Without block-decomposition to spread the reads over 20 or so servers, with only one block per file, all the reads hit the 1-3 servers holding that block's replicas. Big difference.
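
To see the difference concretely, you can ask the namenode where the blocks of a file live. A quick sketch using FileSystem.getFileBlockLocations; the path comes from the command line:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints one line per block with the datanodes holding its replicas.
// A 2GB file stored as a single block lists only ~3 hosts; the same file
// in 64MB blocks lists ~32 blocks spread over many more datanodes.
public class ShowBlockSpread {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.getOffset() + "+" + block.getLength()
                    + " -> " + Arrays.toString(block.getHosts()));
        }
    }
}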

Brian
