On Mar 16, 2009, at 11:03 AM, Owen O'Malley wrote:
On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:
I spoke with someone from the local university about their High Energy
Physics problems last week - their single event files are about 2GB,
so that's the only sensible block size to use when scheduling work.
He'll be at ApacheCon next week to make his use cases known.
I don't follow. Not all files need to be one block long. If your files
are 2GB, 1GB blocks should be fine, and I've personally tested those
when I've wanted to have longer maps. (The block size of a dataset
is the natural size of the input for each map.)
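A minimal sketch of setting a per-file 1GB block size through the
FileSystem API (the path and sizes here are made up for illustration;
this shows one way to do it, not necessarily how Owen ran his tests):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteWithLargeBlocks {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      long oneGB = 1024L * 1024 * 1024;      // per-file block size
      short replication = 3;                 // assumed default replication
      int bufferSize = conf.getInt("io.file.buffer.size", 4096);

      // The block size is fixed per file at creation time, so it can be
      // chosen to match the natural unit of work for each map.
      FSDataOutputStream out = fs.create(
          new Path("/hep/events/run-0001.dat"),   // hypothetical path
          true, bufferSize, replication, oneGB);
      try {
        // ... write event data ...
      } finally {
        out.close();
      }
    }
  }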
Hm ... I work on the same project and I'm not sure I agree with this
statement.
The problem is that the files contain independent event data from a
particle detector (about 1-2 MB per event). However, the file
organization is such that it's not possible to split the file at this
point (not to mention that starting up the process carries quite some
overhead).
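As a sketch of the usual workaround for unsplittable files (assumed
here, not necessarily what this project does), an input format can
refuse to split its input by overriding isSplitable(), so each map
gets one whole file; TextInputFormat is only a stand-in base class,
since real event data would need its own RecordReader:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Hand each event file to a single map task in one piece, because the
  // file format cannot be split at arbitrary byte offsets.
  public class EventFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }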
Turning the block size way up would mean that any job could keep data
access completely node-local. OTOH, this probably defeats one of the
best advantages of using HDFS: block decomposition mostly solves the
"hot spot" issue. Ever seen what happens to a file system when a user
submits 1000 jobs to analyze a single 2GB file? Without block
decomposition to spread the reads over 20 or so servers, with only one
block per file, all of the reads hit just 1-3 servers. Big difference.
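To put rough numbers on it (assuming the 64MB default block size and
3x replication, neither of which is spelled out above):

  // Back-of-envelope comparison of how far reads of one 2GB file can be
  // spread, with and without block decomposition.
  public class ReadSpread {
    public static void main(String[] args) {
      long fileSize = 2L * 1024 * 1024 * 1024;   // 2GB event file
      long blockSize = 64L * 1024 * 1024;        // assumed dfs.block.size
      short replication = 3;                     // assumed replication

      long blocks = (fileSize + blockSize - 1) / blockSize;   // 32 blocks
      // With 64MB blocks, 1000 reading jobs can be scattered over up to
      // blocks * replication distinct datanodes (capped by cluster size).
      System.out.println("64MB blocks: up to " + (blocks * replication)
          + " datanodes can serve the reads");
      // With a single 2GB block, every job reads from the same replicas.
      System.out.println("one 2GB block: only " + replication
          + " datanodes serve all the reads");
    }
  }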
Brian