Owen O'Malley wrote:
On Mar 16, 2009, at 4:29 AM, Steve Loughran wrote:
I spoke with someone from the local university about their High Energy
Physics problems last week -their single event files are about 2GB, so
that's the only sensible block size to use when scheduling work. He'll
be at ApacheCon next week to make his use cases known.
I don't follow. Not all files need to be one block long. If your files
are 2GB, 1GB blocks should be fine; I've personally tested them when I
wanted longer maps. (The block size of a dataset is the natural size of
the input for each map.)
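Worth noting: the block size is a per-file property in HDFS, not a
cluster-wide constant, so a dataset can be written with whatever block
size suits its maps. A minimal sketch against the stock FileSystem API;
the path and replication factor here are illustrative, not from the
thread:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class LargeBlockWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/data/events/run-0001.evt"); // hypothetical path
      long blockSize = 1024L * 1024 * 1024;             // 1GB blocks for this file only
      // FileSystem.create(path, overwrite, bufferSize, replication, blockSize)
      FSDataOutputStream stream = fs.create(out, true, 4096, (short) 3, blockSize);
      // ... write the event data ...
      stream.close();
    }
  }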
Within a single 2GB event, data access is very random; you'd need all
2GB on a single machine and efficient random access within it. The
natural size for each map -and hence each block- really is 2GB.
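One way to guarantee that a whole event file lands in a single map,
regardless of block size, is to make the input format non-splittable.
A sketch against the old mapred API; the class name is hypothetical,
and TextInputFormat is used as the base only to keep the sketch
concrete (a real binary event format would need its own RecordReader):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Each input file becomes exactly one split, so a single map task
  // sees the entire 2GB event file however many blocks it spans.
  public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false; // never split: one file == one map
    }
  }

With default-sized blocks that single map would still pull most of the
file over the network, which is why a block size matching the event
size is what keeps the read local.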