On Tue, May 27, 2008 at 10:49:38AM -0700, Ted Dunning wrote:
>
> There is a good tutorial on the wiki about this.
>
> Your problem here is that you have conflated two concepts. The first is the
> splitting of files into blocks for storage purposes. This has nothing to do
> with what data a program can read, any more than splitting a file into blocks
> on a disk in a conventional file system limits what you can read. The
> second splitting concept is what the input format does in order to allow
> parallelism. Basically, the file block splits have nothing to do with what
> data the mapper can read. They only determine what data will be local.
>
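(To make the distinction above concrete, here is a toy sketch — not Hadoop source, and the function name and layout are my own — of how a line-oriented reader can respect an arbitrary split boundary: each reader skips the partial first line of its split, unless it starts at offset 0, and reads past the split's end to finish the last line it started, so every line is processed exactly once regardless of where the boundaries fall.)

```python
def read_split(data: bytes, start: int, end: int):
    """Yield the complete lines 'owned' by the byte range [start, end)."""
    pos = start
    if start != 0:
        # The tail of a line begun in the previous split belongs to that
        # split's reader; skip forward to the first fresh line.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    # Emit every line whose first byte lies before 'end', reading past
    # 'end' if necessary to finish the final line.
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]
            return
        yield data[pos:nl]
        pos = nl + 1

data = b"alpha\nbravo\ncharlie\ndelta\n"
# Split the file at byte 8, in the middle of 'bravo'; the first reader
# finishes 'bravo', and the second skips to 'charlie'.
lines = list(read_split(data, 0, 8)) + list(read_split(data, 8, len(data)))
print(lines)  # [b'alpha', b'bravo', b'charlie', b'delta']
```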
When reading from HDFS, how big are the network read requests, and what
controls that?

Or, more concretely: if I store files using 64 MB blocks in HDFS and run the
simple word count example, and I get the default of one FileSplit/map task
per 64 MB block, how many bytes into the second 64 MB block will a mapper
read before it first passes a buffer up to the record reader to see if it
has found an end-of-line?

Thanks,
-Erik