Not in the case of .gz files. Since gzip files are not split, a single mapper reads the whole file: it would possibly read the first 128 MB locally from a resident DN, and could then have to read the remaining 128 MB over the network from another DN if the next block does not reside on the same DN as well, thereby introducing a network read cost.
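To put rough numbers on that, here is a minimal sketch in plain Python (not Hadoop code). The function name `unsplit_read_plan` is made up for illustration, and it assumes the worst case where only the first block is local to the mapper's node; real placement depends on where the NameNode put the replicas.

```python
MB = 1024 * 1024


def unsplit_read_plan(file_size, block_size):
    """Estimate what one mapper reads for a non-splittable (e.g. gzip) file.

    Returns (num_blocks, local_bytes, possibly_remote_bytes).
    Assumption (hypothetical worst case): only the first block is on the
    mapper's own DataNode; every later block may come over the network.
    """
    if file_size <= 0 or block_size <= 0:
        raise ValueError("file_size and block_size must be positive")
    # Ceiling division: a partial last block still counts as a block.
    num_blocks = -(-file_size // block_size)
    # The first block holds at most block_size bytes of the file.
    local_bytes = min(file_size, block_size)
    possibly_remote_bytes = file_size - local_bytes
    return num_blocks, local_bytes, possibly_remote_bytes


# Compare the block sizes discussed in this thread.
for file_mb, block_mb in [(256, 128), (250, 256), (128, 128)]:
    blocks, local, remote = unsplit_read_plan(file_mb * MB, block_mb * MB)
    print(f"{file_mb} MB gzip, {block_mb} MB blocks: {blocks} block(s), "
          f"{local // MB} MB local, up to {remote // MB} MB remote")
```

With the block size at or above the gzip file size, the file fits in one block and the possibly-remote portion drops to zero, which is the point of matching the two.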

On Thu, Mar 17, 2011 at 8:44 PM, Lior Schachter <li...@infolinks.com> wrote:
> yes. but with 128M gzip files/block size the M/R will work better ? no ?
>
> anyhow, thanks for the useful information.
>
> On Thu, Mar 17, 2011 at 5:07 PM, Harsh J <qwertyman...@gmail.com> wrote:
>>
>> On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter <li...@infolinks.com> wrote:
>> > Currently each gzip file is about 250MB (*60files=15G) so we have 256M
>> > blocks.
>>
>> Darn, I ought to sleep a bit more. I did a file/gb and read it as gb/file
>> mehh..
>>
>> > However I understand that in order to utilize better M/R parallel
>> > processing smaller files/blocks are better.
>>
>> Yes this is true in case of text/sequence files.
>>
>> > So maybe having 128M gzip files with corresponding 128M block size
>> > would be better?
>>
>> Why not 256 for all your ~250MB _gzip_ files, making it nearly one
>> block since they would not be split anyways?
>>
>> --
>> Harsh J
>> http://harshj.com
>
>

--
Harsh J
http://harshj.com