Currently each gzip file is about 250 MB (x 60 files = 15 GB), so we use a 256 MB block size.
However, I understand that smaller files/blocks allow better M/R parallelism. So maybe having 128 MB gzip files with a corresponding 128 MB block size would be better?

On Thu, Mar 17, 2011 at 4:05 PM, Harsh J <qwertyman...@gmail.com> wrote:
> On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter <li...@infolinks.com> wrote:
> > Hi,
> > If I have big gzip files (>> block size), will M/R split a single
> > file into multiple blocks and send them to different mappers?
> > The behavior I currently see is that a map is still opened per file
> > (and not per block).
>
> Yes, this is true. This is the current behavior with GZip files (since
> they can't be split and decompressed right out). I had somehow managed
> to ignore the GZIP part of your question in the previous thread!
>
> But still, 60~ files worth 15 GB total would mean at least 3 GB per
> file. And seeing how they can't really be split out right now, it
> would be good to have them use up only a single block. Perhaps for
> these files alone you may use a block size of 3-4 GB, thereby making
> these file reads more local for your record readers?
>
> In future, HADOOP-7076 plans to add a pseudo-splitting way for plain
> GZIP files, though. 'Concatenated' GZIP files could be split
> (HADOOP-6835) across mappers as well (as demonstrated in PIG-42).
>
> --
> Harsh J
> http://harshj.com
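
For reference, the reason each .gz file goes to one mapper is that the input
format refuses to split a file once a compression codec matches its extension.
A small standalone sketch of that check (roughly what the 0.20 mapreduce
TextInputFormat does in isSplitable(); the path argument is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class SplitCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);   // e.g. hdfs:///data/gz/part-0001.gz

    // If a codec is registered for the file's extension (.gz -> GzipCodec),
    // the file is not split: a single mapper reads the whole file.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);

    if (codec == null) {
      System.out.println(input + ": no codec, will be split at block boundaries");
    } else {
      System.out.println(input + ": " + codec.getClass().getSimpleName()
          + ", read whole by a single mapper");
    }
  }
}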
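
And if you follow Harsh's suggestion of one block per .gz file, the block size
can be set per write rather than cluster-wide. A rough sketch, assuming an
0.20-era setup (old dfs.block.size property name; the paths and the 4 GB value
are only examples mirroring the 3-4 GB suggestion above). From the shell,
something like:

hadoop fs -D dfs.block.size=4294967296 -put part-0001.gz /data/gz/

or, when writing the file from code:

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PutWithBigBlock {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 4L * 1024 * 1024 * 1024;        // 4 GB block for this file only
    FSDataOutputStream out = fs.create(
        new Path(args[1]),                           // HDFS destination, e.g. /data/gz/part-0001.gz
        true,                                        // overwrite
        conf.getInt("io.file.buffer.size", 4096),    // buffer size
        fs.getDefaultReplication(),                  // keep the cluster's default replication
        blockSize);
    // copy the local .gz (args[0]) into HDFS; closes both streams when done
    IOUtils.copyBytes(new FileInputStream(args[0]), out, conf, true);
  }
}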