On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter <li...@infolinks.com> wrote:
> Hi,
> If I have big gzip files (>> block size), will M/R split a single
> file into multiple blocks and send them to different mappers?
> The behavior I currently see is that a map is still opened per file
> (and not per block).
Yes, this is true. It is the current behavior with GZip files, since they cannot be split and decompressed from an arbitrary offset. I had somehow managed to ignore the GZIP part of your question in the previous thread!

Still, ~60 files totalling 15 GB works out to roughly 250 MB per file on average. And seeing how they can't really be split right now, it would be good to have each file occupy only a single block. Perhaps for these files alone you could use a block size comfortably larger than your biggest file, thereby making these file reads more local for your record readers.

In the future, HADOOP-7076 plans to add a pseudo-splitting approach for plain GZIP files. 'Concatenated' GZIP files could be split across mappers as well (HADOOP-6835), as demonstrated in PIG-42.

--
Harsh J
http://harshj.com
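[Editor's illustrative sketch, outside Hadoop: the reason concatenated GZIP is splittable while a plain stream is not can be shown with Python's standard `gzip` module. The byte strings and record contents below are made up for the demonstration.]

```python
import gzip

# Build a "concatenated" gzip file: two independently compressed members,
# as produced by e.g. `cat a.gz b.gz > both.gz`.
member1 = gzip.compress(b"record one\n")
member2 = gzip.compress(b"record two\n")
concatenated = member1 + member2

# A reader starting at offset 0 sees the whole logical stream...
assert gzip.decompress(concatenated) == b"record one\nrecord two\n"

# ...and a reader that starts exactly at the second member's boundary can
# decode it independently -- this is what would let concatenated GZIP
# files be split across mappers.
assert gzip.decompress(concatenated[len(member1):]) == b"record two\n"

# By contrast, a plain gzip stream cut at an arbitrary byte offset is not
# decodable without the preceding bytes (here the header itself is cut):
try:
    gzip.decompress(member1[5:])
except gzip.BadGzipFile:
    print("mid-stream offset is not independently decodable")
```

The catch, of course, is that a splitter must know (or find) the member boundaries; that is the part the JIRAs above deal with.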