심탁길 wrote:
I need to handle a 4GB gzip-style file. I thought I could map-reduce even such a large gzip file in parallel. In reality, we have to deal with gzip log files larger than the default block size (64MB), and in that situation, full-scanning and processing a large log file on a single commodity machine is not desirable. Is there any idea for solving this kind of issue?
A gzip file with a single member must be processed by a single thread, since decompression must begin at the start of the file. A gzip file with multiple members can be split, provided the boundaries between members can be identified, either with an index or by a magic string indicating the start of each member.
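For what it's worth, here is a minimal sketch of the multi-member idea in Java (the file name and helper method are hypothetical). java.util.zip.GZIPOutputStream's finish() ends the current member without closing the underlying stream, so concatenated members land in one file. Each member begins with the gzip magic bytes 0x1f 0x8b, but those bytes can also occur inside compressed data, so recording member offsets in an index at write time is more reliable than scanning for the magic string afterwards.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class MultiMemberGzip {
    public static void main(String[] args) throws IOException {
        // Write two independent gzip members into one file.  gunzip
        // decompresses the concatenation transparently, but each member
        // can also be decompressed on its own once its offset is known.
        try (OutputStream out = new FileOutputStream("multi.gz")) {
            writeMember(out, "first member\n");
            writeMember(out, "second member\n");
        }
    }

    private static void writeMember(OutputStream out, String text)
            throws IOException {
        GZIPOutputStream gz = new GZIPOutputStream(out);
        gz.write(text.getBytes("UTF-8"));
        // finish() writes this member's trailer without closing the
        // shared underlying stream, so the next member can follow it.
        gz.finish();
    }
}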
Can you instead produce smaller (e.g., 100MB) gzipped inputs?
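If re-compressing the logs is an option, a sketch of that approach (the chunk threshold and file naming are illustrative, and it counts uncompressed bytes for simplicity, so the compressed chunks come out smaller than the threshold): each chunk is a complete, independent gzip file that splits on a record boundary and can serve as its own map input.

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class ChunkedGzipWriter {
    // Rotate after ~100MB of uncompressed log data per chunk.
    private static final long CHUNK_BYTES = 100L * 1024 * 1024;

    public static void main(String[] args) throws IOException {
        int part = 0;
        long written = 0;
        GZIPOutputStream out = open(part);
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                byte[] bytes = (line + "\n").getBytes("UTF-8");
                // Start a fresh .gz file once the current chunk is full,
                // always splitting between records, never inside one.
                if (written + bytes.length > CHUNK_BYTES && written > 0) {
                    out.close();
                    out = open(++part);
                    written = 0;
                }
                out.write(bytes);
                written += bytes.length;
            }
        }
        out.close();
    }

    private static GZIPOutputStream open(int part) throws IOException {
        return new GZIPOutputStream(
                new FileOutputStream(String.format("log-part-%05d.gz", part)));
    }
}

Doug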