They will only be a non-issue if you have enough of them to get the parallelism you want. As a rule of thumb, if the number of gzip files is greater than 10x the number of task nodes, you should be fine; on an 8-node cluster like C G's, that means 80 or more files.
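The reason file size doesn't matter here is the input format's split check. In TextInputFormat it looks roughly like this (paraphrased; exact names and signatures vary by Hadoop version):

// Paraphrase of TextInputFormat's split decision: any file that a
// compression codec claims (e.g. GzipCodec for *.gz) is never split,
// so the whole file feeds a single map task regardless of its size.
protected boolean isSplitable(FileSystem fs, Path file) {
  return compressionCodecs.getCodec(file) == null;
}

So a .gz file under the block size is one split either way; the only thing you lose with too few files is parallelism.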
-----Original Message-----
From: [EMAIL PROTECTED] on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...

ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad i could help a little.

-jason

On 8/31/07, C G <[EMAIL PROTECTED]> wrote:
> Thanks Ted and Jason for your comments. Ted, your comment about gzip
> not being splittable was very timely...I'm watching my 8-node cluster
> saturate one node (with one gz file) and was wondering why. Thanks for
> the "answer in advance" :-).
>
> Ted Dunning <[EMAIL PROTECTED]> wrote:
> With gzipped files, you do face the problem that your parallelism in
> the map phase is pretty much limited to the number of files you have
> (because gzip'ed files aren't splittable). This is often not a problem,
> since most people can arrange to have dozens to hundreds of input files
> more easily than they can arrange to have dozens to hundreds of CPU
> cores working on their data.
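>
> One way to arrange that many files is to re-chop the input before you
> load it. A rough sketch against the FileSystem and GzipCodec APIs (the
> class and its arguments here are made up for illustration):
>
> import java.io.*;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.*;
> import org.apache.hadoop.io.compress.GzipCodec;
> import org.apache.hadoop.util.ReflectionUtils;
>
> // Round-robins lines from one big input file into numParts gzip'ed
> // part files so the map phase can run numParts tasks in parallel.
> public class Resplit {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     int numParts = Integer.parseInt(args[2]);  // e.g. 10 * task nodes
>
>     GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
>     BufferedReader in = new BufferedReader(
>         new InputStreamReader(fs.open(new Path(args[0]))));
>
>     // One gzip'ed output stream per part file.
>     Writer[] out = new Writer[numParts];
>     for (int i = 0; i < numParts; i++) {
>       Path part = new Path(args[1], "part-" + i + ".gz");
>       out[i] = new OutputStreamWriter(codec.createOutputStream(fs.create(part)));
>     }
>
>     String line;
>     for (long n = 0; (line = in.readLine()) != null; n++) {
>       out[(int) (n % numParts)].write(line + "\n");
>     }
>
>     in.close();
>     for (Writer w : out) w.close();
>   }
> }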