They will only be a non-issue if you have enough of them to get the parallelism
you want.  If the number of gzip files is greater than 10x the number of task
nodes, you should be fine.
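
As a quick sanity check against that rule of thumb, here is a minimal sketch in
plain Java that counts the .gz files in a local staging directory before you
kick off a job.  The default directory path and node count are placeholders,
not values from anyone's actual setup:

import java.io.File;

// Minimal sketch: count .gz files in a local staging directory and compare
// against the "10x the number of task nodes" rule of thumb above.
// The default path and node count below are placeholders.
public class GzipParallelismCheck {
    public static void main(String[] args) {
        File inputDir = new File(args.length > 0 ? args[0] : "/data/staging");
        int taskNodes = args.length > 1 ? Integer.parseInt(args[1]) : 8;

        int gzCount = 0;
        File[] files = inputDir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (f.isFile() && f.getName().endsWith(".gz")) {
                    gzCount++;
                }
            }
        }

        System.out.println(gzCount + " gzip files, " + taskNodes + " task nodes");
        if (gzCount > 10 * taskNodes) {
            System.out.println("Plenty of files; map-side parallelism should be fine.");
        } else {
            System.out.println("Few files; the map phase may not use all your nodes.");
        }
    }
}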


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
 
ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?

C G, glad I could help a little.

-jason

On 8/31/07, C G <[EMAIL PROTECTED]> wrote:
> Thanks Ted and Jason for your comments.  Ted, your comments about gzip not 
> being splittable were very timely...I'm watching my 8 node cluster saturate 
> one node (with one gz file) and was wondering why.  Thanks for the "answer in 
> advance" :-).
>
> Ted Dunning <[EMAIL PROTECTED]> wrote:
> With gzipped files, you do face the problem that your parallelism in the map
> phase is pretty much limited to the number of files you have (because
> gzip'ed files aren't splittable). This is often not a problem since most
people can arrange to have dozens to hundreds of input files more easily than
> they can arrange to have dozens to hundreds of CPU cores working on their
> data.

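One common way to arrange for that many input files is to re-chunk a big gzip
file locally into many smaller gzip parts before uploading to HDFS, so each
part becomes its own map task.  Below is a minimal sketch of that, assuming
the data is newline-delimited text; the input name, output part names, and
lines-per-part setting are all placeholders:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Minimal sketch: split one large gzip file into many smaller gzip files on
// line boundaries, so a later MapReduce job gets one map task per part
// instead of a single mapper for the whole file.  Names and sizes are
// placeholders, not anyone's real configuration.
public class GzipRechunk {
    public static void main(String[] args) throws IOException {
        String input = args[0];              // e.g. big-input.gz (placeholder)
        long linesPerPart = 1000000L;        // tune for your data

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(input)), "UTF-8"));

        int part = 0;
        long lines = 0;
        Writer out = openPart(part);
        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);
            out.write('\n');
            if (++lines >= linesPerPart) {
                out.close();
                out = openPart(++part);
                lines = 0;
            }
        }
        out.close();
        in.close();
    }

    private static Writer openPart(int part) throws IOException {
        return new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream("part-" + part + ".gz")),
                "UTF-8");
    }
}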