Re: Compression using Hadoop...

2007-09-04 Thread Doug Cutting
Ted Dunning wrote: I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel. I needed to add 10 source roots in IntelliJ to get a clean compile. In this process, I noticed some circular dependencies. Would the committers be open to some small

Re: Re: Compression using Hadoop...

2007-08-31 Thread Arun C Murthy
--Original Message- >From: Stu Hood [mailto:[EMAIL PROTECTED] >Sent: Friday, August 31, 2007 2:23 PM >To: hadoop-user@lucene.apache.org >Subject: RE: Re: Compression using Hadoop... > >Isn't that what the distcp script does? > >Thanks, >Stu > > >-Original Mess

Re: Compression using Hadoop...

2007-08-31 Thread Arun C Murthy
On Fri, Aug 31, 2007 at 10:43:09AM -0700, Doug Cutting wrote: >Arun C Murthy wrote: >>One way to reap benefits of both compression and better parallelism is to >>use compressed SequenceFiles: >>http://wiki.apache.org/lucene-hadoop/SequenceFile >> >>Of course this means you will have to do a conve
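
A minimal sketch of the gzip-to-SequenceFile conversion Arun describes, assuming line-oriented text input; the class name, paths, and the LongWritable/Text key-value choice are illustrative assumptions, not code from the thread:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Re-writes a local .gz text file as a block-compressed SequenceFile on
    // HDFS, so the data stays compressed but can still be split across maps.
    public class GzipToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0] = local foo.gz, args[1] = HDFS target, e.g. /data/foo.seq
        Path out = new Path(args[1]);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK);

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(args[0]))));
        String line;
        long lineNo = 0;
        while ((line = in.readLine()) != null) {
          writer.append(new LongWritable(lineNo++), new Text(line));
        }
        in.close();
        writer.close();
      }
    }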

RE: Re: Compression using Hadoop...

2007-08-31 Thread Joydeep Sen Sarma
using Hadoop... Isn't that what the distcp script does? Thanks, Stu -Original Message- From: Joydeep Sen Sarma Sent: Friday, August 31, 2007 3:58pm To: hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... One thing I had done to speed up copy/put speeds was write a s

RE: Compression using Hadoop...

2007-08-31 Thread Ted Dunning
Subject: Re: Compression using Hadoop... On 8/31/07 10:43 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote: > > We really need someone to contribute an InputFormat for bzip files. > This has come up before: bzip is a standard compression format that is > splittable. +1 -

RE: Re: Compression using Hadoop...

2007-08-31 Thread Stu Hood
Isn't that what the distcp script does? Thanks, Stu -Original Message- From: Joydeep Sen Sarma Sent: Friday, August 31, 2007 3:58pm To: hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... One thing I had done to speed up copy/put speeds was write a simple map-r
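
distcp is itself a map-only copy job; a home-grown equivalent along the lines Joydeep mentions might look roughly like the mapper below. This is a sketch only, not his actual code: it assumes each input line names a source file reachable from the task nodes (another HDFS or a shared mount), the target directory is a placeholder, and API details vary slightly across early releases.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Each input line names one source file; every map task copies its share
    // of files into HDFS, so the copy runs in parallel across the cluster.
    public class ParallelCopyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private JobConf conf;

      public void configure(JobConf job) {
        this.conf = job;
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        Path src = new Path(value.toString());
        Path dst = new Path("/data/incoming", src.getName()); // placeholder target

        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        FileUtil.copy(srcFs, src, dstFs, dst, false, conf);

        output.collect(value, new Text("copied"));
      }
    }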

Re: Compression using Hadoop...

2007-08-31 Thread Joydeep Sen Sarma
share as well. -Original Message- From: C G [mailto:[EMAIL PROTECTED] Sent: Friday, August 31, 2007 11:21 AM To: hadoop-user@lucene.apache.org Subject: RE: Compression using Hadoop... My input is typical row-based stuff across which are run a large stack of aggregations/rollups. After re

Re: Compression using Hadoop...

2007-08-31 Thread Milind Bhandarkar
On 8/31/07 10:43 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote: > > We really need someone to contribute an InputFormat for bzip files. > This has come up before: bzip is a standard compression format that is > splittable. +1 - milind -- Milind Bhandarkar 408-349-2136 ([EMAIL PROTECTED])

RE: Compression using Hadoop...

2007-08-31 Thread Ted Dunning
-Original Message- From: C G [mailto:[EMAIL PROTECTED] Sent: Fri 8/31/2007 11:21 AM To: hadoop-user@lucene.apache.org Subject: RE: Compression using Hadoop... > Ted, from what you are saying I should be using at least 80 files given the > cluster size, and I should modify the loader to be awar

RE: Compression using Hadoop...

2007-08-31 Thread C G
From: [EMAIL PROTECTED] on behalf of jason gessner Sent: Fri 8/31/2007 9:38 AM To: hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size? C G, glad i could help a little. -jason

Re: Compression using Hadoop...

2007-08-31 Thread Doug Cutting
Arun C Murthy wrote: One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile Of course this means you will have to do a conversion from .gzip to .seq file and load it onto hdfs for your job, which
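
Once converted, the job only needs to be pointed at the SequenceFile. A minimal driver using the JobConf API of that era might look like the following; the paths and job name are placeholders, and the mapper/reducer are application-specific:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    // Minimal driver that feeds the converted, block-compressed SequenceFile
    // to a job; unlike a single .gz, it can be split across many map tasks.
    public class SeqFileJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SeqFileJob.class);
        conf.setJobName("rollups-over-seqfile");

        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setInputPath(new Path("/data/foo.seq"));   // placeholder paths
        conf.setOutputPath(new Path("/data/rollups"));

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        // conf.setMapperClass(...); conf.setReducerClass(...);  // app-specific

        JobClient.runJob(conf);
      }
    }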

Re: Compression using Hadoop...

2007-08-31 Thread Arun C Murthy
>-Original Message- >From: [EMAIL PROTECTED] on behalf of jason gessner >Sent: Fri 8/31/2007 9:38 AM >To: hadoop-user@lucene.apache.org >Subject: Re: Compression using Hadoop... > >ted, will the gzip files be a non-issue as far as splitting goes if >they are under the default b

RE: Compression using Hadoop...

2007-08-31 Thread Ted Dunning
hadoop-user@lucene.apache.org Subject: Re: Compression using Hadoop... ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size? C G, glad i could help a little. -jason On 8/31/07, C G <[EMAIL PROTECTED]> wrote: > Thanks Ted and Jason for your co

Re: Compression using Hadoop...

2007-08-31 Thread jason gessner
ted, will the gzip files be a non-issue as far as splitting goes if they are under the default block size? C G, glad i could help a little. -jason On 8/31/07, C G <[EMAIL PROTECTED]> wrote: > Thanks Ted and Jason for your comments. Ted, your comments about gzip not > being splittable was very

Re: Compression using Hadoop...

2007-08-31 Thread C G
Thanks Ted and Jason for your comments. Ted, your comments about gzip not being splittable were very timely...I'm watching my 8-node cluster saturate one node (with one gz file) and was wondering why. Thanks for the "answer in advance" :-). Ted Dunning <[EMAIL PROTECTED]> wrote: With gzipped

Re: Compression using Hadoop...

2007-08-30 Thread Ted Dunning
With gzipped files, you do face the problem that your parallelism in the map phase is pretty much limited to the number of files you have (because gzip'ed files aren't splittable). This is often not a problem since most people can arrange to have dozens to hundreds of input files easier than they
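
One way to arrange for many input files is to re-chunk a single large gzip into smaller ones before loading. The sketch below is one Hadoop-independent way to do that, not something proposed in the thread; the chunk size and naming scheme are arbitrary choices:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Splits one large .gz text file into many smaller .gz files so that a
    // job gets one map per chunk instead of one map for the whole file.
    public class GzipRechunk {
      public static void main(String[] args) throws IOException {
        String src = args[0];                          // e.g. big-input.gz
        long linesPerChunk = Long.parseLong(args[1]);  // e.g. 1000000

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(src))));

        int chunk = 0;
        long count = 0;
        PrintWriter out = openChunk(src, chunk);
        String line;
        while ((line = in.readLine()) != null) {
          if (count++ == linesPerChunk) {   // start a new chunk every N lines
            out.close();
            out = openChunk(src, ++chunk);
            count = 1;
          }
          out.println(line);
        }
        out.close();
        in.close();
      }

      private static PrintWriter openChunk(String src, int i) throws IOException {
        String name = src.replace(".gz", "") + String.format("-%04d.gz", i);
        return new PrintWriter(new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream(name))));
      }
    }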

Re: Compression using Hadoop...

2007-08-30 Thread jason gessner
if you put .gz files up on your HDFS cluster you don't need to do anything to read them. I see lots of extra control via the API, but i have simply put the files up and run my jobs on them. -jason On 8/30/07, C G <[EMAIL PROTECTED]> wrote: > Hello All: > > I think I must be missing something f
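
The "extra control" in the API largely comes down to the codec machinery that TextInputFormat already uses: a codec is chosen by file extension and wrapped around the raw stream, which is why .gz inputs need no special handling. A small sketch of reading a compressed HDFS file the same way (assuming the CompressionCodecFactory API; exact class availability in the 0.14-era releases discussed here may differ):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    // Reads a file from HDFS the way TextInputFormat does: the codec is
    // chosen by file extension (.gz -> GzipCodec), so .gz files "just work".
    public class ReadCompressed {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);  // e.g. /logs/2007-08-30.gz

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(file);  // null if uncompressed

        InputStream raw = fs.open(file);
        InputStream in = (codec == null) ? raw : codec.createInputStream(raw);

        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
        reader.close();
      }
    }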