Ted Dunning wrote:
I have to say, btw, that the source tree structure of this project is pretty
ornate and not very parallel. I needed to add 10 source roots in IntelliJ to
get a clean compile. In this process, I noticed some circular dependencies.
Would the committers be open to some small
Isn't that what the distcp script does?
Thanks,
Stu
-Original Message-
From: Joydeep Sen Sarma
Sent: Friday, August 31, 2007 3:58pm
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
One thing I had done to speed up copy/put speeds was write a simple
map-r
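Joydeep's message is cut off above, but the pattern he describes -- one
map task per source file, so the copy runs in parallel across the
cluster -- is essentially what distcp automates. A bare-bones sketch of
that pattern against the old org.apache.hadoop.mapred API (the property
name copy.dest.dir and the class name CopyMapper are made up for
illustration, and exact method signatures vary a little across early
Hadoop releases):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Input is a text file listing one source URI per line; each map()
    // call copies one file, so the overall copy is spread across the
    // cluster instead of funneling through a single "hadoop fs -put".
    public class CopyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private JobConf job;
      private Path destDir;

      public void configure(JobConf job) {
        this.job = job;
        // "copy.dest.dir" is a made-up property name for this sketch.
        this.destDir = new Path(job.get("copy.dest.dir", "/data/incoming"));
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        Path src = new Path(value.toString().trim());
        FileSystem srcFs = src.getFileSystem(job);
        FileSystem dstFs = destDir.getFileSystem(job);
        FileUtil.copy(srcFs, src, dstFs, new Path(destDir, src.getName()),
                      false, job);
        out.collect(value, new Text("copied"));
        reporter.progress();   // keep long copies from being timed out
      }
    }

Run it with zero reduces over a text file listing the sources. In
practice distcp already packages this up (plus retries and error
handling), which is presumably Stu's point.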
-Original Message-
From: C G [mailto:[EMAIL PROTECTED]
Sent: Friday, August 31, 2007 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Compression using Hadoop...
My input is typical row-based stuff across which are run a large stack
of aggregations/rollups. After re
On 8/31/07 10:43 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
>
> We really need someone to contribute an InputFormat for bzip files.
> This has come up before: bzip is a standard compression format that is
> splittable.
+1
- milind
--
Milind Bhandarkar
408-349-2136
([EMAIL PROTECTED])
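For anyone curious where splittability is decided: in the old
org.apache.hadoop.mapred API the hook is FileInputFormat.isSplitable().
A contributed bzip InputFormat would override that hook and, crucially,
supply a RecordReader that can seek to a bzip2 block boundary and start
decompressing mid-file. Only the trivial part is sketched below; the
class name BzipTextInputFormat is made up here:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Sketch only: marks .bz2 files as splittable.  The hard part -- a
    // RecordReader that can begin reading at an arbitrary bzip2 block
    // boundary -- is exactly what Doug is asking someone to contribute.
    public class BzipTextInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return file.getName().endsWith(".bz2");
      }
    }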
-----Original Message-----
From: C G [mailto:[EMAIL PROTECTED]
Sent: Fri 8/31/2007 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Compression using Hadoop...
> Ted, from what you are saying I should be using at least 80 files given the
> cluster size, and I should modify the loader to be awar
From: [EMAIL PROTECTED] on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Compression using Hadoop...
ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?
C G, glad i could help a little.
-jason
Arun C Murthy wrote:
One way to reap benefits of both compression and better parallelism is to use
compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile
Of course this means you will have to do a conversion from .gzip to .seq file
and load it onto hdfs for your job, which
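Arun's note is truncated above, but the conversion he describes can be
done with a small standalone program. A minimal sketch, assuming the
SequenceFile API of the 0.1x releases (the class name GzipToSeqFile and
the argument layout are just for illustration): read the local .gz text
file and write a block-compressed SequenceFile to HDFS.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class GzipToSeqFile {
      public static void main(String[] args) throws Exception {
        // args[0] = local .gz text file, args[1] = target path on HDFS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(args[0]))));

        // BLOCK compression groups many records per compressed block,
        // which keeps the file splittable while still compressing well.
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            new Path(args[1]), LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());

        String line;
        long lineNo = 0;
        try {
          while ((line = in.readLine()) != null) {
            writer.append(new LongWritable(lineNo++), new Text(line));
          }
        } finally {
          writer.close();
          in.close();
        }
      }
    }

A job that reads the result with SequenceFileInputFormat then gets
splits across the whole file instead of one map per .gz.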
Thanks Ted and Jason for your comments. Ted, your comments about gzip not
being splittable were very timely...I'm watching my 8 node cluster saturate one
node (with one gz file) and was wondering why. Thanks for the "answer in
advance" :-).
Ted Dunning <[EMAIL PROTECTED]> wrote:
With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzip'ed files aren't splittable). This is often not a problem since most
people can arrange to have dozens to hundreds of input files easier than
they
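Ted's message is also cut off, but the point is clear, and it dovetails
with Jason's note below: a job can point straight at a directory of .gz
files and TextInputFormat will decompress them on the fly, yet each
gzipped file still becomes exactly one map task. A minimal driver
sketch against the old org.apache.hadoop.mapred API (method names here
match later 0.x releases, so adjust for your version; the class name
GzipInputJob is just for illustration):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class GzipInputJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GzipInputJob.class);
        conf.setJobName("read-gzipped-input");

        // TextInputFormat matches the .gz suffix against the registered
        // codecs and decompresses transparently; gzipped files are not
        // split, so map parallelism == number of input files.
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
      }
    }

That one-map-per-file behaviour is exactly why C G's single .gz file
kept only one of his eight nodes busy.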
if you put .gz files up on your HDFS cluster you don't need to do
anything to read them. I see lots of extra control via the API, but i
have simply put the files up and run my jobs on them.
-jason
On 8/30/07, C G <[EMAIL PROTECTED]> wrote:
> Hello All:
>
> I think I must be missing something f