Sunil, you could use identity mappers, a single identity reducer, and disable output compression.
Raj
> ________________________________
> From: Sunil S Nandihalli <sunil.nandiha...@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: Tuesday, April 24, 2012 7:01 AM
> Subject: Re: hadoop streaming and a directory containing large number of .tgz files
>
> Sorry for re-forwarding this email. I was not sure if it actually got
> through, since I just got the confirmation regarding my membership to the
> mailing list.
> Thanks,
> Sunil.
>
> On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli <
> sunil.nandiha...@gmail.com> wrote:
>
>> Hi everybody,
>> I am a newbie to Hadoop. I have about 40K .tgz files, each approximately
>> 3 MB. I would like to process them as if they were a single large file
>> formed by
>> "cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
>> How can I achieve this using hadoop-streaming or some other similar
>> library?
>>
>> Thanks,
>> Sunil.
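A minimal sketch of how the two messages fit together (names and paths below are hypothetical, not from the thread): the input to the streaming job is a text file listing the HDFS paths of the .tgz files, one per line. Each mapper reads paths from stdin, cats the archive out of HDFS, untars it to stdout, and drops each member's first line (the `sed 1d` from the original pipeline) — the extraction work means the map side is a small script rather than a pure identity mapper, while a single identity reducer with output compression disabled merges everything into one plain-text output file, as suggested in the reply.

```shell
# extract.sh -- hypothetical streaming mapper script.
# Reads .tgz paths (one per line) from stdin; for each archive, streams
# the decompressed member contents to stdout, dropping each member's
# first line. CAT defaults to "hadoop fs -cat" for cluster use; set
# CAT=cat to exercise the same pipeline against local files.
CAT=${CAT:-"hadoop fs -cat"}
while read path; do
  $CAT "$path" | tar -Oxzf - | sed 1d
done
```

The job could then be submitted with something like `hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=1 -D mapred.output.compress=false -input /user/sunil/tgz-list.txt -output /user/sunil/merged -mapper extract.sh -reducer org.apache.hadoop.mapred.lib.IdentityReducer -file extract.sh` (input and output paths hypothetical). Note that with 40K archives at ~3 MB each, packing many paths per input split keeps the mapper count manageable.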