On Mon, Feb 15, 2010 at 8:07 AM, Steve Kuo <kuosen...@gmail.com> wrote:
> On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> By the way, if all files have been indexed, DistributedLzoIndexer does
> not detect that and Hadoop throws an exception complaining that the
> input dir (or file) does not exist. I work around this by catching the
> exception.

Just fixed that in my github repo. Thanks for the bug report.

>> > - It's possible to sacrifice parallelism by having Hadoop work on
>> > each .lzo file without indexing. This worked well until the file
>> > size exceeded 30G, when an array-indexing exception was thrown.
>> > Apparently the code processed the file in chunks and stored
>> > references to the chunks in an array; when the number of chunks
>> > exceeded a certain limit (around 256, as I recall), the exception
>> > was thrown.
>> > - My current workaround is to increase the number of reducers to
>> > keep the .lzo file sizes low.
>> >
>> > I would like advice on how people handle large .lzo files. Any
>> > pointers on the cause of the stack trace below and the best way to
>> > resolve it are greatly appreciated.
>>
>> Is this reproducible every time? If so, is it always at the same
>> point in the LZO file that it occurs?
>
> It's at the same point. Do you know how to print out the lzo index for
> the task? I only print out the input file now.

You should be able to downcast the InputSplit to FileSplit, if you're
using the new API. From there you can get the start and length of the
split.
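Something along these lines in the mapper's setup() should do it. This
is just a rough sketch (untested); the mapper class name and key/value
types are placeholders for your own:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class SplitLoggingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      InputSplit split = context.getInputSplit();
      if (split instanceof FileSplit) {
        FileSplit fs = (FileSplit) split;
        // start/length identify which byte range of the .lzo file
        // (i.e. which indexed region) this task is reading.
        System.err.println("Processing " + fs.getPath()
            + " start=" + fs.getStart()
            + " length=" + fs.getLength());
      }
    }
  }

The task logs will then tell you which region of the file the failing
task was assigned.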
>> Would it be possible to download that lzo file to your local box and
>> use lzop -d to see if it decompresses successfully? That way we can
>> isolate whether it's a compression bug or a decompression bug.
>
> Both the Java LzoDecompressor and lzop -d were able to decompress the
> file correctly. As a matter of fact, my job does not index .lzo files
> now but processes each as a whole, and it works.

Interesting. If you can somehow make a reproducible test case I'd be
happy to look into this.

Thanks
-Todd
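P.S. For anyone on a build without the indexer fix above, the
workaround Steve described amounts to wrapping the indexer invocation
and ignoring the failure when there is nothing left to index. A rough
sketch (untested; assumes DistributedLzoIndexer from hadoop-lzo, which
implements Tool):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ToolRunner;
  import com.hadoop.compression.lzo.DistributedLzoIndexer;

  public class SafeIndexer {
    public static void main(String[] args) throws Exception {
      try {
        ToolRunner.run(new Configuration(), new DistributedLzoIndexer(),
            args);
      } catch (Exception e) {
        // If every .lzo input already has an index, the indexer is
        // left with no input paths and job submission fails; treat
        // that case as a no-op.
        System.err.println("Ignoring indexer error (nothing to index?): "
            + e);
      }
    }
  }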