Hey Dmitriy,

This is very interesting (and worrisome, in a way!). I'll try to take a look
this afternoon.

-Todd

On Thu, Apr 1, 2010 at 12:16 AM, Dmitriy Ryaboy <dmit...@twitter.com> wrote:

> Hi folks,
> We write a lot of lzo-compressed files to HDFS -- some via scribe,
> some using internal tools. Occasionally, we discover that the created
> lzo files cannot be read from HDFS -- they get through some (often
> large) portion of the file, and then fail with the following stack
> trace:
>
> Exception in thread "main" java.lang.InternalError: lzo1x_decompress_safe returned:
>        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
>        at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
>        at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
>        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
>        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>        at java.io.InputStream.read(InputStream.java:85)
>        at com.twitter.twadoop.jobs.LzoReadTest.main(LzoReadTest.java:51)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> The initial thought is of course that the lzo file is corrupt --
> however, plain-jane lzop is able to read these files. Moreover, if we
> pull the files out of hadoop, uncompress them, compress them again,
> and put them back into HDFS, we can usually read them from HDFS as
> well.
>
> We've been thinking that this strange behavior is caused by a bug in
> the hadoop-lzo libraries (we use the version with Twitter and Cloudera
> fixes, on github: http://github.com/kevinweil/hadoop-lzo ).
> However, today I discovered that using the exact same environment,
> codec, and InputStreams, we can successfully read from the local file
> system, but cannot read from HDFS. This appears to point at possible
> issues in the FSDataInputStream or further down the stack.
>
> Here's a small test class that tries to read the same file from HDFS
> and from the local FS, and the output of running it on our cluster.
> We are using the CDH2 distribution.
>
> https://gist.github.com/e1bf7e4327c7aef56303
>
> Any ideas on what could be going on?
>
> Thanks,
> -Dmitriy
>
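
The kind of comparison described above can be approximated without the gist. The following is a minimal sketch, not the code from the gist: `ReadProbe`, `Result`, and `drain` are hypothetical names, and the byte-array streams in `main` are stand-ins for the decompressed HDFS and local-FS streams, which a real test would open through the Hadoop `FileSystem` API and the LZO codec. It drains a stream and records how many bytes were read before any failure, so the two reads can be compared side by side.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadProbe {
    /** Outcome of draining one stream: bytes read and the failure, if any. */
    static final class Result {
        final long bytesRead;
        final Throwable failure; // null when the stream reached EOF cleanly

        Result(long bytesRead, Throwable failure) {
            this.bytesRead = bytesRead;
            this.failure = failure;
        }

        @Override
        public String toString() {
            return bytesRead + " bytes, "
                    + (failure == null ? "clean EOF" : "failed with " + failure);
        }
    }

    /**
     * Reads the stream to EOF or to the first failure.
     * InternalError is caught deliberately because the stack trace in this
     * thread shows the decompressor throwing it. Bytes delivered by a read
     * call that ends up throwing are not counted.
     */
    static Result drain(InputStream in) {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return new Result(total, null);
        } catch (IOException | RuntimeException | InternalError e) {
            return new Result(total, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in data (assumption): a real test would pass the codec-wrapped
        // HDFS stream and the codec-wrapped local-FS stream instead.
        byte[] data = new byte[1 << 20];
        Result local = drain(new ByteArrayInputStream(data));
        Result hdfs = drain(new ByteArrayInputStream(data));
        System.out.println("local: " + local);
        System.out.println("hdfs:  " + hdfs);
    }
}
```

If the two results differ only for the HDFS-backed stream, the byte count at failure gives a rough idea of where in the decompressed output to start looking.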



-- 
Todd Lipcon
Software Engineer, Cloudera
