Hey Dmitriy,

This is very interesting (and worrisome, in a way!). I'll try to take a look this afternoon.
-Todd

On Thu, Apr 1, 2010 at 12:16 AM, Dmitriy Ryaboy <dmit...@twitter.com> wrote:
> Hi folks,
> We write a lot of lzo-compressed files to HDFS -- some via scribe,
> some using internal tools. Occasionally, we discover that the created
> lzo files cannot be read from HDFS -- they get through some (often
> large) portion of the file, and then fail with the following stack
> trace:
>
> Exception in thread "main" java.lang.InternalError: lzo1x_decompress_safe returned:
>     at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
>     at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
>     at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
>     at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>     at java.io.InputStream.read(InputStream.java:85)
>     at com.twitter.twadoop.jobs.LzoReadTest.main(LzoReadTest.java:51)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> The initial thought is of course that the lzo file is corrupt --
> however, plain-jane lzop is able to read these files. Moreover, if we
> pull the files out of hadoop, uncompress them, compress them again,
> and put them back into HDFS, we can usually read them from HDFS as
> well.
>
> We've been thinking that this strange behavior is caused by a bug in
> the hadoop-lzo libraries (we use the version with Twitter and Cloudera
> fixes, on github: http://github.com/kevinweil/hadoop-lzo )
> However, today I discovered that using the exact same environment,
> codec, and InputStreams, we can successfully read from the local file
> system, but cannot read from HDFS. This appears to point at possible
> issues in the FSDataInputStream or further down the stack.
>
> Here's a small test class that tries to read the same file from HDFS
> and from the local FS, and the output of running it on our cluster.
> We are using the CDH2 distribution.
>
> https://gist.github.com/e1bf7e4327c7aef56303
>
> Any ideas on what could be going on?
>
> Thanks,
> -Dmitriy
>

--
Todd Lipcon
Software Engineer, Cloudera
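
One way to narrow down a report like the one quoted above is to check whether the raw (still compressed) bytes are identical when the same file is read from the two sources. The following is only a minimal sketch of such a check, not the test class from the gist: the class name `StreamCompare` and both method names are hypothetical, and it uses plain `java.io` streams. On a cluster the two streams might be obtained with `FileSystem.open` for the HDFS path and the local path; that wiring is an assumption and is left out here.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: compares two streams byte by byte.
public class StreamCompare {

    // Reads until buf is full or the stream ends. A single read() call may
    // return fewer bytes than requested, so loop. Returns the bytes read.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }

    // Returns the offset of the first byte at which the streams differ
    // (a length difference counts as a mismatch at the shorter length),
    // or -1 if both streams have identical content.
    static long firstMismatch(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[8192];
        byte[] bufB = new byte[8192];
        long offset = 0;
        while (true) {
            int nA = readFully(a, bufA);
            int nB = readFully(b, bufB);
            int common = Math.min(nA, nB);
            for (int i = 0; i < common; i++) {
                if (bufA[i] != bufB[i]) {
                    return offset + i;
                }
            }
            if (nA != nB) {
                return offset + common; // one stream ended early
            }
            if (nA < bufA.length) {
                return -1; // both streams ended at the same point
            }
            offset += nA;
        }
    }
}
```

If `firstMismatch` returns -1 for the HDFS and local copies, the stored bytes agree, which would suggest looking at how the bytes are delivered to the decompressor rather than at the file contents.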