[ https://issues.apache.org/jira/browse/HADOOP-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223149#comment-15223149 ]

John Zhuge commented on HADOOP-12990:
-------------------------------------

Managed to hack the output of the OS lz4 tool (r114) into a form that Hadoop (r123 lib) can read:
{code}
$ lz4 -h 2>&1 | head -1
*** LZ4 Compression CLI 64-bits r114, by Yann Collet (Apr 14 2014) ***

$ lz4 10rows.txt 10rows.txt.r114.lz4
Compressed 310 bytes into 105 bytes ==> 33.87%

$ od -t x1 10rows.txt.r114.lz4
0000000 04 22 4d 18 64 70 b9 56 00 00 00 ff 13 30 30 31
0000020 7c 63 31 7c 63 32 7c 63 33 7c 63 34 7c 63 35 7c
0000040 63 36 7c 63 37 7c 63 38 7c 63 39 0a 30 30 32 1f
0000060 00 0b 1f 33 1f 00 0b 1f 34 1f 00 0b 1f 35 1f 00
0000100 0b 1f 36 1f 00 0b 1f 37 1f 00 0b 1f 38 1f 00 0b
0000120 1f 39 1f 00 0a 2f 31 30 1f 00 04 50 38 7c 63 39
0000140 0a 00 00 00 00 eb 01 45 d5
0000151
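### Dump decoded (per the LZ4 frame spec): "04 22 4d 18" is the frame magic
### 0x184D2204 (little-endian), "64 70 b9" the FLG/BD/HC frame descriptor,
### "56 00 00 00" the 86-byte compressed block size (little-endian), and the
### trailing "00 00 00 00 eb 01 45 d5" the end mark plus content checksum.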

### Skip 7-byte header and 4-byte len ("56 00 00 00" is 86)
$ dd if=10rows.txt.r114.lz4 of=s11c86.r114.lz4 skip=11 bs=1 count=86
86+0 records in
86+0 records out
86 bytes (86 B) copied, 0.000288006 s, 299 kB/s

### Choose a block size > uncompressed content size
$ echo -ne '\x00\x01\x00\x00' > originalBlockSize

### Prepare the length in the byte order (big-endian) that Hadoop expects
$ echo -ne '\x00\x00\x00\x56' > len

### originalBlockSize + len + compressed_bytes
$ cat originalBlockSize len s11c86.r114.lz4 >a2.r114.lz4

$ hdfs dfs -put a2.r114.lz4 /tmp
$ hdfs dfs -text /tmp/a2.r114.lz4
16/04/02 23:34:52 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
001|c1|c2|c3|c4|c5|c6|c7|c8|c9
002|c1|c2|c3|c4|c5|c6|c7|c8|c9
003|c1|c2|c3|c4|c5|c6|c7|c8|c9
004|c1|c2|c3|c4|c5|c6|c7|c8|c9
005|c1|c2|c3|c4|c5|c6|c7|c8|c9
006|c1|c2|c3|c4|c5|c6|c7|c8|c9
007|c1|c2|c3|c4|c5|c6|c7|c8|c9
008|c1|c2|c3|c4|c5|c6|c7|c8|c9
009|c1|c2|c3|c4|c5|c6|c7|c8|c9
010|c1|c2|c3|c4|c5|c6|c7|c8|c9
{code}
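
For repeated use, the same reframing can be scripted. Below is a minimal Java sketch (the class name is mine, not part of Hadoop) of the three shell steps above: skip the 7-byte frame header, reuse the frame's little-endian block length, and write the two big-endian length words that Hadoop's BlockDecompressorStream expects. It assumes a single-block frame with no content-size field, like the dump above:
{code}
import java.io.*;
import java.nio.file.*;

// Hypothetical helper: reframe a single-block LZ4 frame file (as written
// by the lz4 CLI) into the block format Hadoop's Lz4Codec expects.
public class Lz4Reframe {
    public static void main(String[] args) throws IOException {
        byte[] in = Files.readAllBytes(Paths.get(args[0]));

        // Skip the 4-byte magic (0x184D2204) and 3-byte frame descriptor,
        // then read the 4-byte little-endian compressed block size.
        int off = 7;
        int compLen = (in[off] & 0xff)
                | (in[off + 1] & 0xff) << 8
                | (in[off + 2] & 0xff) << 16
                | (in[off + 3] & 0xff) << 24;
        off += 4;

        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(args[1]))) {
            // Hadoop's block framing: two big-endian ints, the original
            // (uncompressed) block size and the compressed chunk length,
            // followed by the compressed bytes.
            out.writeInt(64 * 1024);  // any value >= the uncompressed size
            out.writeInt(compLen);    // 86 in the example above
            out.write(in, off, compLen);
        }
    }
}
{code}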

> lz4 incompatibility between OS and Hadoop
> -----------------------------------------
>
>                 Key: HADOOP-12990
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12990
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io, native
>    Affects Versions: 2.6.0
>            Reporter: John Zhuge
>            Priority: Minor
>
> {{hdfs dfs -text}} hits an exception when trying to view a compressed file 
> created by the Linux lz4 tool.
> The Hadoop build includes HADOOP-11184 "update lz4 to r123", so it uses the 
> LZ4 library at release r123.
> Linux lz4 version:
> {code}
> $ /tmp/lz4 -h 2>&1 | head -1
> *** LZ4 Compression CLI 64-bits r123, by Yann Collet (Apr  1 2016) ***
> {code}
> Test steps:
> {code}
> $ cat 10rows.txt
> 001|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 002|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 003|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 004|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 005|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 006|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 007|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 008|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 009|c1|c2|c3|c4|c5|c6|c7|c8|c9
> 010|c1|c2|c3|c4|c5|c6|c7|c8|c9
> $ /tmp/lz4 10rows.txt 10rows.txt.r123.lz4
> Compressed 310 bytes into 105 bytes ==> 33.87%
> $ hdfs dfs -put 10rows.txt.r123.lz4 /tmp
> $ hdfs dfs -text /tmp/10rows.txt.r123.lz4
> 16/04/01 08:19:07 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
>     at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)
> {code}
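
For reference, the OOM above is consistent with the codec misreading the LZ4 frame header as its own block framing: a quick sketch of the two big-endian ints BlockDecompressorStream would see, assuming the r123 file starts with the same 11 bytes as the r114 dump in my comment (the compressed sizes match, both 105 bytes):
{code}
// Hypothetical illustration of the misread, not Hadoop code.
public class WhyOom {
    public static void main(String[] args) {
        byte[] head = {0x04, 0x22, 0x4d, 0x18, 0x64, 0x70, (byte) 0xb9, 0x56};
        int originalBlockSize = beInt(head, 0);  // 0x04224D18 = 69,356,824
        int compressedChunkLen = beInt(head, 4); // 0x6470B956 = 1,685,109,078
        System.out.printf("original=%,d compressed=%,d%n",
                originalBlockSize, compressedChunkLen);
        // Allocating a ~1.7 GB buffer for the "chunk" would trigger the
        // OutOfMemoryError seen in getCompressedData above.
    }
    static int beInt(byte[] b, int i) {
        return (b[i] & 0xff) << 24 | (b[i + 1] & 0xff) << 16
                | (b[i + 2] & 0xff) << 8 | (b[i + 3] & 0xff);
    }
}
{code}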


