Hi. As mentioned in the previous post, I tried to extend some legacy
programs built on Hadoop 0.19.2 to apply LZO compression. I ran into
tons of problems (logical errors and implementation troubles). After
spending a whole week, I finally feel I have sorted things out, though
several questions remain. I would like to raise the questions first,
and I would appreciate a quick answer from anyone who knows. Then I
will briefly review what I experienced with LZO on two versions of
Hadoop (0.19.2 and 0.20.2).
=====begin of the questions=====================
Question 1: LzopCodec or LzoCodec?
According to Tom's book, LzopCodec is the LZO format with extra headers
and is normally preferable; it generates *.lzo files. In contrast,
LzoCodec is the pure LZO format and generates *.lzo_deflate files. As
far as I understand, both codecs should work. I haven't tried mixing
the two codecs between the Mapper and the Reducer, so I always use the
same codec within a single program.
However, I hit a serious problem on Hadoop 0.19.2 (see how I
implemented it later) when using the LzopCodec class (which is the one
commonly suggested), configured like this:

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(LzopCodec.class);
TextOutputFormat.setCompressOutput(conf, true);
TextOutputFormat.setOutputCompressorClass(conf, LzopCodec.class);

The job failed with millions of errors like:

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Why did this happen? I had sufficient disk space and the code was
otherwise identical. So on Hadoop 0.19.2 I had to stick with LzoCodec
(the exact same code, just replacing "LzopCodec" with "LzoCodec"), and
it worked fine. But that raises another problem: the output files are
*.lzo_deflate. Which tool should I use to decompress them? The default
lzop tool does not seem to support that format, or does it?
The story on Hadoop 0.20.2 is easier: both codecs work fine there.
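For what it's worth, the one way I know to read back a *.lzo_deflate file without lzop is to let Hadoop's own codec machinery do the decompression. This is a sketch, assuming the codec is registered in io.compression.codecs (as described further down) and the native LZO library is on the library path; the HDFS path is just an example:

```shell
# "hadoop fs -text" picks a codec from the file extension
# (.lzo_deflate -> LzoCodec) and prints the decompressed content;
# this assumes LzoCodec is listed in io.compression.codecs and the
# native LZO library can be loaded.
hadoop fs -text /path/on/hdfs/part-00000.lzo_deflate > part-00000.txt
```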
Question 2: Partitioner and Secondary sort in Hadoop 0.20.2
In my old code I used a Partitioner for secondary sort; basically it
was the TextPair and KeyPartitioner classes mentioned in Tom's book. I
set the partitioner, key comparator, and value grouping as follows
(conf is a JobConf):
conf.setPartitionerClass(KeyPartitioner.class); // this works on Hadoop 0.19.2
conf.setOutputKeyComparatorClass(TextPair.Comparator.class); // this works on Hadoop 0.19.2
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class); // this works on Hadoop 0.19.2
When migrating to Hadoop 0.20.2, I blindly configured the Job (job)
class using the similar-looking methods:

job.setPartitionerClass(KeyPartitioner.class); // notice: this is wrong on Hadoop 0.20.2!
job.setSortComparatorClass(TextPair.Comparator.class); // notice: this is wrong on Hadoop 0.20.2!
job.setGroupingComparatorClass(TextPair.FirstComparator.class); // notice: this is wrong on Hadoop 0.20.2!
However, watch out: this was wrong! The TextPairs were not grouped by
the first key but by the combined key, which caused tons of errors
throughout the program. So my question is: what is the equivalent
setup in the new API?
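To make concrete what the old-API setup is supposed to achieve, here is a standalone sketch of the first-key partitioning idea (plain Java, no Hadoop dependencies; the composite key is simplified to a pair of Strings, and the hashing scheme mirrors my understanding of KeyPartitioner from Tom's book):

```java
// Standalone sketch: partition a composite (first, second) key on the
// first component only, so every record sharing a first key lands in
// the same partition (i.e. goes to the same reducer), while the second
// component is left free to drive the sort order.
public class FirstKeyPartitionerSketch {

    public static int getPartition(String first, String second, int numPartitions) {
        // `second` is deliberately ignored; the mask keeps the hash
        // non-negative before taking the modulus.
        return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = getPartition("user42", "2010-01-01", 8);
        int p2 = getPartition("user42", "2010-12-31", 8);
        // Same first key must co-locate regardless of the second key.
        if (p1 != p2) throw new AssertionError("same first key must co-locate");
        System.out.println("user42 -> partition " + p1);
    }
}
```

The grouping comparator plays the matching role on the reduce side: it compares only the first component, so all values for one first key arrive in a single reduce() call.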
=========end of the questions=====================
It seems that I will have to rely on the old API for a while, because
upgrading everything and re-testing for correctness is not easy.
Therefore I installed the LZO native library on Hadoop 0.19.2. The
whole process was almost the same as on Hadoop 0.20.2. I had assumed
that Hadoop 0.19.2 would support LZO without any additional trouble,
because in that version LZO was still bundled, correct? But lol, no!
The main difference is that in 0.19.2, four pieces of code need to be
changed before compiling. They are:
1. src/core/org/apache/hadoop/io/compress/BlockCompressorStream.java
(add "public")
public class BlockCompressorStream extends CompressorStream
2. src/core/org/apache/hadoop/io/compress/BlockDecompressorStream.java
(add "public")
public class BlockDecompressorStream extends DecompressorStream
3. src/core/org/apache/hadoop/io/compress/CompressorStream.java (add
"public" and "protected")
public class CompressorStream extends CompressionOutputStream {
protected Compressor compressor;
protected byte[] buffer;
protected boolean closed = false;
4. src/core/org/apache/hadoop/io/compress/DecompressorStream.java (add
"public" and "protected")
public class DecompressorStream extends CompressionInputStream {
protected Decompressor decompressor = null;
protected byte[] buffer;
protected boolean eof = false;
protected boolean closed = false;
public void checkStream() throws IOException {
Then I built with "ant -Dcompile.native=true tar"; the remaining steps
are pretty much the same as for Hadoop 0.20.2. I added four lines to
the "hadoop-env.sh" file because I don't have a root account to change
the global settings, so everything was installed in my home directory.
The "hadoop-0.19.3-dev-core.jar" may have a different name depending
on which source code you compiled, for instance
hadoop-gpl-compression or hadoop-lzo-...
export JAVA_LIBRARY_PATH=/path_to_hadoop_installation/hadoop-0.19.2/lib/native/Linux-amd64-64:/path_to_hadoop_installation/hadoop-0.19.2:/path_to_hadoop_installation/hadoop-0.19.2/lib
export C_INCLUDE_PATH=/path_to_lzo_installation/lzo_204_lib/include
export LD_LIBRARY_PATH=/path_to_lzo_installation/lzo_204_lib/lib
export HADOOP_CLASSPATH=/path_to_hadoop_installation/hadoop-0.19.2/hadoop-0.19.3-dev-core.jar
Of course, WATCH OUT for the JAVA_LIBRARY_PATH=" " trap in the hadoop
command script; there is no patch for 0.19.2, so I just commented that
line out!
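For the record, the offending part of bin/hadoop looks roughly like the fragment below (the exact wording may differ in your copy; the point is that the script resets JAVA_LIBRARY_PATH unconditionally, clobbering the value exported from hadoop-env.sh):

```shell
# In bin/hadoop, an unconditional reset like this wipes out any
# JAVA_LIBRARY_PATH exported from hadoop-env.sh; commenting it out
# lets the exported value through so the native LZO library is found.
# JAVA_LIBRARY_PATH=''
```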
Another thing to be aware of: in Hadoop 0.19.2 the LZO codecs can have
a different package name; in my case it should be
"org.apache.hadoop.io.compress.LzopCodec" or
"org.apache.hadoop.io.compress.LzoCodec". So the entry in the
hadoop-default.xml file should be:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.LzopCodec</value>
<description>A list of the compression codec classes that can be used
for compression/decompression.</description>
</property>
I had a big problem because I set them as
"com.hadoop.compression.lzo.LzoCodec" and
"com.hadoop.compression.lzo.LzopCodec" (the names used on Hadoop
0.20.2). This caused lots of errors whenever I ran Hadoop (even the
simplest program without LZO), because it couldn't load the codec
classes. So double-check the jar file built by ant and don't make any
mistake here.
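A cheap way to catch this kind of mistake early is to check whether a given class name actually resolves before wiring it into the XML. Here is a small generic checker (plain Java; run it with the same classpath as your hadoop command, e.g. including the core jar you built — the codec name below is just an example argument):

```java
// Classpath sanity check: can a given fully-qualified class name be
// loaded? Useful for verifying codec names before listing them in
// io.compression.codecs.
public class CodecClassCheck {

    public static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0 ? args
                : new String[] { "org.apache.hadoop.io.compress.LzoCodec" };
        for (String n : names) {
            System.out.println(n + " -> " + (isLoadable(n) ? "OK" : "NOT FOUND"));
        }
    }
}
```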
That is my battle with LZO on two versions of Hadoop. I hope my
experience is useful to anyone in the same situation.
Shi