Hi. As mentioned in the previous post, I tried to extend some legacy programs built on Hadoop 0.19.2 to apply LZO compression. I ran into tons of problems (logical errors and implementation troubles). After spending a whole week on it, I finally feel I have sorted things out; however, several questions remain. I will raise the questions first, and I would appreciate a quick answer from anyone who knows. Then I will briefly review my experience with LZO on two versions of Hadoop (0.19.2 and 0.20.2).

=====begin of the questions=====================
Question 1: LzopCodec and LzoCodec?

According to Tom's book, LzopCodec is the LZO format with extra headers and is normally preferable; it generates *.lzo files. In contrast, LzoCodec is the pure LZO format and generates *.lzo_deflate files. In my understanding, both codecs should work. I haven't tried mixing the two codecs between the Mapper and Reducer, so I always use the same codec within a single program. However, I had a serious problem on Hadoop 0.19.2 when using the LzopCodec class (which is commonly suggested), configured like this:
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(LzopCodec.class);
        TextOutputFormat.setCompressOutput(conf,true);
        TextOutputFormat.setOutputCompressorClass(conf, LzopCodec.class);

The error was "Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out." and there were millions of them! Why did this happen? I had sufficient disk space and the code was exactly the same. So I had to stick with LzoCodec on Hadoop 0.19.2 (the exact same code, just replacing "LzopCodec" with "LzoCodec") and it worked fine. But then comes another problem: the output files are *.lzo_deflate, so which tool should I use to decompress them? The default lzop tool does not seem to support that, or does it?
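
One workaround I can think of (an untested sketch, not something I actually ran; the class name LzoDeflateCat is made up): let Hadoop itself decompress a *.lzo_deflate file through its CompressionCodecFactory, which picks the codec from the file suffix, assuming the LZO classes are registered in io.compression.codecs:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoDeflateCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path(args[0]);                      // e.g. part-00000.lzo_deflate
        FileSystem fs = p.getFileSystem(conf);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(p);    // resolved by file suffix
        if (codec == null) {
            System.err.println("No codec found for " + p);
            return;
        }
        InputStream in = codec.createInputStream(fs.open(p));
        IOUtils.copyBytes(in, System.out, conf, true);   // decompressed bytes to stdout
    }
}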

The story on Hadoop 0.20.2 is easier: both codecs work fine there.
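
For reference, a sketch of what the equivalent setup looks like on 0.20.2 with the new API (assuming the hadoop-lzo build, where the codec class is com.hadoop.compression.lzo.LzopCodec; the two map-output property names are the 0.20-era ones, and the class name LzoJobSetup is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.compression.lzo.LzopCodec;

public class LzoJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // compress the intermediate map output with LZO (0.20-era property names)
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      LzopCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "lzo example");
        // compress the final job output as *.lzo
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
        return job;
    }
}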

Question 2: Partitioner and Secondary sort in Hadoop 0.20.2

In my old code I use a Partitioner for the secondary sort. Basically it is the TextPair and KeyPartitioner classes mentioned in Tom's book, and I set the partitioner, key comparator, and value grouping as follows (conf is a JobConf):

conf.setPartitionerClass(KeyPartitioner.class);                        // this works on Hadoop 0.19.2
conf.setOutputKeyComparatorClass(TextPair.Comparator.class);           // this works on Hadoop 0.19.2
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class); // this works on Hadoop 0.19.2
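
For context, the KeyPartitioner is essentially the old-API one from Tom's book, which partitions on the first field of the TextPair only; reproduced from memory as a sketch:

// Old-API (org.apache.hadoop.mapred) KeyPartitioner, essentially as in Tom's book:
// partition on the first field of TextPair only, so records sharing the first key
// go to the same reducer.
// needs: org.apache.hadoop.io.Text, org.apache.hadoop.mapred.JobConf,
//        org.apache.hadoop.mapred.Partitioner, and the TextPair class from the book
public static class KeyPartitioner implements Partitioner<TextPair, Text> {
    public void configure(JobConf job) { }

    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}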

When migrating to Hadoop 0.20.2, I blindly configured the Job class (job) with the similar-looking methods:

job.setPartitionerClass(KeyPartitioner.class);                  // notice that this is wrong in Hadoop 0.20.2!
job.setSortComparatorClass(TextPair.Comparator.class);          // notice that this is wrong in Hadoop 0.20.2!
job.setGroupingComparatorClass(TextPair.FirstComparator.class); // notice that this is wrong in Hadoop 0.20.2!

However, watch out! This was wrong! The TextPair was no longer grouped by the first key but by the combined key, and that caused tons of errors throughout the programs. So my question is: what is the equivalent setup in the new API?
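
For what it's worth, porting the KeyPartitioner itself to the new API would look roughly like the sketch below (the new-API Partitioner is an abstract class in org.apache.hadoop.mapreduce rather than an interface in org.apache.hadoop.mapred); whether the sort and grouping comparators then behave the same way is exactly what I am not sure about:

// New-API sketch: same partitioning logic, but extending the abstract
// org.apache.hadoop.mapreduce.Partitioner instead of implementing the old interface.
// needs: org.apache.hadoop.io.Text and the TextPair class from the book
public static class KeyPartitioner extends org.apache.hadoop.mapreduce.Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}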


=========end of the questions=====================

It seems that I have to rely on the old API for a while, because upgrading all of these programs and testing them for correctness is not so easy. Therefore I installed the LZO native library on Hadoop 0.19.2. The whole process was almost the same as on Hadoop 0.20.2. I was assuming that Hadoop 0.19.2 should support LZO without any additional trouble, because in that version LZO was still bundled, correct? But lol, no! The main difference is that in 0.19.2 four pieces of code need their visibility widened before compiling (see the sketch after the list for why this matters). They are:

1. src/core/org/apache/hadoop/io/compress/BlockCompressorStream.java (add "public")
public class BlockCompressorStream extends CompressorStream

2. src/core/org/apache/hadoop/io/compress/BlockDecompressorStream.java (add "public")
public class BlockDecompressorStream extends DecompressorStream

3. src/core/org/apache/hadoop/io/compress/CompressorStream.java (add "public" and "protected")

public class CompressorStream extends CompressionOutputStream {
protected Compressor compressor;
protected byte[] buffer;
protected boolean closed = false;

4. src/core/org/apache/hadoop/io/compress/DecompressorStream.java (add "public" and "protected")

public class DecompressorStream extends CompressionInputStream {
protected Decompressor decompressor = null;
protected byte[] buffer;
protected boolean eof = false;
protected boolean closed = false;
public void checkStream() throws IOException {
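
Why the wider visibility matters: the LZO codec's stream classes live in a different package and subclass these core stream classes, so the superclasses (and the fields they touch) have to be public/protected. A hypothetical subclass just to illustrate the point (the package and class names here are made up; the real LZO codec classes are more involved):

package com.example.lzo;  // made-up package, purely for illustration

import java.io.OutputStream;

import org.apache.hadoop.io.compress.BlockCompressorStream;
import org.apache.hadoop.io.compress.Compressor;

// This only compiles once BlockCompressorStream is public and the
// "compressor" field inherited from CompressorStream is protected.
public class MyLzoOutputStream extends BlockCompressorStream {

    public MyLzoOutputStream(OutputStream out, Compressor compressor,
                             int bufferSize, int compressionOverhead) {
        super(out, compressor, bufferSize, compressionOverhead);
    }

    // accessing the inherited protected field from another package
    public boolean compressorFinished() {
        return compressor.finished();
    }
}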

Then I used "ant -Dcompile.native=true tar" and the remaining steps are pretty much the same as Hadoop 0.20.2. I add four lines in the "hadoop-env.sh" file because I dont have root account to change the global settings so everything was installed in my home directory. The "hadoop-0.19.3-dev-core.jar" may have different names depending on what source code you compiled, for instance, hadoop-gpl-compression or hadoop-lzo-...

export JAVA_LIBRARY_PATH=/path_to_hadoop_installation/hadoop-0.19.2/lib/native/Linux-amd64-64:/path_to_hadoop_installation/hadoop-0.19.2:/path_to_hadoop_installation/hadoop-0.19.2/lib
export C_INCLUDE_PATH=/path_to_lzo_installation/lzo_204_lib/include
export LD_LIBRARY_PATH=/path_to_lzo_installation/lzo_204_lib/lib
export HADOOP_CLASSPATH=/path_to_hadoop_installation/hadoop-0.19.2/hadoop-0.19.3-dev-core.jar

Of course, WATCH OUT for the JAVA_LIBRARY_PATH=" " trap in the hadoop command; there is no patch for 0.19.2, so I just commented that line out! Another thing to be aware of is that in Hadoop 0.19.2 the LZO codecs can have a different package name; in my case they should be "org.apache.hadoop.io.compress.LzopCodec" and "org.apache.hadoop.io.compress.LzoCodec". So the entry in the hadoop-default.xml file should be:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.LzopCodec</value>
<description>A list of the compression codec classes that can be used
               for compression/decompression.</description>
</property>

I had a big problem because I had set them as "com.hadoop.compression.lzo.LzoCodec" and "com.hadoop.compression.lzo.LzopCodec" (which are the names used on Hadoop 0.20.2). This caused lots of errors whenever I ran hadoop (even the simplest program that doesn't use LZO), because it couldn't load the correct classes. So double-check the jar file built by ant and don't make any mistake here.
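
A quick sanity check that should catch this kind of mistake before running any job (a minimal sketch; the class name CodecCheck is made up): try to load every class listed in io.compression.codecs with the same configuration the jobs use.

import org.apache.hadoop.conf.Configuration;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // io.compression.codecs is a comma-separated list of codec class names
        String codecs = conf.get("io.compression.codecs", "");
        for (String name : codecs.split(",")) {
            name = name.trim();
            if (name.isEmpty()) continue;
            try {
                conf.getClassByName(name);
                System.out.println("OK:      " + name);
            } catch (ClassNotFoundException e) {
                System.out.println("MISSING: " + name);
            }
        }
    }
}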

That is my battle with LZO on the two versions. I hope my experience is helpful to anyone in the same situation.

Shi
