Hi. As mentioned in the previous post, I tried to extend some legacy
programs built on Hadoop 0.19.2 to apply LZO compression. I ran into
tons of problems (logical errors and implementation troubles). After
spending a whole week, I finally feel I have sorted things out, though
several questions remain. I would like to raise the questions first,
and I would appreciate a quick answer from anyone who knows. Then I
will briefly review what I experienced with LZO on two versions of
Hadoop (0.19.2 and 0.20.2).
=====begin of the questions=====================
Question 1: LzopCodec or LzoCodec?
According to Tom's book, LzopCodec is the LZO format with extra headers
and is normally preferable; it generates *.lzo files. In contrast,
LzoCodec is the pure LZO format and generates *.lzo_deflate files. As
far as I understand, both codecs should work. I haven't tried mixing
the two codecs between the Mapper and the Reducer, so I always use the
same codec within a single program.
However, I hit a serious problem on Hadoop 0.19.2 (see how I
implemented it later) when using the LzopCodec class (which is the one
commonly suggested), configured like this:

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(LzopCodec.class);
TextOutputFormat.setCompressOutput(conf, true);
TextOutputFormat.setOutputCompressorClass(conf, LzopCodec.class);

The job failed with millions of errors like:

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Why did this happen? I had sufficient disk space and the code was
otherwise identical. So on Hadoop 0.19.2 I had to stick with LzoCodec
(the exact same code, just replacing "LzopCodec" with "LzoCodec"), and
it worked fine. But that raises another problem: the output files are
*.lzo_deflate. Which tool should I use to decompress them? The default
lzop tool does not seem to support that format, or does it?
The story on Hadoop 0.20.2 is easier: both codecs work fine there.
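For what it's worth, the one way I know to read back a *.lzo_deflate file without lzop is to let Hadoop's own codec machinery do the decompression. This is a sketch, assuming the codec is registered in io.compression.codecs (as described further down) and the native LZO library is on the library path; the HDFS path is just an example:

```shell
# "hadoop fs -text" picks a codec from the file extension
# (.lzo_deflate -> LzoCodec) and prints the decompressed content;
# this assumes LzoCodec is listed in io.compression.codecs and the
# native LZO library can be loaded.
hadoop fs -text /path/on/hdfs/part-00000.lzo_deflate > part-00000.txt
```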
Question 2: Partitioner and Secondary sort in Hadoop 0.20.2
In my old code I used a Partitioner for secondary sort; basically it
was the TextPair and KeyPartitioner classes mentioned in Tom's book. I
set the partitioner, key comparator, and value grouping as follows
(conf is a JobConf):
conf.setPartitionerClass(KeyPartitioner.class); // this works on Hadoop 0.19.2
conf.setOutputKeyComparatorClass(TextPair.Comparator.class); // this works on Hadoop 0.19.2
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class); // this works on Hadoop 0.19.2
When migrating to Hadoop 0.20.2, I blindly configured the Job (job)
class using the similar-looking methods:

job.setPartitionerClass(KeyPartitioner.class); // notice: this is wrong on Hadoop 0.20.2!
job.setSortComparatorClass(TextPair.Comparator.class); // notice: this is wrong on Hadoop 0.20.2!
job.setGroupingComparatorClass(TextPair.FirstComparator.class); // notice: this is wrong on Hadoop 0.20.2!
However, watch out: this was wrong! The TextPairs were not grouped by
the first key but by the combined key, which caused tons of errors
throughout the program. So my question is: what is the equivalent
setup in the new API?
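To make concrete what the old-API setup is supposed to achieve, here is a standalone sketch of the first-key partitioning idea (plain Java, no Hadoop dependencies; the composite key is simplified to a pair of Strings, and the hashing scheme mirrors my understanding of KeyPartitioner from Tom's book):

```java
// Standalone sketch: partition a composite (first, second) key on the
// first component only, so every record sharing a first key lands in
// the same partition (i.e. goes to the same reducer), while the second
// component is left free to drive the sort order.
public class FirstKeyPartitionerSketch {

    public static int getPartition(String first, String second, int numPartitions) {
        // `second` is deliberately ignored; the mask keeps the hash
        // non-negative before taking the modulus.
        return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = getPartition("user42", "2010-01-01", 8);
        int p2 = getPartition("user42", "2010-12-31", 8);
        // Same first key must co-locate regardless of the second key.
        if (p1 != p2) throw new AssertionError("same first key must co-locate");
        System.out.println("user42 -> partition " + p1);
    }
}
```

The grouping comparator plays the matching role on the reduce side: it compares only the first component, so all values for one first key arrive in a single reduce() call.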
=========end of the questions=====================
It seems that I will have to rely on the old API for a while, because
upgrading everything and re-testing for correctness is not easy.
Therefore I installed the LZO native library on Hadoop 0.19.2. The
whole process was almost the same as on Hadoop 0.20.2. I had assumed
that Hadoop 0.19.2 would support LZO without any additional trouble,
because in that version LZO was still bundled, correct? But lol, no!
The main difference is that in 0.19.2, four pieces of code need to be
changed before compiling. They are:
1. src/core/org/apache/hadoop/io/compress/BlockCompressorStream.java
(add "public")
public class BlockCompressorStream extends CompressorStream
2. src/core/org/apache/hadoop/io/compress/BlockDecompressorStream.java
(add "public")
public class BlockDecompressorStream extends DecompressorStream
3. src/core/org/apache/hadoop/io/compress/CompressorStream.java (add
"public" and "protected")
public class CompressorStream extends CompressionOutputStream {
protected Compressor compressor;
protected byte[] buffer;
protected boolean closed = false;
4. src/core/org/apache/hadoop/io/compress/DecompressorStream.java (add
"public" and "protected")
public class DecompressorStream extends CompressionInputStream {
protected Decompressor decompressor = null;
protected byte[] buffer;
protected boolean eof = false;
protected boolean closed = false;
public void checkStream() throws IOException {
Then I built with "ant -Dcompile.native=true tar"; the remaining steps
are pretty much the same as for Hadoop 0.20.2. I added four lines to
the "hadoop-env.sh" file because I don't have a root account to change
the global settings, so everything was installed in my home directory.
The "hadoop-0.19.3-dev-core.jar" may have a different name depending
on which source code you compiled, for instance
hadoop-gpl-compression or hadoop-lzo-...
export JAVA_LIBRARY_PATH=/path_to_hadoop_installation/hadoop-0.19.2/lib/native/Linux-amd64-64:/path_to_hadoop_installation/hadoop-0.19.2:/path_to_hadoop_installation/hadoop-0.19.2/lib
export C_INCLUDE_PATH=/path_to_lzo_installation/lzo_204_lib/include
export LD_LIBRARY_PATH=/path_to_lzo_installation/lzo_204_lib/lib
export HADOOP_CLASSPATH=/path_to_hadoop_installation/hadoop-0.19.2/hadoop-0.19.3-dev-core.jar
Of course, WATCH OUT for the JAVA_LIBRARY_PATH=" " trap in the hadoop
command script; there is no patch for 0.19.2, so I just commented that
line out!
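For the record, the offending part of bin/hadoop looks roughly like the fragment below (the exact wording may differ in your copy; the point is that the script resets JAVA_LIBRARY_PATH unconditionally, clobbering the value exported from hadoop-env.sh):

```shell
# In bin/hadoop, an unconditional reset like this wipes out any
# JAVA_LIBRARY_PATH exported from hadoop-env.sh; commenting it out
# lets the exported value through so the native LZO library is found.
# JAVA_LIBRARY_PATH=''
```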
Another thing to be aware of: in Hadoop 0.19.2 the LZO codecs can have
a different package name; in my case it should be
"org.apache.hadoop.io.compress.LzopCodec" or
"org.apache.hadoop.io.compress.LzoCodec". So the entry in the
hadoop-default.xml file should be:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.LzopCodec</value>
<description>A list of the compression codec classes that can be used
for compression/decompression.</description>
</property>
I had a big problem because I set them as
"com.hadoop.compression.lzo.LzoCodec" and
"com.hadoop.compression.lzo.LzopCodec" (the names used on Hadoop
0.20.2). This caused lots of errors whenever I ran Hadoop (even the
simplest program without LZO), because it couldn't load the codec
classes. So double-check the jar file built by ant and don't make any
mistake here.
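A cheap way to catch this kind of mistake early is to check whether a given class name actually resolves before wiring it into the XML. Here is a small generic checker (plain Java; run it with the same classpath as your hadoop command, e.g. including the core jar you built — the codec name below is just an example argument):

```java
// Classpath sanity check: can a given fully-qualified class name be
// loaded? Useful for verifying codec names before listing them in
// io.compression.codecs.
public class CodecClassCheck {

    public static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0 ? args
                : new String[] { "org.apache.hadoop.io.compress.LzoCodec" };
        for (String n : names) {
            System.out.println(n + " -> " + (isLoadable(n) ? "OK" : "NOT FOUND"));
        }
    }
}
```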
That is my battle with LZO on two versions of Hadoop. I hope my
experience is useful to anyone in the same situation.
Shi