Stan,
See my comments inline.
Thanks, Hong
On May 18, 2010, at 8:44 AM, stan lee wrote:
Hi Guys,
I am trying to use compression to reduce the I/O workload when running
a job, but failed. I have several questions that need your help.
For LZO compression, I found a guide at
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ. Why does it
say "Note that you must have both 32-bit and 64-bit liblzo2
installed"? I am not sure whether that means we also need 32-bit
liblzo2 installed even when we are on a 64-bit system. If so, why?
The answer on the wiki page is to the question of how to set up the
native libraries so that both 32-bit AND 64-bit java would work. If
you adhere to an environment with the same flavor of java across the
whole cluster, then the solution would not apply to you.
Also, if I don't use LZO compression and instead try to use gzip to
compress the final reduce output, I just set the value below in
mapred-site.xml, but it doesn't seem to work. (How can I find the
final compressed .gz file? I used "hadoop dfs -ls <dir>" and didn't
find it.) My questions: can we use gzip to compress the final result
when it's not a streaming job, and how can we verify that compression
was enabled during job execution?
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
The truth is, this option is honored by the individual OutputFormat
implementations. If you use TextOutputFormat, you should see files
like "part-xxxx.gz" in the output directory. If you write your own
output format class, follow the implementations of TextOutputFormat
or SequenceFileOutputFormat to set up compression properly.
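For reference, with the old (mapred.*) property names, enabling gzip
output usually takes both of the properties below; the second one
selects the codec explicitly (without it you get the default
zlib/deflate codec rather than .gz files):

```xml
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```

You can then confirm that compression took effect by listing the
output directory with "hadoop dfs -ls <dir>" and checking that the
part files carry a .gz extension.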