Geelong,

1.       These files will probably be in a standard format like .gz, .bz2, or .zip.  In that case, pick an appropriate InputFormat (a minimal driver sketch follows after point 2 below).  See e.g.
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
http://stackoverflow.com/questions/14497572/reading-gzipped-file-in-hadoop-using-custom-recordreader

2.       Generally, compression is a Good Thing and will improve performance, but only if you use a fast compressor like LZO or Snappy; Gzip, ZIP, BZ2, etc. are no good for this.  You also need to ensure that your compressed files are "splittable" if you are going to create a single file that will be processed by a later MR stage; a SequenceFile is helpful for this (the second sketch below shows one way).  For typical intermediate outputs it matters less, because you will have a folder of file parts and these are "pre-split" in some sense.  Once upon a time, LZO compression was something you had to install as a separate component, but I think the modern distros include it.  See for example:
http://kickstarthadoop.blogspot.com/2012/02/use-compression-with-mapreduce.html
http://blog.cloudera.com/blog/2009/05/10-mapreduce-tips/
http://my.safaribooksonline.com/book/software-engineering-and-development/9781449328917/compression/id3689058
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression (section 4.2 in the Elephant book)
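
On point 1: for plain .gz text you don't even need a custom reader; the stock TextInputFormat checks the file extension against the registered codecs and decompresses transparently (ZIP is the exception -- there is no built-in codec for it, hence the custom InputFormat in the first link).  A minimal driver sketch, with a placeholder class name and the input path taken from args:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ReadGzippedText {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-gz");
    job.setJarByClass(ReadGzippedText.class);
    // TextInputFormat looks up a codec by file extension (.gz here) and
    // decompresses on the fly.  A .gz file is NOT splittable, though, so
    // each file is consumed by a single mapper.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... set mapper/reducer and output types as usual, then:
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}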
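
On point 2, a rough sketch of both knobs -- Snappy for the intermediate map output, and a block-compressed SequenceFile for the final output so a later stage can split it.  This assumes the Snappy native libraries are on the cluster; the property names are the newer mapreduce.* variants (older releases use mapred.compress.map.output etc.):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output with a fast codec to cut shuffle I/O.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.set("mapreduce.map.output.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");

    Job job = Job.getInstance(conf, "compressed-output");
    job.setJarByClass(CompressedOutputJob.class);
    // Block-compressed SequenceFiles stay splittable regardless of codec,
    // because the file's sync markers mark record boundaries.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
    // ... input path, mapper/reducer, key/value types as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}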

John

From: Geelong Yao [mailto:geelong...@gmail.com]
Sent: Thursday, June 20, 2013 12:30 AM
To: user@hadoop.apache.org
Subject: some idea about the Data Compression

Hi, everyone

I am working on data compression:
1. Data compression before the raw data are uploaded into HDFS.
2. Data compression while processing in Hadoop, to reduce the pressure on I/O.


Can anyone give me some ideas on the above 2 directions?


BRs
Geelong

--
From Good To Great
