What is your compression format: gzip, LZO, or Snappy?

For LZO final output:

    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

In addition, to make LZO splittable, you need to create an LZO index file.

On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew <kchew...@gmail.com> wrote:

> Thanks folks.
>
> I was not aware that my input data file had been compressed.
> FileOutputFormat.setCompressOutput() was set to true when the file was
> written. 8-(
>
> Kim
>
> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <mostafa.g....@gmail.com> wrote:
>
>> The following might answer you partially:
>>
>> The input key is not read from HDFS; it is auto-generated as the offset of
>> the input value in the input file. I think that is (partially) why HDFS
>> bytes read is smaller than HDFS bytes written.
>>
>> On Mar 27, 2014 1:34 PM, "Kim Chew" <kchew...@gmail.com> wrote:
>>
>>> I am also wondering: if, say, I have two identical timestamps, they are
>>> going to be written to the same file. Does MultipleOutputs handle
>>> appending?
>>>
>>> Thanks.
>>>
>>> Kim
>>>
>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <t...@bentzn.com> wrote:
>>>
>>>> Have you checked the content of the files you write?
>>>>
>>>> /th
>>>>
>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>> > I have a simple M/R job using a Mapper only, thus no reducer. The
>>>> > mapper reads a timestamp from the value, generates a path to the
>>>> > output file, and writes the key and value to the output file.
>>>> >
>>>> > The input file is a sequence file, not compressed and stored in
>>>> > HDFS; it has a size of 162.68 MB.
>>>> >
>>>> > The output is also written as a sequence file.
>>>> >
>>>> > However, after I ran my job, I have two output part files from the
>>>> > mapper. One has a size of 835.12 MB and the other has a size of
>>>> > 224.77 MB. So why is the total output size so much larger?
>>>> > Shouldn't it be more or less equal to the input's size of 162.68 MB,
>>>> > since I just write the key and value passed to the mapper to the
>>>> > output?
>>>> >
>>>> > Here is the mapper code snippet:
>>>> >
>>>> > public void map(BytesWritable key, BytesWritable value, Context context)
>>>> >         throws IOException, InterruptedException {
>>>> >
>>>> >     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>>>> >     String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>> >
>>>> >     // mos is a MultipleOutputs object.
>>>> >     mos.write(key, value, generateFileName(tsStr));
>>>> > }
>>>> >
>>>> > private String generateFileName(String key) {
>>>> >     return outputDir + "/" + key + "/raw-vectors";
>>>> > }
>>>> >
>>>> > And here are the job outputs:
>>>> >
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1111374798
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
>>>> >
>>>> > TIA,
>>>> >
>>>> > Kim
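
As a rough sanity check on the counters quoted in this thread, the byte counts can be converted to megabytes and compared with the file sizes Kim reported. The sketch below is plain Java with no Hadoop dependency; the class and method names (CounterCheck, toMiB, expansionRatio) are made up for illustration, and it assumes the reported "MB" figures use 1 MB = 1024 * 1024 bytes.

```java
public class CounterCheck {
    // Counter values copied from the job output quoted in this thread.
    static final long HDFS_BYTES_READ = 171086386L;
    static final long HDFS_BYTES_WRITTEN = 1111374798L;

    /** Converts a byte count to megabytes, assuming 1 MB = 1024 * 1024 bytes. */
    public static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    /** Returns how many times larger the written byte count is than the read byte count. */
    public static double expansionRatio(long bytesRead, long bytesWritten) {
        return (double) bytesWritten / bytesRead;
    }

    public static void main(String[] args) {
        System.out.printf("HDFS read:    %.2f MB%n", toMiB(HDFS_BYTES_READ));
        System.out.printf("HDFS written: %.2f MB%n", toMiB(HDFS_BYTES_WRITTEN));
        System.out.printf("written/read: %.2f%n",
                expansionRatio(HDFS_BYTES_READ, HDFS_BYTES_WRITTEN));
    }
}
```

Under that assumption, HDFS_BYTES_WRITTEN comes to about 1059.89 MB, which matches the two part files (835.12 MB + 224.77 MB = 1059.89 MB), and the output is roughly 6.5 times the bytes read. That would be consistent with a compressed input being written back out without compression, as Kim found.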