Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-28 Thread Kim Chew
None of that. I checked the input file's SequenceFile header and it says "org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater". Kim On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya wrote: > what is your compression format gzip, lzo or snappy > > for lzo final output > > FileOutputFormat.s
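
For readers who want to repeat the header check described above, here is a minimal sketch in plain Java that reads the leading bytes of a SequenceFile and reports whether it declares a compression codec. It assumes the version 5/6 header layout (the `SEQ` magic, a version byte, key and value class names, two boolean flags, then the codec class name when compressed) and assumes class names shorter than 128 bytes so each length fits in one byte. The class `SeqHeaderPeek` and its methods are hypothetical names for illustration, not part of Hadoop.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Hadoop class: decodes the start of a SequenceFile header.
class SeqHeaderPeek {
    static final class Header {
        final int version;
        final String keyClass;
        final String valueClass;
        final boolean compressed;
        final boolean blockCompressed;
        final String codec; // null when the file is not compressed

        Header(int version, String keyClass, String valueClass,
               boolean compressed, boolean blockCompressed, String codec) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
            this.compressed = compressed;
            this.blockCompressed = blockCompressed;
            this.codec = codec;
        }
    }

    // Assumption: the string length is a single-byte vint (0..127), true for typical class names.
    static String readShortString(ByteBuffer buf) {
        int len = buf.get();
        if (len < 0) {
            throw new IllegalArgumentException("multi-byte length not handled in this sketch");
        }
        byte[] raw = new byte[len];
        buf.get(raw);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Parses the first bytes of a file; a few hundred bytes are enough for the fields read here.
    static Header parse(byte[] firstBytes) {
        ByteBuffer buf = ByteBuffer.wrap(firstBytes);
        if (buf.get() != 'S' || buf.get() != 'E' || buf.get() != 'Q') {
            throw new IllegalArgumentException("missing SEQ magic, not a SequenceFile");
        }
        int version = buf.get();
        if (version < 5) {
            throw new IllegalArgumentException("this sketch only handles header versions 5 and 6");
        }
        String keyClass = readShortString(buf);
        String valueClass = readShortString(buf);
        boolean compressed = buf.get() != 0;
        boolean blockCompressed = buf.get() != 0;
        String codec = compressed ? readShortString(buf) : null;
        return new Header(version, keyClass, valueClass, compressed, blockCompressed, codec);
    }
}
```

To use it, read the first few hundred bytes of a local copy of the file into a `byte[]` and pass them to `SeqHeaderPeek.parse`; a non-null `codec` means the records are stored compressed.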

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-28 Thread Hardik Pandya
What is your compression format: gzip, lzo, or snappy? For lzo final output: FileOutputFormat.setCompressOutput(conf, true); FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class); In addition, to make LZO splittable, you need to make an LZO index file. On Thu, Mar 27, 2014 at 8:57 PM, Kim

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-27 Thread Kim Chew
Thanks, folks. I was not aware that my input data file had been compressed. FileOutputFormat.setCompressOutput() was set to true when the file was written. 8-( Kim On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead wrote: > The following might answer you partially: > > Input key is not read from HDFS, it is
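
A rough illustration of why a compressed input can produce this counter gap: if the job reads deflate-compressed records but writes them back uncompressed, the bytes read from disk can be far fewer than the bytes written. The sketch below uses the JDK's `java.util.zip.Deflater` as a stand-in for the zlib codec; the class `CompressionGap` and the sample record format are assumptions made up for the example and are not taken from the job in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class: compares raw and deflated sizes of repetitive records.
class CompressionGap {
    // Returns the number of bytes produced by zlib/deflate for the given input.
    static int deflatedSize(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk); // count output bytes, discard the data
        }
        deflater.end();
        return total;
    }

    // Builds n made-up timestamped records, similar in spirit to log lines.
    static byte[] sampleRecords(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("2014-03-27T11:43:00\tsensor-").append(i % 10)
              .append("\tvalue=").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleRecords(10_000);
        System.out.println("uncompressed bytes (what would be written): " + raw.length);
        System.out.println("deflated bytes (what would be read):        " + deflatedSize(raw));
    }
}
```

The actual ratio depends entirely on the data, so this only shows the direction of the effect, not the figures reported by the job counters.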

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-27 Thread Mostafa Ead
The following might answer you partially: The input key is not read from HDFS; it is auto-generated as the offset of the input value in the input file. I think that is (partially) why the HDFS bytes read are fewer than the HDFS bytes written. On Mar 27, 2014 1:34 PM, "Kim Chew" wrote: > I am also wonderin

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-27 Thread Kim Chew
I am also wondering what happens if, say, I have two identical timestamps, so they are going to be written to the same file. Does MultipleOutputs handle appending? Thanks. Kim On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen wrote: > Have you checked the content of the files you write? > > > /th > > On Thu,

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-27 Thread Kim Chew
Yea, gonna do that. 8-) Kim On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen wrote: > Have you checked the content of the files you write? > > > /th > > On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote: > > I have a simple M/R job using Mapper only thus no reducer. The mapper > > read a times

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-27 Thread Thomas Bentsen
Have you checked the content of the files you write? /th On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote: > I have a simple M/R job using Mapper only thus no reducer. The mapper > read a timestamp from the value, generate a path to the output file > and writes the key and value to the output f

Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-27 Thread Kim Chew
I have a simple M/R job using a Mapper only, thus no reducer. The mapper reads a timestamp from the value, generates a path to the output file, and writes the key and value to the output file. The input file is a sequence file, not compressed, and stored in HDFS; it has a size of 162.68 MB. Output a