Yea, gonna do that. 8-)

Kim

On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <t...@bentzn.com> wrote:
> Have you checked the content of the files you write?
>
> /th
>
> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> > I have a simple M/R job that uses a Mapper only, so there is no reducer.
> > The mapper reads a timestamp from the value, generates a path to the
> > output file, and writes the key and value to that output file.
> >
> > The input file is a sequence file, not compressed and stored in
> > HDFS. It has a size of 162.68 MB.
> >
> > The output is also written as a sequence file.
> >
> > However, after I ran my job, I have two output part files from the
> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
> > MB. So why is the total output size so much larger? Shouldn't it
> > be more or less equal to the input size of 162.68 MB, since I just
> > write the key and value passed to the mapper to the output?
> >
> > Here is the mapper code snippet,
> >
> > public void map(BytesWritable key, BytesWritable value, Context context)
> >         throws IOException, InterruptedException {
> >
> >     long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
> >     String tsStr = sdf.format(new Date(timestamp * 1000L));
> >
> >     mos.write(key, value, generateFileName(tsStr)); // mos is a MultipleOutputs object.
> > }
> >
> > private String generateFileName(String key) {
> >     return outputDir+"/"+key+"/raw-vectors";
> > }
> >
> > And here are the job outputs,
> >
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1111374798
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=166428672
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=38351872
> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1240104960
> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
> >
> > TIA,
> >
> > Kim
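
For reference, the counters quoted above can be cross-checked against the reported file sizes with a little arithmetic. The following is only a minimal sketch in plain Java with no Hadoop dependency; the class and method names (CounterCheck, toMiB, ratio) are hypothetical, and the numbers are copied from the job output in this thread. It shows how much larger the output is than the input, not why.

```java
public class CounterCheck {
    // Values copied from the JobClient output quoted in this thread.
    static final long BYTES_READ = 170782415L;          // File Input Format Counters: Bytes Read
    static final long HDFS_BYTES_WRITTEN = 1111374798L; // FileSystemCounters: HDFS_BYTES_WRITTEN
    static final long MAP_INPUT_RECORDS = 547L;         // Map-Reduce Framework: Map input records

    // Convert a byte count to megabytes, assuming 1 MB = 1024 * 1024 bytes.
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    // How many times larger the written byte count is than the read byte count.
    static double ratio(long written, long read) {
        return (double) written / read;
    }

    public static void main(String[] args) {
        System.out.printf("written: %.2f MB%n", toMiB(HDFS_BYTES_WRITTEN));
        System.out.printf("read:    %.2f MB%n", toMiB(BYTES_READ));
        System.out.printf("ratio:   %.2fx%n", ratio(HDFS_BYTES_WRITTEN, BYTES_READ));
        // Integer division: average bytes per input record on each side.
        System.out.println("avg bytes read per record:    " + BYTES_READ / MAP_INPUT_RECORDS);
        System.out.println("avg bytes written per record: " + HDFS_BYTES_WRITTEN / MAP_INPUT_RECORDS);
    }
}
```

HDFS_BYTES_WRITTEN works out to roughly 1059.89 MB, which agrees with the two part files (835.12 MB + 224.77 MB), so the output really is about 6.5 times the bytes read. Comparing the average bytes written per record with the expected record size may help narrow down what ends up in the files.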