Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
None of that. I checked the input file's SequenceFile header and it says "org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater".

Kim

On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya wrote: [...]
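A minimal, self-contained illustration of the thread's conclusion: the input was zlib (Deflater) compressed, so reading the deflated bytes and writing the records back out uncompressed can easily make the output files several times the input size (here, 162.68 MB in vs. roughly 1060 MB of part files out). The buffer below is synthetic; the real ratio depends entirely on how compressible the actual data is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

public class CompressionRatioDemo {
    // Deflate (zlib) a buffer in memory and return the compressed size.
    static int deflatedSize(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DeflaterOutputStream dos = new DeflaterOutputStream(bos);
        dos.write(raw);
        dos.close();
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = new byte[1 << 20];               // 1 MiB of repetitive "records"
        for (int i = 0; i < raw.length; i++) raw[i] = (byte) (i % 16);

        System.out.println("uncompressed bytes: " + raw.length);
        System.out.println("deflated bytes:     " + deflatedSize(raw));
        // Reading the deflated form while writing the raw form back out is
        // exactly what makes the output so much larger than the input.
    }
}
```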
Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
what is your compression format: gzip, lzo or snappy?

for lzo final output:

    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

In addition, to make LZO splittable, you need to create an LZO index file.

On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew wrote: [...]
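The same switch can also be flipped in the job configuration rather than in code. A hedged sketch only: the property names are from the classic mapred API, and the LzoCodec class shown ships with the separate hadoop-lzo package, not stock Hadoop.

```xml
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```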
Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
Thanks folks.

I was not aware my input data file had been compressed. FileOutputFormat.setCompressOutput() was set to true when the file was written. 8-(

Kim

On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead wrote: [...]
Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
The following might answer you partially:

The input key is not read from HDFS; it is auto-generated as the offset of the input value in the input file. I think that is (partially) why the HDFS bytes read are smaller than the HDFS bytes written.

On Mar 27, 2014 1:34 PM, "Kim Chew" wrote: [...]
Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
I am also wondering: if, say, I have two identical timestamps, they will be written to the same file. Does MultipleOutputs handle appending?

Thanks.

Kim

On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen wrote: [...]
Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
Yea, gonna do that. 8-)

Kim

On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen wrote: [...]
Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?
Have you checked the content of the files you write?

/th

On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> I have a simple M/R job using a Mapper only, thus no reducer. The mapper
> reads a timestamp from the value, generates a path to the output file,
> and writes the key and value to the output file.
>
> The input file is a sequence file, not compressed, stored in HDFS; it
> has a size of 162.68 MB. The output is also written as a sequence file.
>
> However, after I ran my job, I have two output part files from the
> mapper. One has a size of 835.12 MB and the other has a size of 224.77
> MB. So why is the total output size so much larger? Shouldn't it be
> more or less equal to the input's size of 162.68 MB, since I just
> write the key and value passed to the mapper to the output?
>
> Here is the mapper code snippet:
>
>     public void map(BytesWritable key, BytesWritable value, Context context)
>             throws IOException, InterruptedException {
>         long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
>         String tsStr = sdf.format(new Date(timestamp * 1000L));
>         mos.write(key, value, generateFileName(tsStr)); // mos is a MultipleOutputs object.
>     }
>
>     private String generateFileName(String key) {
>         return outputDir + "/" + key + "/raw-vectors";
>     }
>
> And here are the job outputs:
>
>     14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
>     14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
>     14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>     14/03/27 11:00:56 INFO mapred.JobClient: File Output Format Counters
>     14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
>     14/03/27 11:00:56 INFO mapred.JobClient: FileSystemCounters
>     14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
>     14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
>     14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=374798
>     14/03/27 11:00:56 INFO mapred.JobClient: File Input Format Counters
>     14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
>     14/03/27 11:00:56 INFO mapred.JobClient: Map-Reduce Framework
>     14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
>     14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes) snapshot=166428672
>     14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
>     14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap usage (bytes)=38351872
>     14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
>     14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1240104960
>     14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
>     14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
>
> TIA,
>
> Kim
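For readers who want to poke at the path logic outside Hadoop, here is a self-contained sketch of what the mapper above does per record: pull a 32-bit epoch-seconds timestamp out of the value bytes and turn it into a per-day output path. TIMESTAMP_INDEX, the byte order, the date pattern, and outputDir are all assumptions, since the original post shows none of them.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class FileNameSketch {
    static final int TIMESTAMP_INDEX = 0;    // assumed offset into the value bytes
    static final String outputDir = "/out";  // assumed base directory

    // A plausible reading of the mapper's bytesToInt(): a big-endian
    // 32-bit unsigned int starting at the given offset.
    static long bytesToInt(byte[] b, int off) {
        return ((b[off] & 0xFFL) << 24) | ((b[off + 1] & 0xFFL) << 16)
             | ((b[off + 2] & 0xFFL) << 8) | (b[off + 3] & 0xFFL);
    }

    static String generateFileName(byte[] value) {
        long timestamp = bytesToInt(value, TIMESTAMP_INDEX); // epoch seconds
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        String tsStr = sdf.format(new Date(timestamp * 1000L));
        return outputDir + "/" + tsStr + "/raw-vectors";
    }

    public static void main(String[] args) {
        byte[] value = {0x53, 0x34, (byte) 0xC5, 0x60}; // 1395967328 = 2014-03-28 UTC
        System.out.println(generateFileName(value));
    }
}
```

One consequence worth noting: because the path depends only on the formatted timestamp, every record with the same day lands in the same MultipleOutputs target, which is exactly the appending question raised earlier in the thread.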