Ruslan,

Thanks in advance for your reply. The jobs' statistics are as follows:
case 1: uncompressed data (none)

12/08/09 16:12:44 INFO mapred.JobClient: Job complete: job_201208021633_0049
12/08/09 16:12:44 INFO mapred.JobClient: Counters: 23
12/08/09 16:12:44 INFO mapred.JobClient:   Job Counters
12/08/09 16:12:44 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3623053
12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient:     Rack-local map tasks=1
12/08/09 16:12:44 INFO mapred.JobClient:     Launched map tasks=166
12/08/09 16:12:44 INFO mapred.JobClient:     Data-local map tasks=165
12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=220786
12/08/09 16:12:44 INFO mapred.JobClient:   FileSystemCounters
12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_READ=1852424288
12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_READ=10644581454
12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1894096220
12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/08/09 16:12:44 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input groups=13661
12/08/09 16:12:44 INFO mapred.JobClient:     Combine output records=69055428
12/08/09 16:12:44 INFO mapred.JobClient:     Map input records=158156100
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce shuffle bytes=33143186
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce output records=13661
12/08/09 16:12:44 INFO mapred.JobClient:     Spilled Records=122916251
12/08/09 16:12:44 INFO mapred.JobClient:     Map output bytes=15704921900
12/08/09 16:12:44 INFO mapred.JobClient:     Combine input records=1332132129
12/08/09 16:12:44 INFO mapred.JobClient:     Map output records=1265248800
12/08/09 16:12:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=19716
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input records=2172099

case 2: lzo

12/08/09 15:58:11 INFO mapred.JobClient: Job complete: job_201208021633_0048
12/08/09 15:58:11 INFO mapred.JobClient: Counters: 23
12/08/09 15:58:11 INFO mapred.JobClient:   Job Counters
12/08/09 15:58:11 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3361287
12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient:     Rack-local map tasks=4
12/08/09 15:58:11 INFO mapred.JobClient:     Launched map tasks=65
12/08/09 15:58:11 INFO mapred.JobClient:     Data-local map tasks=61
12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=183529
12/08/09 15:58:11 INFO mapred.JobClient:   FileSystemCounters
12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_READ=568178351
12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_READ=3860287251
12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=576095398
12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/08/09 15:58:11 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input groups=13661
12/08/09 15:58:11 INFO mapred.JobClient:     Combine output records=66734193
12/08/09 15:58:11 INFO mapred.JobClient:     Map input records=158156100
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce shuffle bytes=4752406
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce output records=13661
12/08/09 15:58:11 INFO mapred.JobClient:     Spilled Records=132612729
12/08/09 15:58:11 INFO mapred.JobClient:     Map output bytes=15704921900
12/08/09 15:58:11 INFO mapred.JobClient:     Combine input records=1331190655
12/08/09 15:58:11 INFO mapred.JobClient:     Map output records=1265248800
12/08/09 15:58:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=7366
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input records=792338

case 3: sequence file, block-level compressed by snappy

12/09/05 18:33:00 INFO mapred.JobClient: Job complete: job_201209051652_0008
12/09/05 18:33:00 INFO mapred.JobClient: Counters: 23
12/09/05 18:33:00 INFO mapred.JobClient:   Job Counters
12/09/05 18:33:00 INFO mapred.JobClient:     Launched reduce tasks=1
12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5885897
12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient:     Rack-local map tasks=2
12/09/05 18:33:00 INFO mapred.JobClient:     Launched map tasks=68
12/09/05 18:33:00 INFO mapred.JobClient:     Data-local map tasks=66
12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=1320075
12/09/05 18:33:00 INFO mapred.JobClient:   FileSystemCounters
12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_READ=3706936196
12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_READ=4419150507
12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4581439981
12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/09/05 18:33:00 INFO mapred.JobClient:   Map-Reduce Framework
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input groups=13661
12/09/05 18:33:00 INFO mapred.JobClient:     Combine output records=0
12/09/05 18:33:00 INFO mapred.JobClient:     Map input records=158156100
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce shuffle bytes=857964933
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce output records=13661
12/09/05 18:33:00 INFO mapred.JobClient:     Spilled Records=6232725043
12/09/05 18:33:00 INFO mapred.JobClient:     Map output bytes=15704921900
12/09/05 18:33:00 INFO mapred.JobClient:     Combine input records=0
12/09/05 18:33:00 INFO mapred.JobClient:     Map output records=1265248800
12/09/05 18:33:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=8382
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input records=1265248800

Regards,
Park

2012/9/7 Ruslan Al-Fakikh
<[email protected]>

> Hi,
>
> It would be interesting to see the jobs' statistics (counters).
>
> Thanks
>
> On Fri, Sep 7, 2012 at 3:25 AM, Young-Geun Park
> <[email protected]> wrote:
> > Hi, All
> >
> > I have tested which of Lzo and SequenceFile performs better for a BIG file.
> >
> > The file size is 10 GiB and a WordCount MR job is used.
> > The inputs to the WordCount MR job are an lzo file indexed by LzoIndexTool (lzo),
> > a sequence file compressed with block-level snappy (seq), and
> > the uncompressed original file (none).
> >
> > Map output is compressed in every case except the uncompressed one.
> > The MapReduce output is not compressed in any case.
> >
> > The following are the WordCount MR running times:
> >
> >   none   lzo    seq
> >   248s   243s   1410s
> >
> > - Test Environment
> >
> > OS: CentOS 5.6 (x64) (kernel = 2.6.18)
> > # of cores: 8 (CPU = Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
> > RAM: 18GB
> > Java version: 1.6.0_26
> > Hadoop version: CDH3U2
> > # of datanodes (tasktrackers): 8
> >
> > According to these results, the running time with SequenceFile is much greater
> > than with the others.
> > Before testing, I had expected the results for SequenceFile and
> > Lzo to be about the same.
> >
> > I want to know why the performance of the sequence file compressed by snappy
> > is so bad.
> >
> > Am I missing anything in these tests?
> >
> > Regards,
> > Park
>
> --
> Best Regards,
> Ruslan Al-Fakikh
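For easier side-by-side comparison, the key counters from the three logs above can be tabulated with a short script (a sketch; the numbers are copied verbatim from the posted logs, and the dictionary keys are my own shorthand). One detail visible in the data: in the seq run, Combine input records=0 and Reduce input records equals Map output records, i.e. the combiner did no work in that job, which is consistent with its far larger spill and shuffle counters.

```python
# Key counters copied from the three JobClient logs above
# (keys are shorthand labels, not Hadoop counter names).
counters = {
    "none": {"combine_input": 1332132129, "reduce_input": 2172099,
             "spilled": 122916251, "shuffle_bytes": 33143186,
             "map_output": 1265248800, "time_s": 248},
    "lzo":  {"combine_input": 1331190655, "reduce_input": 792338,
             "spilled": 132612729, "shuffle_bytes": 4752406,
             "map_output": 1265248800, "time_s": 243},
    "seq":  {"combine_input": 0, "reduce_input": 1265248800,
             "spilled": 6232725043, "shuffle_bytes": 857964933,
             "map_output": 1265248800, "time_s": 1410},
}

for name, c in counters.items():
    combined = c["combine_input"] > 0
    print(f"{name:5s} time={c['time_s']:>5d}s combiner_ran={combined!s:5s} "
          f"reduce_input={c['reduce_input']:>13,d} "
          f"spilled={c['spilled']:>13,d}")

# In the seq run the combiner processed nothing, so all
# 1,265,248,800 map output records reached the single reducer.
assert counters["seq"]["combine_input"] == 0
assert counters["seq"]["reduce_input"] == counters["seq"]["map_output"]
```

All three jobs produced the same Map output records (1,265,248,800) and the same final output, so the difference in runtime tracks the spill/shuffle volume rather than the map-side work.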
