Hello, Fiesr of all thank you very much for your help. :)
I still have some doubt with this . Is the highlighted metric “ Reduce shuffle bytes=3059” 1. Is the total bytes after the reduced phase. ( That is the output file which is written into HDFS) Or 2. Is this is the actual shuffled traffic which is exchanged between mappers and reducers before performing reducing ? Please clarify ! Thanks & Regards, Abdul Navaz From: Pramod Biligiri <pramodbilig...@gmail.com> Reply-To: <user@hadoop.apache.org> Date: Thursday, October 2, 2014 at 12:44 AM To: "zookeeper-u...@hadoop.apache.org" <user@hadoop.apache.org> Subject: Re: Hadoop shuffling traffic Hi Abdul, That is the right metric. You can take a look at this report we made on this earlier: http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in- hadoop-terasort Pramod On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <navaz....@gmail.com> wrote: > Hello, > > This is the portion of the output which is displayed on the console when I run > sample word count job. > > map 0% reduce 0% > > 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0% > > 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100% > > 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003 > > 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29 > > 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters > > 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1 > > 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511 > > 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces > waiting after reserving slots (ms)=0 > > 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps > waiting after reserving slots (ms)=0 > > 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1 > > 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1 > > 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193 > > 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters > > 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106 > > 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters > > 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059 > > 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601 > > 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400 > > 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106 > > 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters > > 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486 > > 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework > > 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized > bytes=3059 > > 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6 > > 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059 > > 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544 > > 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509 > > 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage > > > > I am trying to find the shuffling traffic that is total traffic generated when > mappers exchange their key values pair with the reducer. Is the highlighted > portion gives the shuffling traffic ? > > > Thanks & Regards, > > Abdul Navaz > Research Assistant > University of Houston Main Campus, Houston TX > Ph: 281-685-0388 <tel:281-685-0388> > > > > > On 9/26/14, 12:00 AM, "karthikeyan S" <karthispe...@gmail.com> wrote: > >> The reducer starts as soon as it has data available from any one of the >> mappers. >> The reducer keeps polling the AM and asks if any mapper has completed >> processing. If so it fetches data from that mapper. >> So it's not necessary for all the mappers of a task to complete for >> the reducer to start processing. >> >> When the reducers starts fetching the data from the mappers it prints >> that info in its syslog, from what I have seen. >> >> Thanks, >> Karthik >> >> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <jiangbinglo...@gmail.com> wrote: >>> see mapreduce.job.reduce.slowstart.completedmaps >>> It gives hint of when reduce tasks could kick off. >>> >>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <navaz....@gmail.com>: >>>> >>>> Hello, >>>> >>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running >>>> sample word count job on 1GB of file which is distributed among the HDFS. >>>> >>>> When I run the map reduce job, before even completing the mapping 100 % >>>> reduce starts. Say for eg map 40% reduce 10% etc. >>>> >>>> I would like to know when the shuffling traffic starts ? >>>> >>>> -> Is there any way to find out when exactly shuffling started ? Does it >>>> generate any syslog in the logs . >>>> -> How to find the total amount of shuffling traffic? >>>> >>>> >>>> >>>> Thanks & Regards, >>>> >>>> Abdul Navaz >>>> Research Assistant >>>> University of Houston Main Campus, Houston TX >>>> Ph: 281-685-0388 <tel:281-685-0388> >>>> >>> >>> >>> >>> -- >>> Bing Jiang >>> Tel:(86)134-2619-1361 >>> weibo: http://weibo.com/jiangbinglover >>> BLOG: www.binospace.com <http://www.binospace.com> >>> BLOG: http://blog.sina.com.cn/jiangbinglover >>> Focus on distributed computing, HDFS/HBase >>