Can't attach the PDF file that shows the different maps. The file is too big.
From: Niels Basjes <ni...@basjes.nl>
To: common-user@hadoop.apache.org; Raj V <rajv...@yahoo.com>
Cc:
Sent: Tuesday, January 11, 2011 11:07:08 AM
Subject: Re: TeraSort question.

Raj,

Have a look at the graph shown here:
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1.1_--_Generating_Task_Timelines

It should make clear that the number of tasks varies greatly over the lifetime of a job. Depending on the nodes available, this may leave nodes idle.

Niels

2011/1/11 Raj V <rajv...@yahoo.com>:
> Ted
>
> Thanks. I have all the graphs I need, including the map-reduce timeline and
> the system activity for all the nodes while the sort was running. I will
> publish them once I have them in some presentable format.
>
> For legal reasons, I really don't want to send the complete job history
> files.
>
> My question is still this: when running terasort, would the CPU, disk and
> network utilization of all the nodes be more or less similar, or completely
> different?
>
> Sometime during the day, I will post the system data from 5 nodes, and that
> would probably explain my question better.
>
> Raj
>
> From: Ted Dunning <tdunn...@maprtech.com>
> To: common-user@hadoop.apache.org; Raj V <rajv...@yahoo.com>
> Cc:
> Sent: Tuesday, January 11, 2011 8:22:17 AM
> Subject: Re: TeraSort question.
>
> Raj,
>
> Do you have the job history files? That would be very useful. I would be
> happy to create some swimlane and related graphs for you if you can send me
> the history files.
>
> On Mon, Jan 10, 2011 at 9:06 PM, Raj V <rajv...@yahoo.com> wrote:
>
>> All,
>>
>> I have been running terasort on a 480-node Hadoop cluster. I have also
>> collected CPU, memory, disk and network statistics during this run. The
>> system stats are quite interesting. I can post them once I have put them
>> together in some presentable format (if there is interest). However, while
>> looking at the data, I noticed something interesting.
>>
>> I thought, intuitively, that all the systems in the cluster would have
>> more or less similar behaviour (time translation was possible), but the
>> overall graph would look the same.
>>
>> Just to confirm it, I took 5 random nodes and looked at the CPU, disk,
>> network etc. activity while the sort was running. Strangely enough, it was
>> not so. Two of the 5 systems were seriously busy: big I/O with lots of disk
>> and network activity. On the other three systems, the CPU was more or less
>> 100% idle, with slight network and I/O activity.
>>
>> Is that normal and/or expected? Shouldn't all the nodes be utilized in more
>> or less the same manner over the length of the run?
>>
>> I generated the data for the sort using teragen (128MB block size,
>> replication = 3).
>>
>> I would also be interested in other people's timings of sort. Is there some
>> place where people can post sort numbers (not just the record)?
>>
>> I will post the actual graphs of the 5 nodes, if there is interest,
>> tomorrow. (Some logistical issues about posting them tonight.)
>>
>> I am using CDH3B3, even though I think this is not specific to CDH3B3.
>>
>> Sorry for the cross post.
>>
>> Raj

--
With kind regards,
Niels Basjes