Hi Dan,

You might want to post your Pig script to the Pig user mailing list. I've previously done some experiments with Pig and Hive, and I'd also be interested in looking into your script.
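In the meantime, since your jobs go through Pig: the counters in your mail suggest each map is spilling more than once, and you can often get back to a single spill by giving the map-side sort buffer more room. A rough sketch, set directly from a Pig script (these are the stock Hadoop 1.x property names; the values are only starting points, not tuned for your job):

    -- give each map a larger in-memory sort buffer so it can spill once
    set io.sort.mb 256;
    -- let the buffer fill further before the background spill kicks in
    set io.sort.spill.percent 0.90;

Whether a larger buffer actually helps depends on how much heap the map tasks have (io.sort.mb is allocated out of the heap set by mapred.child.java.opts), so it's worth watching the task logs for memory errors after raising it.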
Yeah, Starfish currently only supports Hadoop job-level tuning; supporting workflows such as Pig and Hive is our top priority. We'll let you know once we're ready.

Thanks,
Jie

On Tue, Feb 28, 2012 at 11:57 AM, Daniel Baptista <
daniel.bapti...@performgroup.com> wrote:

> Hi Jie,
>
> To be honest I don't think I understand enough of what our job is doing to
> be able to explain it.
>
> Thanks for the response though, I had figured that I was grasping at
> straws.
>
> I have looked at Starfish, however all our jobs are submitted via Apache
> Pig, so I don't know if it would be much good.
>
> Thanks again, Dan.
>
> -----Original Message-----
> From: Jie Li [mailto:ji...@cs.duke.edu]
> Sent: 28 February 2012 16:35
> To: common-user@hadoop.apache.org
> Subject: Re: Spilled Records
>
> Hello Dan,
>
> The fact that the spilled records are double the output records means the
> map task produces more than one spill file; these spill files are then
> read, merged, and written to a single file, so each record is spilled
> twice.
>
> I can't infer anything from the numbers of the two tasks alone. Could you
> provide more info, such as what the application is doing?
>
> If you like, you can also try our tool Starfish to see what's going on
> behind the scenes.
>
> Thanks,
> Jie
> ------------------
> Starfish is an intelligent performance tuning tool for Hadoop.
> Homepage: www.cs.duke.edu/starfish/
> Mailing list: http://groups.google.com/group/hadoop-starfish
>
>
> On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista <
> daniel.bapti...@performgroup.com> wrote:
>
> > Hi All,
> >
> > I am trying to improve the performance of my Hadoop cluster and would
> > like to get some feedback on a couple of numbers that I am seeing.
> >
> > Below is the output from a single task (1 of 16) that took 3 minutes
> > 40 seconds:
> >
> > FileSystemCounters
> >   FILE_BYTES_READ       214,653,748
> >   HDFS_BYTES_READ        67,108,864
> >   FILE_BYTES_WRITTEN    429,278,388
> >
> > Map-Reduce Framework
> >   Combine output records          0
> >   Map input records       2,221,478
> >   Spilled Records         4,442,956
> >   Map output bytes      210,196,148
> >   Combine input records           0
> >   Map output records      2,221,478
> >
> > And another task in the same job (16 of 16) that took 7 minutes
> > 19 seconds:
> >
> > FileSystemCounters
> >   FILE_BYTES_READ       199,003,192
> >   HDFS_BYTES_READ        58,434,476
> >   FILE_BYTES_WRITTEN    397,975,310
> >
> > Map-Reduce Framework
> >   Combine output records          0
> >   Map input records       2,086,789
> >   Spilled Records         4,173,578
> >   Map output bytes      194,813,958
> >   Combine input records           0
> >   Map output records      2,086,789
> >
> > Can anybody determine anything from these figures?
> >
> > The first task is twice as quick as the second, yet the input and output
> > are comparable (certainly not double). In all of the tasks (in this and
> > other jobs) the spilled records are always double the output records;
> > surely this can't be 'normal'?
> >
> > Am I clutching at straws (it feels like I am)?
> >
> > Thanks in advance, Dan.