That's exactly what I am looking for -- whether the longer-running reduce tasks
have a lot of lines with the SpillableMemoryManager info while the ones that run
relatively quickly do not, and what those used/committed/max numbers are.

-D

2010/3/11 Winkler, Robert (Civ, ARL/CISD) <robert.wink...@us.army.mil>

> Classification: UNCLASSIFIED
> Caveats: NONE
>
> Not sure I can figure out the correlation you're looking for, but I'll try.
> I do note that when the reduce tasks appear to stall, the log's last entries
> look something like this:
>
> INFO: org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3
> segments left of total size: 363426183 bytes
> INFO: org.apache.pig.impl.util.SpillableMemoryManager: low memory handler
> called (Usage threshold exceeded) init=5439488 (5312K) used=195762912
> (191174K) committed=225574912 (220288K) max=279642112 (273088K)
>
> Sometimes the last message is repeated multiple times, sometimes not.
>
> Thanks,
> Robert
>
>
> -----Original Message-----
> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
> Sent: Thursday, March 11, 2010 2:50 PM
> To: pig-user@hadoop.apache.org
> Subject: Re: Reducers slowing down? (UNCLASSIFIED)
>
> Can you check the task logs and see how the number of databag spills to
> disk
> correlates with the number of tuples/bytes processed and the time a task
> took? Sounds like there is some terrible skew going on, although cross
> really shouldn't have that problem if it does what I think it should be
> doing (which is probably wrong, I never use cross).
>
> -D
>
> 2010/3/11 Winkler, Robert (Civ, ARL/CISD) <robert.wink...@us.army.mil>
>
> > Classification: UNCLASSIFIED
> > Caveats: NONE
> >
> > Yeah, that didn't work either. Ran for 3 days and then failed because
> > of "too many fetch failures". It seems to get about 2/3 of the way
> > through the reducers (regardless of the number) reasonably quickly and
> > then just stalls or fails.
> >
> > Anyway, I changed the script to SPLIT the People dataset into 26
> > subsets based on whether the first character matched a-z and crossed
> > each of those subsets with the Actors relation. This resulted in 26
> > separate Pig jobs running in parallel (I went back to PARALLEL 30 so each
> > had 30 reducers).
> >
> > That worked. The shortest job took 53 minutes and the longest 22.5 hours.
> > But I'm not sure what to make of this other than I shouldn't try to
> > process a 500,000,000,000-tuple relation.
> >
> > -- Register CMU’s SecondString
> > REGISTER /home/arl/Desktop/ARLDeveloper/JavaCOTS/SecondString/secondstring-20060615.jar;
> > -- Register ARL’s UDF SecondString Wrapper
> > REGISTER /home/arl/Desktop/ARLDeveloper/JavaComponents/INSCOM/CandidateIdentification.jar;
> > -- |People| ~ 62,500,000
> > People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS (file:chararray, name:chararray);
> > -- Split People based on first character
> > SPLIT People into A IF name MATCHES '^[a|A].*', … , Z IF name MATCHES '^[z|Z].*';
> > -- |Actors| ~ 8,000
> > Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);
> > -- Process each split in parallel
> > ToCompareA = CROSS Actors, A PARALLEL 30;
> > AResults = FOREACH ToCompareA GENERATE $0, $1, $2, ARL.CandidateIdentificationUDF.Similarity($2, $0);
> > STORE AResults INTO '/data/ScoredPeople/A' USING PigStorage(',');
> > …
> > ToCompareZ = CROSS Actors, Z PARALLEL 30;
> > ZResults = FOREACH ToCompareZ GENERATE $0, $1, $2, ARL.CandidateIdentificationUDF.Similarity($2, $0);
> > STORE ZResults INTO '/data/ScoredPeople/Z' USING PigStorage(',');
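> >
> > A possible alternative that is only a sketch and was not tried in this
> > thread: since Actors is only ~8,000 rows, a map-side fragment-replicate
> > join on a projected constant key should compute the same cross product
> > with no reduce phase to stall. This assumes the Actors relation fits in a
> > map task's memory; the ActorsK/PeopleK aliases and the output path are
> > illustrative, and I have not benchmarked it.
> >
> > ActorsK = FOREACH Actors GENERATE 1 AS k, actor;
> > PeopleK = FOREACH People GENERATE 1 AS k, file, name;
> > -- the small, replicated relation must be listed last
> > ToCompareFR = JOIN PeopleK BY k, ActorsK BY k USING 'replicated';
> > FRResults = FOREACH ToCompareFR GENERATE ActorsK::actor, PeopleK::file, PeopleK::name,
> >             ARL.CandidateIdentificationUDF.Similarity(PeopleK::name, ActorsK::actor);
> > STORE FRResults INTO '/data/ScoredPeopleFR' USING PigStorage(',');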
> >
> > -----Original Message-----
> > From: Mridul Muralidharan [mailto:mrid...@yahoo-inc.com]
> > Sent: Friday, March 05, 2010 9:39 PM
> > To: pig-user@hadoop.apache.org
> > Cc: Thejas Nair; Winkler, Robert (Civ, ARL/CISD)
> > Subject: Re: Reducers slowing down? (UNCLASSIFIED)
> >
> > On Saturday 06 March 2010 04:47 AM, Thejas Nair wrote:
> > > I am not sure why the rate at which output is generated is slowing down.
> > > But cross in pig is not optimized - it uses only one reducer. (A major
> > > limitation if you are trying to process lots of data with a large cluster!)
> >
> >
> > CROSS is not supposed to use a single reducer - GRCross is parallel in
> > pig, last time we checked (a while back though).
> > The fact that it is parallel does not mean it is not expensive; it is
> > still pretty darn expensive.
> >
> > Given this, the suggestion below might not work?
> >
> >
> > Robert, what about using a higher value of PARALLEL for CROSS? (Much
> > higher than the number of nodes, if required.)
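> >
> > For instance, a one-line sketch of that change (the value 270 is only an
> > illustration of "much higher than the number of nodes"; nothing else in
> > the original script changes):
> >
> > ToCompare = CROSS Actors, People PARALLEL 270;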
> >
> > Regards,
> > Mridul
> >
> > >
> > > You can try using skewed join instead - project a constant in both
> > > streams and then join on that.
> > >
> > >
> > > ToCompare = join Actors by 1, People by 1 using 'skewed' PARALLEL 30;
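> > >
> > > Spelled out with the constant actually projected into both streams (a
> > > sketch only -- I have not run this, and the ActorsK/PeopleK aliases are
> > > illustrative):
> > >
> > > ActorsK = FOREACH Actors GENERATE 1 AS k, actor;
> > > PeopleK = FOREACH People GENERATE 1 AS k, file, name;
> > > ToCompare = JOIN ActorsK BY k, PeopleK BY k USING 'skewed' PARALLEL 30;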
> > >
> > > I haven't tried this on a very large dataset; I am interested in knowing
> > > how this compares if you try it out.
> > >
> > > -Thejas
> > >
> > >
> > >
> > >
> > > On 3/5/10 9:48 AM, "Winkler, Robert  (Civ, ARL/CISD)"
> > > <robert.wink...@us.army.mil>  wrote:
> > >
> > >> Classification: UNCLASSIFIED
> > >>
> > >> Caveats: NONE
> > >>
> > >> Hello, I'm using Pig 0.6.0, running the following script on a
> > >> 27-datanode cluster running RedHat Enterprise 5.4:
> > >>
> > >>   -- Holds the Pig UDF wrapper around the SecondString SoftTFIDF function
> > >>
> > >> REGISTER /home/CandidateIdentification.jar;
> > >>
> > >> -- SecondString itself
> > >>
> > >> REGISTER /home/secondstring-20060615.jar;
> > >>
> > >> -- |People| ~ 62,500,000 from the English GigaWord 4th Edition
> > >>
> > >> People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
> > >> (file:chararray, name:chararray);
> > >>
> > >> -- |Actors| ~ 8,000 from the Stanford Movie Database
> > >>
> > >> Actors = LOAD '/data/Actors' USING PigStorage(',') AS
> > >> (actor:chararray);
> > >>
> > >> -- |ToCompare| ~ 500,000,000,000
> > >>
> > >> ToCompare = CROSS Actors, People PARALLEL 30;
> > >>
> > >>
> > >>
> > >> -- Score 'em and store 'em
> > >>
> > >> Results = FOREACH ToCompare GENERATE $0, $1, $2,
> > >> ARL.CandidateIdentificationUDF.Similarity($2, $0);
> > >>
> > >> STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');
> > >>
> > >> The first 100,000,000,000 reduce output records were produced in some
> > >> 25 hours. But after 75 hours it has produced a total of 140,000,000,000
> > >> (instead of the 300,000,000,000 I was extrapolating) and seems to be
> > >> producing them at a slower and slower rate. What is going on? Did I
> > >> screw something up?
> > >>
> > >> Thanks,
> > >>
> > >> Robert
> > >>
> > >> Classification: UNCLASSIFIED
> > >>
> > >> Caveats: NONE
> > >>
> > >>
> > >
> > >
> >
> > Classification: UNCLASSIFIED
> > Caveats: NONE
> >
> >
> >
> Classification: UNCLASSIFIED
> Caveats: NONE
>
>
>
