Right, I meant the values of those numbers when spills happen :-)

2010/3/11 Richard Ding <rd...@yahoo-inc.com>:
These numbers are documented at
http://java.sun.com/javase/6/docs/api/java/lang/management/MemoryUsage.html

Thanks,
-Richard

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Thursday, March 11, 2010 1:40 PM
To: pig-user@hadoop.apache.org
Subject: Re: Reducers slowing down? (UNCLASSIFIED)

That's exactly what I am looking for -- whether the longer-running reduce
tasks have a lot of lines with the SpillableMemoryManager info while the
ones that run relatively quickly do not, and what those used/committed/max
numbers are.

-D

2010/3/11 Winkler, Robert (Civ, ARL/CISD) <robert.wink...@us.army.mil>:

I'm not sure I can figure out the correlation you're looking for, but I'll
try. I do note that when the reduce tasks appear to stall, the log's last
entries look something like this:

INFO: org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3
segments left of total size: 363426183 bytes
INFO: org.apache.pig.impl.util.SpillableMemoryManager: low memory handler
called (Usage threshold exceeded) init=5439488 (5312K) used=195762912
(191174K) committed=225574912 (220288K) max=279642112 (273088K)

Sometimes the last message is repeated multiple times, sometimes not.

Thanks,
Robert
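The init/used/committed/max figures in that log line are the MemoryUsage
fields from the link above, reported when Pig's low-memory handler fires.
If the slow tasks do turn out to be spill-bound, the thresholds that drive
SpillableMemoryManager are configurable. A minimal sketch, assuming Pig's
standard pig.spill.size.threshold and pig.spill.gc.activation.size
properties; the values below are purely illustrative, not recommendations:

    # pig.properties (or -Dname=value on the pig command line).
    # Both values are in bytes; numbers here are illustrative only.
    # Collections below this size are not proactively spilled to disk.
    pig.spill.size.threshold=100000000
    # Size above which a GC is requested as part of handling the spill.
    pig.spill.gc.activation.size=500000000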
-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Thursday, March 11, 2010 2:50 PM
To: pig-user@hadoop.apache.org
Subject: Re: Reducers slowing down? (UNCLASSIFIED)

Can you check the task logs and see how the number of databag spills to
disk correlates with the number of tuples/bytes processed and the time a
task took? It sounds like there is some terrible skew going on, although
cross really shouldn't have that problem if it does what I think it should
be doing (which is probably wrong, I never use cross).

-D

2010/3/11 Winkler, Robert (Civ, ARL/CISD) <robert.wink...@us.army.mil>:

Yeah, that didn't work either. It ran for 3 days and then failed because of
"too many fetch failures". It seems to get about 2/3 of the way through the
reducers (regardless of their number) reasonably quickly and then just
stalls or fails.

Anyway, I changed the script to SPLIT the People dataset into 26 subsets
based on whether the first character matched a-z, and crossed each of those
subsets with the Actors relation. This resulted in 26 separate Pig jobs
running in parallel (I went back to PARALLEL 30, so each had 30 reducers).

That worked. The shortest job took 53 minutes and the longest 22.5 hours.
But I'm not sure what to make of this, other than that I shouldn't try to
process a 500,000,000,000-tuple relation.

-- Register CMU's SecondString
REGISTER /home/arl/Desktop/ARLDeveloper/JavaCOTS/SecondString/secondstring-20060615.jar;

-- Register ARL's UDF SecondString wrapper
REGISTER /home/arl/Desktop/ARLDeveloper/JavaComponents/INSCOM/CandidateIdentification.jar;

-- |People| ~ 62,500,000
People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
    (file:chararray, name:chararray);

-- Split People based on first character
SPLIT People INTO
    A IF name MATCHES '^[a|A].*',
    ...,
    Z IF name MATCHES '^[z|Z].*';

-- |Actors| ~ 8,000
Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);

-- Process each split in parallel
ToCompareA = CROSS Actors, A PARALLEL 30;
AResults = FOREACH ToCompareA GENERATE $0, $1, $2,
    ARL.CandidateIdentificationUDF.Similarity($2, $0);
STORE AResults INTO '/data/ScoredPeople/A' USING PigStorage(',');
...
ToCompareZ = CROSS Actors, Z PARALLEL 30;
ZResults = FOREACH ToCompareZ GENERATE $0, $1, $2,
    ARL.CandidateIdentificationUDF.Similarity($2, $0);
STORE ZResults INTO '/data/ScoredPeople/Z' USING PigStorage(',');

-----Original Message-----
From: Mridul Muralidharan [mailto:mrid...@yahoo-inc.com]
Sent: Friday, March 05, 2010 9:39 PM
To: pig-user@hadoop.apache.org
Cc: Thejas Nair; Winkler, Robert (Civ, ARL/CISD)
Subject: Re: Reducers slowing down? (UNCLASSIFIED)

On Saturday 06 March 2010 04:47 AM, Thejas Nair wrote:
> I am not sure why the rate at which output is generated is slowing down.
> But cross in pig is not optimized: it uses only one reducer (a major
> limitation if you are trying to process lots of data with a large
> cluster!).

CROSS is not supposed to use a single reducer; GRCross is parallel in Pig,
last time we checked (a while back, though). That it is parallel does not
mean it is cheap: it is still pretty darn expensive.

Given this, the suggestion below might not work?

Robert, what about using a higher value of PARALLEL for CROSS (much higher
than the number of nodes, if required)?

Regards,
Mridul

> You can try using a skewed join instead: project a constant in both
> streams and then join on that.
>
> ToCompare = JOIN Actors BY 1, People BY 1 USING 'skewed' PARALLEL 30;
>
> I haven't tried this on a very large dataset; I am interested in knowing
> how this compares if you try it out.
>
> -Thejas
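A minimal sketch of the "project a constant in both streams" variant of
Thejas's suggestion (the k alias and the projected schemas are illustrative,
not from the thread):

    -- Give every row the same literal key, then join on it. Aliases here
    -- are illustrative.
    Actors2   = FOREACH Actors GENERATE 1 AS k, actor;
    People2   = FOREACH People GENERATE 1 AS k, file, name;
    ToCompare = JOIN Actors2 BY k, People2 BY k USING 'skewed' PARALLEL 30;

The point of 'skewed' here is that a constant key is the worst case for a
regular hash join, since every tuple lands on one reducer, while the skewed
join samples the key distribution and spreads an oversized key across
several reducers.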
On 3/5/10 9:48 AM, "Winkler, Robert (Civ, ARL/CISD)"
<robert.wink...@us.army.mil> wrote:

Hello, I'm using Pig 0.6.0, running the following script on a 27-datanode
cluster running RedHat Enterprise 5.4:

-- Holds the Pig UDF wrapper around the SecondString SoftTFIDF function
REGISTER /home/CandidateIdentification.jar;

-- SecondString itself
REGISTER /home/secondstring-20060615.jar;

-- |People| ~ 62,500,000, from the English GigaWord 4th Edition
People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
    (file:chararray, name:chararray);

-- |Actors| ~ 8,000, from the Stanford Movie Database
Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);

-- |ToCompare| ~ 500,000,000,000
ToCompare = CROSS Actors, People PARALLEL 30;

-- Score 'em and store 'em
Results = FOREACH ToCompare GENERATE $0, $1, $2,
    ARL.CandidateIdentificationUDF.Similarity($2, $0);
STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');

The first 100,000,000,000 reduce output records were produced in some 25
hours. But after 75 hours it has produced a total of 140,000,000,000
(instead of the 300,000,000,000 I was extrapolating), and it seems to be
producing them at a slower and slower rate. What is going on? Did I screw
something up?

Thanks,
Robert
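One option the thread does not raise: since Actors is only ~8,000 rows, the
same constant-key trick can also be run through Pig's fragment-replicate
join, which happens map-side and skips the skewed reduce phase entirely. A
hedged sketch only (aliases are illustrative; the last-listed input is the
one replicated and must fit in each task's memory):

    -- Map-side alternative: replicate the small Actors relation to every
    -- map task instead of shuffling the full cross product through
    -- reducers. Aliases here are illustrative, not from the thread.
    Actors2   = FOREACH Actors GENERATE 1 AS k, actor;
    People2   = FOREACH People GENERATE 1 AS k, file, name;
    ToCompare = JOIN People2 BY k, Actors2 BY k USING 'replicated';
    Results   = FOREACH ToCompare GENERATE People2::file, People2::name,
        Actors2::actor,
        ARL.CandidateIdentificationUDF.Similarity(People2::name, Actors2::actor);
    STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');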