Hi Serega,

Default task attempts = 4 --> Yes, 4 task attempts
Do you use any "balancing" properties, for example pig.exec.reducers.bytes.per.reducer --> No
I suppose you have unbalanced data --> I guess so
It's better to provide logs --> Unfortunately not possible any more ("may be cleaned up by Task Tracker, if older logs")

Regards,
Nebo
________________________________________
From: Serega Sheypak [serega.shey...@gmail.com]
Sent: Friday, January 10, 2014 2:32 PM
To: user@pig.apache.org
Subject: Re: Spilling issue - Optimize "GROUP BY"

"and after trying it on several datanodes in the end it fails"
Default task attempts = 4?

1. It's better to provide logs
2. Do you use any "balancing" properties, for example pig.exec.reducers.bytes.per.reducer? I suppose you have unbalanced data

2014/1/10 Zebeljan, Nebojsa <nebojsa.zebel...@adtech.com>

> Hi,
> I'm encountering spilling issues with a "simple" Pig script. All map tasks
> and reducers succeed pretty fast except the last reducer!
> The last reducer always starts spilling after ~10 mins, and after retrying
> on several datanodes it eventually fails.
>
> Do you have any idea how I could optimize the GROUP BY so I don't run
> into spilling issues?
>
> Thanks in advance!
>
> Below the pig script:
> ###
> dataImport = LOAD <some data>;
> generatedData = FOREACH dataImport GENERATE Field_A, Field_B, Field_C;
> groupedData = GROUP generatedData BY (Field_B, Field_C);
>
> result = FOREACH groupedData {
>     counter_1 = FILTER generatedData BY <some fields>;
>     counter_2 = FILTER generatedData BY <some fields>;
>     GENERATE
>         group.Field_B,
>         group.Field_C,
>         COUNT(counter_1),
>         COUNT(counter_2);
> }
>
> STORE result INTO <some path> USING PigStorage();
> ###
>
> Regards,
> Nebo
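
[Editor's note: a possible rewrite of the script above, sketched from the thread's suggestions. It is untested against the original data: `<condition_1>` and `<condition_2>` are hypothetical placeholders for the elided `<some fields>` predicates, and the bytes-per-reducer value is only an illustrative setting, not a recommendation. The idea is to replace the nested FILTER + COUNT (which forces every record of a group through one reducer) with 0/1 indicator columns summed per group; SUM is algebraic, so Pig can pre-aggregate in the combiner and far less data reaches the skewed "last reducer".]

```pig
-- Sketch only: placeholders in <angle brackets> must be filled in.
-- Example value; lowering pig.exec.reducers.bytes.per.reducer spreads
-- the data over more reducers, which can help with skewed keys.
SET pig.exec.reducers.bytes.per.reducer 256000000;

dataImport = LOAD <some data>;

-- Turn each filter predicate into a 0/1 flag per record,
-- instead of filtering inside the grouped bag later.
flagged = FOREACH dataImport GENERATE
    Field_B,
    Field_C,
    ((<condition_1>) ? 1 : 0) AS c1:int,
    ((<condition_2>) ? 1 : 0) AS c2:int;

groupedData = GROUP flagged BY (Field_B, Field_C);

-- SUM over the flags gives the same counts as COUNT(FILTER ...),
-- but is combiner-friendly, so reducer input shrinks sharply.
result = FOREACH groupedData GENERATE
    group.Field_B,
    group.Field_C,
    SUM(flagged.c1) AS count_1,
    SUM(flagged.c2) AS count_2;

STORE result INTO <some path> USING PigStorage();
```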