Seems like big data is still a headache for Pig. Here's a mail archive thread I found: http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[email protected]%3E
I've tried every way I can think of, and none works. I think I have to play some tricks inside the Pig source code.

Haitao Yao
[email protected]
weibo: @haitao_yao
Skype: haitao.yao.final

On 2012-7-9, at 2:18 PM, Haitao Yao wrote:

> There's also another cause of the OOM: I group the data by all, and the
> parallelism is 1. With a big data bag, the reducer OOMs.
>
> After digging into the Pig source code, I found that replacing the data
> bag in BinSedesTuple is quite tricky and may cause other unknown
> problems...
>
> Has anybody else encountered the same problem?
>
> Haitao Yao
> [email protected]
> weibo: @haitao_yao
> Skype: haitao.yao.final
>
> On 2012-7-9, at 11:11 AM, Haitao Yao wrote:
>
>> Sorry for the improper statement.
>> The problem is the DataBag. BinSedesTuple reads the full data of the
>> DataBag, and when COUNT is applied to the data, it causes an OOM.
>> The diagrams also show that most of the objects come from the ArrayList.
>>
>> I want to reimplement the DataBag read by BinSedesTuple so that it just
>> holds a reference to the data input and reads the data one by one when
>> an iterator is used to access it.
>>
>> I will give it a shot.
>>
>> Haitao Yao
>> [email protected]
>> weibo: @haitao_yao
>> Skype: haitao.yao.final
>>
>> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>>
>>> BinSedesTuple is just the tuple; changing it won't do anything about the
>>> fact that lots of tuples are being loaded.
>>>
>>> The snippet you provided will not load all the data for computation, since
>>> COUNT implements the Algebraic interface (partial counts will be done on
>>> combiners).
>>>
>>> Something else is causing tuples to be materialized. Are you using other
>>> UDFs? Can you provide more details on the script? When you run "explain" on
>>> "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc.?
>>>
>>> You can check the "pig.alias" property in the jobconf to identify which
>>> relations are being calculated by a given MR job; that might help narrow
>>> things down.
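To make Dmitriy's point about the Algebraic interface concrete: an algebraic aggregate is computed in three phases (Initial on the map side, Intermediate on combiners, Final on the reducer), so the reducer only ever sees partial counts, never the raw bag. The sketch below is a standalone illustration of that pattern, assuming nothing from Pig — the class and method names are my own, not Pig's actual UDF API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// A standalone sketch of the three-phase aggregation pattern behind an
// algebraic COUNT (Initial -> Intermediate -> Final). Names here are
// illustrative, not Pig's real API; the point is that each phase only
// sees per-chunk partial counts, so no phase needs the whole bag in
// memory at once.
public class AlgebraicCountSketch {

    // Initial: runs map-side on each input tuple, emitting a partial count of 1.
    static long initial(Object tuple) {
        return 1L;
    }

    // Intermediate: runs on the combiner, summing partial counts.
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) sum += p;
        return sum;
    }

    // Final: runs on the reducer over combiner outputs, not raw tuples.
    static long fin(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // Simulate two map tasks, each followed by its own combiner.
        List<Object> split1 = Arrays.asList("a", "b", "c");
        List<Object> split2 = Arrays.asList("d", "e");

        long combined1 = intermediate(
                split1.stream().map(AlgebraicCountSketch::initial).collect(Collectors.toList()));
        long combined2 = intermediate(
                split2.stream().map(AlgebraicCountSketch::initial).collect(Collectors.toList()));

        // The reducer only sees the partials [3, 2], never the five raw tuples.
        long total = fin(Arrays.asList(combined1, combined2));
        System.out.println(total); // prints 5
    }
}
```

This is also why Dmitriy suggests checking "explain" for COUNT$Intermediate / COUNT$Final: if those stages appear, Pig is running the combiner path and the bag should not be materialized on the reducer.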
>>>
>>> -Dmitriy
>>>
>>>
>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>>> Hi,
>>> I wrote a Pig script in which one of the reducers always OOMs no matter
>>> how I change the parallelism.
>>> Here's the script snippet:
>>>
>>> Data = group SourceData all;
>>> Result = foreach Data generate group, COUNT(SourceData);
>>> store Result into 'XX';
>>>
>>> I analyzed the dumped Java heap and found that the reason is that the
>>> reducer loads all the data for the foreach and count.
>>>
>>> Can I re-implement BinSedesTuple to avoid the reducers loading all the
>>> data for computation?
>>>
>>> Here's the object dominator tree:
>>>
>>> Here's the jmap result:
>>>
>>> Haitao Yao
>>> [email protected]
>>> weibo: @haitao_yao
>>> Skype: haitao.yao.final
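Haitao's proposed fix — a DataBag that holds only a reference to the data input and decodes tuples one at a time through its iterator — can be sketched as below. This is a minimal illustration of the lazy-iterator idea, assuming nothing from Pig: StreamingBag and TupleSource are hypothetical names, and Pig's real mitigation for oversized bags is its spillable bag implementations rather than this exact design.

```java
import java.util.Iterator;

// Sketch of the lazy-bag idea from the thread: instead of an
// ArrayList-backed bag that materializes every tuple up front, the bag
// keeps only a cursor over its backing input and hands out tuples one
// at a time through an iterator. Names are hypothetical, not Pig's API.
public class StreamingBagSketch {

    // Stands in for a data input positioned at a serialized bag: it knows
    // how many tuples follow and can decode them one by one.
    interface TupleSource {
        long size();
        Object readNext();
    }

    static class StreamingBag implements Iterable<Object> {
        private final TupleSource source;

        StreamingBag(TupleSource source) {
            this.source = source;
        }

        @Override
        public Iterator<Object> iterator() {
            return new Iterator<Object>() {
                private long read = 0;

                @Override
                public boolean hasNext() {
                    return read < source.size();
                }

                @Override
                public Object next() {
                    read++;
                    return source.readNext(); // one tuple in memory at a time
                }
            };
        }
    }

    // A COUNT-style consumer: touches each tuple once, keeps only a counter,
    // so memory use stays constant regardless of bag size.
    static long count(Iterable<Object> bag) {
        long n = 0;
        for (Object ignored : bag) n++;
        return n;
    }

    public static void main(String[] args) {
        // A fake source that "decodes" a million tuples without ever
        // holding more than one of them.
        TupleSource source = new TupleSource() {
            public long size() { return 1_000_000L; }
            public Object readNext() { return "tuple"; }
        };
        System.out.println(count(new StreamingBag(source))); // prints 1000000
    }
}
```

The catch Dmitriy alludes to is that a bag backed by a live input cursor is fragile (it can only be iterated while the underlying stream is valid), which is part of why swapping the bag inside BinSedesTuple is "quite tricky".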
