Seems like big data bags are still a headache for Pig. 
Here's a mail-archive thread I found: 
http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[email protected]%3E

I've tried all the ways I can think of, and none of them works. 
I think I have to play some tricks inside the Pig source code.



Haitao Yao
[email protected]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-9, at 2:18 PM, Haitao Yao wrote:

> There's also another reason for the OOM: I group the data by ALL, so the 
> parallelism is 1. With a big data bag, the reducer OOMs. 
> 
> After digging into the Pig source code, I found that replacing the data 
> bag in BinSedesTuple is quite tricky, and it may cause other unknown 
> problems… 
> 
> Has anybody else encountered the same problem? 
> 
> 
> 
> On 2012-7-9, at 11:11 AM, Haitao Yao wrote:
> 
>> Sorry for the improper statement. 
>> The problem is the DataBag. BinSedesTuple reads the full data of the 
>> DataBag, and when COUNT is used on the data, it causes an OOM.
>> The diagrams also show that most of the objects come from an ArrayList.
>> 
>> I want to reimplement the DataBag read by BinSedesTuple so that it just 
>> holds a reference to the data input and reads the data one by one when an 
>> iterator is used to access the data.
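>> The lazy-bag idea could be sketched roughly like this: a minimal, 
>> self-contained Java sketch (not Pig's actual DataBag/BinSedesTuple API; 
>> the LazyBag class and the int-element encoding are illustrative) of a bag 
>> that keeps only the serialized bytes and deserializes one element per 
>> iterator step, so a COUNT-style scan never builds an ArrayList holding 
>> every element.

```java
import java.io.*;
import java.util.Iterator;

// Illustrative sketch only: a bag that holds a reference to the serialized
// data and deserializes one element per next() call, instead of
// materializing everything into an ArrayList up front.
public class LazyBag implements Iterable<Integer> {
    private final byte[] raw;   // reference to the serialized data
    private final int size;     // element count written by the serializer

    public LazyBag(byte[] raw, int size) {
        this.raw = raw;
        this.size = size;
    }

    @Override
    public Iterator<Integer> iterator() {
        final DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(raw));
        return new Iterator<Integer>() {
            private int read = 0;
            public boolean hasNext() { return read < size; }
            public Integer next() {
                try {
                    read++;
                    return in.readInt();  // deserialize one element at a time
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }

    // Helper: serialize the ints 0..n-1, standing in for the bag's on-disk form.
    public static byte[] serialize(int n) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            for (int i = 0; i < n; i++) out.writeInt(i);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        long count = 0;
        for (int v : new LazyBag(serialize(5), 5)) count++;  // COUNT-style scan
        System.out.println(count);  // prints 5
    }
}
```

>> The point of the sketch is that the scan uses constant memory in the 
>> number of elements, which is what the reducer would need here.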
>> 
>> I will give it a shot. 
>> 
>> 
>> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>> 
>>> BinSedesTuple is just the tuple; changing it won't do anything about the 
>>> fact that lots of tuples are being loaded.
>>> 
>>> The snippet you provided will not load all the data for computation, since 
>>> COUNT implements the Algebraic interface (partial counts will be done on 
>>> combiners).
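>>> As a rough illustration of that combiner path (plain Java, not Pig's 
>>> actual COUNT$Initial/COUNT$Intermediate/COUNT$Final classes; the method 
>>> names and split layout are made up for the sketch): each map-side call 
>>> emits a partial count of 1, combiners sum the partials, and the reducer 
>>> only sums one partial count per combiner instead of seeing every tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the algebraic pattern: partial counts on the map
// side and combiners, a final sum of partials on the reducer.
public class AlgebraicCount {
    public static long initial(Object tuple) { return 1L; }        // map side: 1 per tuple
    public static long intermediate(List<Long> partials) {         // combiner: sum partials
        long s = 0;
        for (long p : partials) s += p;
        return s;
    }
    public static long finalCount(List<Long> partials) {           // reducer: sum partial sums
        return intermediate(partials);
    }

    public static void main(String[] args) {
        // 10 input tuples spread across 3 hypothetical map tasks
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e", "f", "g"),
                Arrays.asList("h", "i", "j"));
        List<Long> combined = new ArrayList<>();
        for (List<String> split : splits) {
            List<Long> partials = new ArrayList<>();
            for (String t : split) partials.add(initial(t));
            combined.add(intermediate(partials));  // one partial count per combiner
        }
        System.out.println(finalCount(combined));  // prints 10
    }
}
```

>>> The reducer here only ever sees three longs, never the ten tuples, which 
>>> is why an algebraic COUNT alone should not OOM the reducer.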
>>> 
>>> Something else is causing tuples to be materialized. Are you using other 
>>> UDFs? Can you provide more details on the script? When you run "explain" on 
>>> "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
>>> 
>>> You can check the "pig.alias" property in the jobconf to identify which 
>>> relations are being calculated by a given MR job; that might help narrow 
>>> things down.
>>> 
>>> -Dmitriy
>>> 
>>> 
>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>>> hi,
>>>     I wrote a Pig script in which one of the reducers always OOMs no matter 
>>> how I change the parallelism.
>>>         Here's the script snippet:
>>>             Data = group SourceData all;
>>>             Result = foreach Data generate group, COUNT(SourceData);
>>>             store Result into 'XX';
>>>     
>>>     I analyzed the dumped Java heap and found that the reducer loads all 
>>> the data for the foreach and COUNT. 
>>> 
>>>     Can I re-implement BinSedesTuple so that reducers don't load all the 
>>> data for computation? 
>>> 
>>> Here's the object domination tree:
>>> 
>>> 
>>> 
>>> Here's the jmap result: 
>>> 
>>>  
>>> 
>>> 
>>> 
>> 
> 
