Like I said earlier, if all you are doing is a COUNT, the data bag should not be 
growing. On the reduce side, it'll just be a bag of partial counts from the 
combiners. 
Something else is happening that's preventing the algebraic and accumulative 
optimizations from kicking in. Can you share a minimal script that reproduces 
the problem for you?
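
To see why the bag stays small when the algebraic optimization kicks in, here is a minimal Python sketch of combiner-style partial counting — an illustration of the idea only, not Pig's actual COUNT implementation, and the function names are just for this example:

```python
# Illustration of algebraic (combiner-friendly) counting: each map-side
# combiner emits a partial count, so the reduce-side bag contains only a
# few small numbers instead of every input tuple.

def initial(split):
    # map/combiner phase: count the tuples in one input split
    return sum(1 for _ in split)

def final(partial_counts):
    # reduce phase: the bag holds only the partial counts
    return sum(partial_counts)

splits = [["a", "b", "c"], ["d", "e"], ["f"]]
partials = [initial(s) for s in splits]
total = final(partials)
print(partials, total)  # [3, 2, 1] 6
```

With this shape, the reducer's bag holds one small number per map task rather than every input tuple, which is why a plain COUNT should not blow up the heap.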

On Jul 9, 2012, at 3:24 AM, Haitao Yao <[email protected]> wrote:

> Seems like big data bags are still a headache for Pig. 
> Here's a mail archive I found: 
> http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[email protected]%3E
> 
> I've tried every approach I can think of, and none works. 
> I think I will have to resort to some tricks inside the Pig source code.
> 
> 
> 
> Haitao Yao
> [email protected]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
> 
> On Jul 9, 2012, at 2:18 PM, Haitao Yao wrote:
> 
>> There's also another cause of the OOM: I group the data by ALL, and the 
>> parallelism is 1. With a big data bag, the reducer OOMs. 
>> 
>> After digging into the Pig source code, I found that replacing the data 
>> bag in BinSedesTuple is quite tricky and may cause other unknown 
>> problems… 
>> 
>> Has anybody else encountered the same problem? 
>> 
>> 
>> Haitao Yao
>> [email protected]
>> weibo: @haitao_yao
>> Skype:  haitao.yao.final
>> 
>> On Jul 9, 2012, at 11:11 AM, Haitao Yao wrote:
>> 
>>> Sorry for the imprecise statement. 
>>> The problem is the DataBag. BinSedesTuple reads the full data of the 
>>> DataBag, and when COUNT is applied to the data, it causes an OOM.
>>> The diagrams also show that most of the objects come from ArrayList.
>>> 
>>> I want to reimplement the DataBag that BinSedesTuple reads so that it 
>>> only holds a reference to the data input and reads the data one tuple at 
>>> a time when an iterator is used to access it.
>>> 
>>> I will give it a shot. 
>>> 
>>> Haitao Yao
>>> [email protected]
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>> 
>>> On Jul 6, 2012, at 11:06 PM, Dmitriy Ryaboy wrote:
>>> 
>>>> BinSedesTuple is just the tuple; changing it won't do anything about the 
>>>> fact that lots of tuples are being loaded.
>>>> 
>>>> The snippet you provided should not load all the data for computation, since 
>>>> COUNT implements the Algebraic interface (partial counts will be done in the 
>>>> combiners).
>>>> 
>>>> Something else is causing tuples to be materialized. Are you using other 
>>>> UDFs? Can you provide more details on the script? When you run "explain" 
>>>> on "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
>>>> 
>>>> You can check the "pig.alias" property in the jobconf to identify which 
>>>> relations are being calculated by a given MR job; that might help narrow 
>>>> things down.
>>>> 
>>>> -Dmitriy
>>>> 
>>>> 
>>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
>>>> hi,
>>>>    I wrote a Pig script in which one of the reducers always OOMs no matter 
>>>> how I change the parallelism.
>>>>        Here's the script snippet:
>>>>        Data = group SourceData all;
>>>>        Result = foreach Data generate group, COUNT(SourceData);
>>>>        store Result into 'XX';
>>>>    
>>>>    I analyzed the dumped Java heap and found that the reason is that 
>>>> the reducer loads all the data for the foreach and the count. 
>>>> 
>>>>    Can I re-implement BinSedesTuple to avoid having the reducers load all 
>>>> the data for computation? 
>>>> 
>>>> Here's the object domination tree:
>>>> 
>>>> 
>>>> 
>>>> Here's the jmap result: 
>>>> 
>>>> 
>>>> 
>>>> Haitao Yao
>>>> [email protected]
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>> 
>>>> 
>>> 
>> 
> 
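
The lazy, iterator-backed bag proposed in the thread — one that holds only a handle to its input and deserializes tuples on demand — can be sketched as follows. This is a Python illustration of the concept under the thread's assumptions, not Pig's BinSedesTuple code, and all the names here are hypothetical:

```python
class LazyBag:
    """A bag that defers reading: instead of materializing a list of
    tuples at deserialization time, it keeps a factory for the
    underlying input and yields tuples one at a time."""

    def __init__(self, open_input):
        # open_input: zero-argument callable returning an iterable of
        # tuples; it stands in for a handle to the serialized input
        self._open_input = open_input

    def __iter__(self):
        # re-open and stream on every traversal, so only one tuple is
        # resident in memory at a time
        return iter(self._open_input())

bag = LazyBag(lambda: (("row", i) for i in range(100000)))
count = sum(1 for _ in bag)  # streams the bag without building a list
print(count)  # 100000
```

The trade-off is that every traversal re-reads the underlying input, which is cheap for a single counting pass but costly if the bag is iterated repeatedly.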
