Sorry for the improper statement.
The problem is the DataBag: BinSedesTuple reads the full data of the DataBag,
and when COUNT is applied to that data, it causes an OOM.
The diagrams also show that most of the objects come from an ArrayList.

I want to reimplement the DataBag that BinSedesTuple reads, so that it only holds a
reference to the data input and deserializes the data one element at a time when an
iterator is used to access it.
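The idea above can be sketched roughly as follows. This is only a minimal illustration, not Pig's actual DataBag/BinSedesTuple API: the `LazyLongBag` class, its fields, and the long-only element type are all hypothetical, simplified stand-ins. The point is that the bag keeps only a reference to the serialized stream plus the element count, and deserializes one element per `next()` call instead of materializing everything into an ArrayList up front.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical sketch of a "lazy" bag: it holds only a reference to the
// serialized input and the element count, never an in-memory list.
class LazyLongBag implements Iterable<Long> {
    private final DataInput in;  // reference to the serialized data
    private final long size;     // element count, read from the bag header

    LazyLongBag(DataInput in, long size) {
        this.in = in;
        this.size = size;
    }

    public Iterator<Long> iterator() {
        return new Iterator<Long>() {
            private long read = 0;
            public boolean hasNext() { return read < size; }
            public Long next() {
                try {
                    read++;
                    return in.readLong();  // deserialize one element on demand
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        // Serialize a few longs, then count them lazily:
        // memory stays O(1) no matter how many elements the bag holds.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (long i = 0; i < 5; i++) out.writeLong(i);
        DataInput in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        long count = 0;
        for (long v : new LazyLongBag(in, 5)) count++;
        System.out.println(count);  // prints 5
    }
}
```

A real implementation would of course have to deal with rewinding/re-reading the stream when the iterator is requested more than once, which is why Pig's spillable bags buffer to disk.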

I'll give it a shot.

Haitao Yao
[email protected]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:

> BinSedesTuple is just the tuple, changing it won't do anything about the fact 
> that lots of tuples are being loaded.
> 
> The snippet you provided will not load all the data for computation, since 
> COUNT implements algebraic interface (partial counts will be done on 
> combiners).
> 
> Something else is causing tuples to be materialized. Are you using other 
> UDFs? Can you provide more details on the script? When you run "explain" on 
> "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
> 
> You can check the "pig.alias" property in the jobconf to identify which 
> relations are being calculated by a given MR job; that might help narrow 
> things down.
> 
> -Dmitriy
> 
> 
> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[email protected]> wrote:
> hi,
>       I wrote a pig script where one of the reducers always OOMs no matter how I
> change the parallelism.
>         Here's the script snippet:
>               Data = group SourceData all;
>               Result = foreach Data generate group, COUNT(SourceData);
>               store Result into 'XX';
>       
>       I analyzed the dumped java heap, and found that the reason is that
> the reducer loads all the data for the foreach and count.
> 
>       Can I re-implement the BinSedesTuple to keep the reducers from loading all
> the data for computation?
> 
> Here's the object domination tree:
> 
> 
> 
> here's the jmap result: 
> 
>  
> 
> Haitao Yao
> [email protected]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
> 
> 
