David Dreyfus created PIG-3979:
----------------------------------

             Summary: group all performance, garbage collection, and incremental aggregation
                 Key: PIG-3979
                 URL: https://issues.apache.org/jira/browse/PIG-3979
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.11.1, 0.12.0
            Reporter: David Dreyfus
             Fix For: 0.13.0


I have a Pig statement similar to:

summary = foreach (group data ALL) generate
    COUNT(data.col1),
    SUM(data.col2),
    Moments(data.col3),
    Moments(data.col4);

There are a couple of hundred columns.

I set the following:
SET pig.exec.mapPartAgg true;            -- enable in-map (partial) aggregation
SET pig.exec.mapPartAgg.minReduction 3;  -- keep it only if input:output reaches 3:1
SET pig.cachedbag.memusage 0.05;         -- fraction of heap available to cached bags

I found that when I ran this on a JVM with insufficient memory, the process
eventually timed out because it fell into an endless garbage-collection loop.

The problem persisted regardless of the pig.cachedbag.memusage setting.

I solved the problem by making changes to
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.

Rather than reading in 10000 records to establish an estimate of the reduction,
I make the estimate after reading in enough tuples to fill
pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
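
For illustration, a minimal Java sketch of that memory-based trigger follows; the class, its fields, and the estimateReduction() hook are hypothetical names, not the actual POPartialAgg members:

import java.util.ArrayList;
import java.util.List;

// Sketch: fire the reduction estimate when buffered tuples fill
// pig.cachedbag.memusage percent of the heap, rather than after a
// fixed 10000 records. All names here are illustrative.
public class TupleBufferSketch {
    private final long memLimitBytes;
    private final List<Object> buffered = new ArrayList<Object>();
    private long bytesBuffered = 0;
    private boolean estimated = false;

    public TupleBufferSketch(double memUsageFraction) {
        // e.g. memUsageFraction = 0.05, per pig.cachedbag.memusage
        this.memLimitBytes =
                (long) (Runtime.getRuntime().maxMemory() * memUsageFraction);
    }

    public void add(Object tuple, long approxSizeBytes) {
        buffered.add(tuple);
        bytesBuffered += approxSizeBytes;
        if (!estimated && bytesBuffered >= memLimitBytes) {
            estimateReduction();  // compare against pig.exec.mapPartAgg.minReduction
            estimated = true;
        }
    }

    private void estimateReduction() {
        System.out.println("Estimating reduction after " + buffered.size()
                + " tuples (~" + bytesBuffered + " bytes buffered)");
    }

    public static void main(String[] args) {
        TupleBufferSketch buf = new TupleBufferSketch(0.05);
        while (!buf.estimated) {
            buf.add(new Object(), 64);  // assume ~64 bytes per tuple
        }
    }
}

The point is that the estimate fires when a memory budget is reached, so a small heap triggers it early instead of buffering a fixed record count it cannot hold.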

I also made a change to guarantee that at least one record is allowed in
second-tier storage. In the current implementation, if the reduction is very
high (e.g., 1000:1), the space allocated to second-tier storage rounds down
to zero.
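
A toy Java example of how a high reduction can round the second-tier allocation down to zero, and the one-record floor that fixes it; the numbers and variable names are invented for illustration, not the actual POPartialAgg computation:

// Toy illustration of second-tier sizing at a very high reduction.
public class SecondTierSizing {
    public static void main(String[] args) {
        int firstTierRecords = 500;    // records the memory budget allows
        int observedReduction = 1000;  // very high input:output ratio

        int naive = firstTierRecords / observedReduction;                 // 500/1000 = 0
        int floored = Math.max(1, firstTierRecords / observedReduction);  // at least 1

        System.out.println("naive second-tier size:   " + naive);    // 0: nothing fits
        System.out.println("floored second-tier size: " + floored);  // 1: progress possible
    }
}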

With these changes, I can summarize large data sets with small JVMs. I also
find that setting pig.cachedbag.memusage to a small value such as 0.05 gives
much better garbage-collection behavior without reducing throughput. I suppose
tuning the GC itself might also address the excessive garbage collection.

The performance is sweet. 


