[
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175345#comment-14175345
]
David Dreyfus commented on PIG-3979:
------------------------------------
For counting spilled bytes, I rely on the counters. Does this not work in your
use case?
If it doesn't, and you want to add something like:
{code}
if (estimatedFreed > 0) {
    String msg = "Spilled an estimate of " + estimatedFreed +
        " bytes from " + numObjSpilled + " objects. " +
        info.getUsage();
    log.info(msg);
}
{code}
I have no problem with that.
I understand how the extra GC call solved PIG-3148. I also think it creates the
problem that causes Pig to auger itself into the ground.
I think the challenge is to come up with a better solution to PIG-3148 that
avoids spilling stale bags and avoids relying on multiple calls to GC to clean
stuff up.
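For illustration, the two changes described in the issue below (estimating the
reduction once the sample fills a fraction of the heap rather than after a fixed
record count, and guaranteeing at least one slot of second-tier capacity) could
be sketched roughly as follows. All names here are hypothetical and are not the
actual POPartialAgg fields:

```java
// Hypothetical sketch of the two changes; names are illustrative only.
public class PartialAggSketch {

    // Trigger the reduction estimate once the sampled tuples occupy
    // pig.cachedbag.memusage percent of the maximum heap, instead of
    // waiting for a fixed 10000-record sample.
    static boolean shouldEstimateReduction(long sampledBytes, double memUsage) {
        long threshold = (long) (Runtime.getRuntime().maxMemory() * memUsage);
        return sampledBytes >= threshold;
    }

    // Guarantee at least one record of second-tier capacity, even when
    // the observed reduction is very high (e.g. 1000:1) and the naive
    // division would round the capacity down to zero.
    static int secondTierCapacity(int firstTierCapacity, int reduction) {
        return Math.max(1, firstTierCapacity / reduction);
    }
}
```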
> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
> Key: PIG-3979
> URL: https://issues.apache.org/jira/browse/PIG-3979
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.12.0, 0.11.1
> Reporter: David Dreyfus
> Assignee: David Dreyfus
> Fix For: 0.14.0
>
> Attachments: PIG-3979-3.patch, PIG-3979-4.patch, PIG-3979-v1.patch,
> POPartialAgg.java.patch, SpillableMemoryManager.java.patch
>
>
> I have a PIG statement similar to:
> summary = foreach (group data ALL) generate
> COUNT(data.col1), SUM(data.col2)
> , Moments(data.col3)
> , Moments(data.col4);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process
> eventually timed out because of an infinite garbage collection loop.
> The problem occurred regardless of the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPartialAgg.java
> Rather than reading in 10000 records to establish an estimate of the
> reduction, I make an estimate after reading in enough tuples to fill
> pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee at least one record allowed in second tier
> storage. In the current implementation, if the reduction is very high (e.g.,
> 1000:1), the space allowed in second tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also
> find that setting pig.cachedbag.memusage to a small number such as 0.05
> results in much better garbage collection performance without reducing
> throughput. I suppose tuning GC would also solve a problem with excessive
> garbage collection.
> The performance is sweet.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)