[jira] [Commented] (PIG-3979) group all performance, garbage collection, and incremental aggregation

David Dreyfus (JIRA) Thu, 16 Oct 2014 08:51:03 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173884#comment-14173884
 ]


David Dreyfus commented on PIG-3979:
------------------------------------

Hi Rohini,
1) Debug messages can always be enabled. Separating Info and Debug messages 
allows better performance and smaller logs.
2) System.gc() isn't guaranteed to force GC; it is only a suggestion. The issue 
we ran into is the memory manager augering itself into the ground with repeated 
looping. The thought is to let the JVM orchestrate its own GC operations and to 
use the notification only as a chance to shrink the memory used by spillable 
objects. I didn't determine whether the augering was due to a specific GC call 
or to the general issue of looping on GC calls. For PIG-3148, I would look at 
threshold settings, releasing locks, or a way to avoid spilling something not 
worth spilling. Looping on GC doesn't seem to be a good solution to a memory 
management issue. 
3) I agree that duplication in code is probably not a great idea.
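
A minimal sketch of the idea in point 2, assuming the standard 
java.lang.management notification API: the listener shrinks spillable objects 
when the JVM reports that a heap pool is still above a threshold after a 
collection, and never calls System.gc() itself. The class name, the Spillable 
interface, and the threshold fraction are hypothetical and are not taken from 
the attached patches.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical illustration; not Pig's SpillableMemoryManager.
final class SpillOnNotification implements NotificationListener {

    /** Hypothetical stand-in for an object that can release memory on request. */
    interface Spillable {
        long spill(); // returns the number of bytes released
    }

    private final List<Spillable> spillables = new CopyOnWriteArrayList<>();

    SpillOnNotification(double thresholdFraction) {
        if (thresholdFraction <= 0.0 || thresholdFraction > 1.0) {
            throw new IllegalArgumentException("thresholdFraction must be in (0, 1]");
        }
        // Ask the JVM to notify us when a heap pool is still above the
        // threshold after one of its own collections.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP
                    || !pool.isCollectionUsageThresholdSupported()) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage != null && usage.getMax() > 0) { // max is -1 when undefined
                pool.setCollectionUsageThreshold((long) (usage.getMax() * thresholdFraction));
            }
        }
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(this, null, null);
    }

    void register(Spillable s) {
        spillables.add(s);
    }

    @Override
    public void handleNotification(Notification n, Object handback) {
        if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED.equals(n.getType())) {
            releaseMemory(); // shrink spillables; no System.gc() and no loop
        }
    }

    /** Spills every registered object once and returns the total bytes released. */
    long releaseMemory() {
        long freed = 0;
        for (Spillable s : spillables) {
            freed += s.spill();
        }
        return freed;
    }
}
```

Each notification triggers a single pass over the registered objects, so the 
handler cannot loop on GC calls; when to collect next is left to the JVM.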

> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
>                 Key: PIG-3979
>                 URL: https://issues.apache.org/jira/browse/PIG-3979
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.12.0, 0.11.1
>            Reporter: David Dreyfus
>            Assignee: David Dreyfus
>             Fix For: 0.14.0
>
>         Attachments: PIG-3979-3.patch, PIG-3979-4.patch, PIG-3979-v1.patch, 
> POPartialAgg.java.patch, SpillableMemoryManager.java.patch
>
>
> I have a PIG statement similar to:
> summary = foreach (group data ALL) generate 
> COUNT(data.col1), SUM(data.col2), SUM(data.col2)
> , Moments(data.col3)
> , Moments(data.col4);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process 
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
> Rather than reading in 10000 records to establish an estimate of the 
> reduction, I make an estimate after reading in enough tuples to fill the 
> pig.cachedbag.memusage fraction of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in 
> second-tier storage. In the current implementation, if the reduction is very 
> high (e.g., 1000:1), the space in second-tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also 
> find that setting pig.cachedbag.memusage to a small number such as 0.05 
> results in much better garbage collection performance without reducing 
> throughput. I suppose tuning GC would also solve a problem with excessive 
> garbage collection.
> The performance is sweet. 
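
The two sizing changes described in the quoted text can be sketched roughly as 
follows. This is an illustration under stated assumptions, not the code in 
POPartialAgg.java.patch: the class and method names are hypothetical, and the 
integer-division formula for the second-tier size is assumed only to show how 
a very high reduction could yield zero.

```java
// Hypothetical illustration; names and formulas are assumptions.
final class PartialAggSizing {

    /** Bytes to buffer before estimating the reduction: a fraction of the max heap. */
    static long firstTierBudgetBytes(double memUsageFraction, long maxMemoryBytes) {
        return (long) (memUsageFraction * maxMemoryBytes);
    }

    /** True once enough tuples have been buffered to make the estimate. */
    static boolean readyToEstimate(long bytesBuffered, long budgetBytes) {
        return bytesBuffered >= budgetBytes;
    }

    /**
     * Second-tier capacity in records. Plain integer division gives 0 when the
     * reduction factor exceeds the first-tier record count (for example 500
     * records at 1000:1), so the result is clamped to at least one record.
     */
    static int secondTierCapacity(int firstTierRecords, int reductionFactor) {
        int raw = firstTierRecords / Math.max(1, reductionFactor);
        return Math.max(1, raw);
    }
}
```

With pig.cachedbag.memusage set to 0.05, the budget would be computed as 
firstTierBudgetBytes(0.05, Runtime.getRuntime().maxMemory()).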



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
