[ https://issues.apache.org/jira/browse/PIG-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174230#comment-14174230 ]

Rohini Palaniswamy commented on PIG-3979:
-----------------------------------------

bq. Debug messages can always be enabled. Separating Info and Debug messages 
allows better performance and smaller logs.
    These log messages do not actually cause much spam and are very helpful for 
analyzing a running or already completed job. If there are too many spill 
messages, then there is a bigger problem and the job needs to be tuned. When 
there are production issues, it is not possible to go ask users to turn on 
debug logging to determine why a job is slow. Even Hadoop logs all of its spill 
information at INFO, because spilling is something you need to know about and 
should not be DEBUG. Also, turning on debug logging logs far too much. A system 
should carry enough information to pinpoint commonly encountered scenarios 
without having to turn on debug logging.

bq. System.gc() isn't guaranteed to force GC; it is a suggestion. 
  Theoretically, yes. But PIG-3148 just added an extra GC to keep big stale 
bags from being spilled, and it did fix that issue. The SpillableMemoryManager 
has relied on invoking System.gc() from the beginning, and it has been working 
so far on that basis. Simply removing it is going to break a lot of existing 
jobs.
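
To make that dependency concrete, here is a minimal sketch of the pattern (the names Spillable, GC_ACTIVATION_SIZE, and spillAll are illustrative assumptions, not the actual Pig source): spill the registered bags, then suggest a GC so the freed memory is actually reclaimed before the next low-memory notification.

{code}
import java.util.ArrayList;
import java.util.List;

// Minimal sketch, assuming made-up names, of the pattern the memory
// manager has relied on: spill the registered bags, then suggest a GC
// so the freed memory is visible before the next low-memory check.
public class GcHintSketch {

    // Simplified stand-in for Pig's Spillable contract.
    interface Spillable {
        long spill(); // spills to disk, returns estimated bytes freed
    }

    static final long GC_ACTIVATION_SIZE = 40L * 1024 * 1024; // illustrative 40MB

    static void spillAll(List<Spillable> spillables) {
        long accumulatedFreeSize = 0;
        for (Spillable s : spillables) {
            accumulatedFreeSize += s.spill();
        }
        // System.gc() is only a hint to the JVM, but the manager has
        // depended on it from the beginning to reclaim spilled bags.
        if (accumulatedFreeSize > GC_ACTIVATION_SIZE) {
            System.gc();
        }
    }

    public static void main(String[] args) {
        List<Spillable> bags = new ArrayList<>();
        bags.add(() -> 50L * 1024 * 1024); // pretend one bag freed 50MB
        spillAll(bags);
    }
}
{code}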

The problem I see here is that 

{code}
estimatedFreed += toBeFreed;
accumulatedFreeSize += toBeFreed;
// This should significantly reduce the number of small files
// in case that we have a lot of nested bags
if (accumulatedFreeSize > gcActivationSize) {
    invokeGC = true;
}
{code}

may not exactly apply to POPartialAgg, as it behaves differently from bag 
spills. Bags actually spill the data and free up the memory, while POPartialAgg 
only sets its spill flag and does not actually spill until the next record is 
processed. So when System.gc() is invoked because POPartialAgg was estimated at 
more than 40MB, the call is actually useless: the maps have not been processed 
and aggregated yet, so no space has been freed. If that is fixed, I think it 
might work with the regular SpillableMemoryManager code. Changing 
SpillableMemoryManager to make POPartialAgg work should not cause regressions 
for normal bags, as that is the more critical case. 
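
To illustrate the timing problem, here is a hedged sketch of the deferred spill (the names doSpill, rawInputMap, and getNextTuple are made up for illustration; this is not the actual POPartialAgg source): spill() merely raises a flag, and the in-memory maps are only aggregated on the next record, so a System.gc() fired right after spill() returns has nothing to reclaim yet.

{code}
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a deferred spill, under assumed names.
public class DeferredSpillSketch {
    private boolean doSpill = false;
    private final Map<Object, Object> rawInputMap = new HashMap<>();

    // Called by the memory manager: does not free anything itself.
    public long spill() {
        doSpill = true; // memory is still held by rawInputMap here
        return 0;       // nothing has actually been freed yet
    }

    // Called once per input record by the execution pipeline.
    public Object getNextTuple(Object input) {
        if (doSpill) {
            aggregateAndEmit(); // only now is the memory released
            doSpill = false;
        }
        rawInputMap.put(input, input);
        return input;
    }

    private void aggregateAndEmit() {
        // Combine the map contents, pass them downstream, drop them.
        rawInputMap.clear();
    }

    public static void main(String[] args) {
        DeferredSpillSketch op = new DeferredSpillSketch();
        op.spill();            // memory manager asks for a spill: nothing freed
        op.getNextTuple("t1"); // the actual spill happens here, one record later
    }
}
{code}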



> group all performance, garbage collection, and incremental aggregation
> ----------------------------------------------------------------------
>
>                 Key: PIG-3979
>                 URL: https://issues.apache.org/jira/browse/PIG-3979
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.12.0, 0.11.1
>            Reporter: David Dreyfus
>            Assignee: David Dreyfus
>             Fix For: 0.14.0
>
>         Attachments: PIG-3979-3.patch, PIG-3979-4.patch, PIG-3979-v1.patch, 
> POPartialAgg.java.patch, SpillableMemoryManager.java.patch
>
>
> I have a PIG statement similar to:
> summary = foreach (group data ALL) generate
>     COUNT(data.col1), SUM(data.col2), SUM(data.col2)
>     , Moments(col3)
>     , Moments(data.col4);
> There are a couple of hundred columns.
> I set the following:
> SET pig.exec.mapPartAgg true;
> SET pig.exec.mapPartAgg.minReduction 3;
> SET pig.cachedbag.memusage 0.05;
> I found that when I ran this on a JVM with insufficient memory, the process 
> eventually timed out because of an infinite garbage collection loop.
> The problem was invariant to the memusage setting.
> I solved the problem by making changes to:
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperator.POPartialAgg.java
> Rather than reading in 10000 records to establish an estimate of the 
> reduction, I make an estimate after reading in enough tuples to fill 
> pig.cachedbag.memusage percent of Runtime.getRuntime().maxMemory().
> I also made a change to guarantee that at least one record is allowed in 
> second-tier storage. In the current implementation, if the reduction is very 
> high, e.g. 1000:1, the space in second-tier storage is zero.
> With these changes, I can summarize large data sets with small JVMs. I also 
> find that setting pig.cachedbag.memusage to a small number such as 0.05 
> results in much better garbage collection performance without reducing 
> throughput. I suppose tuning GC would also solve a problem with excessive 
> garbage collection.
> The performance is sweet. 
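
For illustration, the sizing change described in the report above might look roughly like the following (a sketch under assumed names and constants such as avgTupleSize; it is not the attached patch): sample until the first tier fills the pig.cachedbag.memusage fraction of the max heap, and always reserve at least one second-tier slot.

{code}
// Rough sketch of the sizing idea, using assumed names and constants.
public class SamplingThresholdSketch {
    public static void main(String[] args) {
        double memUsage = 0.05; // pig.cachedbag.memusage
        long maxMem = Runtime.getRuntime().maxMemory();
        long firstTierBudget = (long) (memUsage * maxMem);

        long avgTupleSize = 256; // assumed running per-tuple estimate
        long sampleSize = firstTierBudget / avgTupleSize;

        long reduction = 1000; // an observed 1000:1 reduction
        // Guarantee at least one record of second-tier storage.
        long secondTierSlots = Math.max(1, sampleSize / reduction);

        System.out.println("sample " + sampleSize + " tuples, "
                + secondTierSlots + " second-tier slots");
    }
}
{code}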


