[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794217#action_12794217 ]

Sriranjan Manjunath commented on PIG-1102:
------------------------------------------

(3) refers to the case where we try to guess the number of records that fit into memory and start spilling the remaining records. InternalCachedBag.java addresses this case:

+        if (cacheLimit != 0 && mContents.size() % cacheLimit == 0) {
+            /* Increment the spill count */
+            incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT);
+        }
     }

cacheLimit holds the number of records that can be held in memory, whereas mContents is the collection that holds all the tuples. Here, I do not increment the counter for every record; instead I count every n'th record, n being the cacheLimit. This, however, does not increment the counter by the buffer size. Incrementing it by the buffer size would give a value approximately equal to the number of spilled records. (A standalone sketch of this counting arithmetic follows the quoted issue below.)

> Collect number of spills per job
> --------------------------------
>
>                 Key: PIG-1102
>                 URL: https://issues.apache.org/jira/browse/PIG-1102
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Sriranjan Manjunath
>             Fix For: 0.7.0
>
>         Attachments: PIG_1102.patch, PIG_1102.patch.1
>
>
> Memory shortage is one of the main performance issues in Pig. Knowing when we
> spill to disk is useful for understanding query performance and also to
> see how certain changes in Pig affect that.
> Other interesting stats to collect would be average CPU usage and max memory
> usage, but I am not sure if this information is easily retrievable.
> Using Hadoop counters for this would make sense.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
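
Below is a minimal, self-contained Java sketch of the counting arithmetic described in the comment above. It is not Pig's actual InternalCachedBag: it keeps every record in an in-memory list and models only the counter, not the spilling itself. The class name, the proactiveSpillTicks field, and the approxSpilledRecords() method are hypothetical; cacheLimit, mContents, incSpillCount and PigCounters.PROACTIVE_SPILL_COUNT are the names that appear in the patch.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: illustrates bumping a counter once every cacheLimit records
    // (rather than once per record) and scaling by cacheLimit to approximate
    // the number of proactively spilled records.
    public class ProactiveSpillCounterSketch {
        private final int cacheLimit;          // records we guess will fit in memory
        private final List<Object> mContents = new ArrayList<Object>(); // records seen so far
        private long proactiveSpillTicks = 0;  // incremented on every cacheLimit-th record

        public ProactiveSpillCounterSketch(int cacheLimit) {
            this.cacheLimit = cacheLimit;
        }

        public void add(Object record) {
            mContents.add(record);
            // Count every n'th record, n being cacheLimit, mirroring the patch above.
            if (cacheLimit != 0 && mContents.size() % cacheLimit == 0) {
                // In Pig this would be incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT).
                proactiveSpillTicks++;
            }
        }

        // Scaling the ticks by cacheLimit (the "buffer size" in the comment) gives
        // a value in the neighborhood of the number of records that overflowed the cache.
        public long approxSpilledRecords() {
            return proactiveSpillTicks * cacheLimit;
        }

        public static void main(String[] args) {
            ProactiveSpillCounterSketch bag = new ProactiveSpillCounterSketch(1000);
            for (int i = 0; i < 4200; i++) {
                bag.add(i);
            }
            // 4 ticks (at 1000, 2000, 3000, 4000 records) scaled by 1000 prints 4000.
            System.out.println("approx spilled records: " + bag.approxSpilledRecords());
        }
    }

The design trade-off the comment describes is visible here: incrementing once per cacheLimit records keeps the counter cheap, at the cost of only approximating the true spill volume until the ticks are multiplied back by the buffer size.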
Sriranjan Manjunath commented on PIG-1102: ------------------------------------------ (3) refers to the case where we try to guess the number of records that fit into memory and start spilling the other records. InternalCachedBag.java addresses this case: + if (cacheLimit!= 0 && mContents.size() % cacheLimit == 0) { + /* Increment the spill count*/ + incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT); + } } cacheLimit holds the number of records that can be held in memory whereas mContents is the tuple that holds all the records. Here, I do not increment the counter for every record. Instead I count every n'th record, n being the cacheLimit. This however, does not increment the counter by the buffer size. Incrementing it by the buffer size will give us a value which approximately equal to the number of spilled records. > Collect number of spills per job > -------------------------------- > > Key: PIG-1102 > URL: https://issues.apache.org/jira/browse/PIG-1102 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Sriranjan Manjunath > Fix For: 0.7.0 > > Attachments: PIG_1102.patch, PIG_1102.patch.1 > > > Memory shortage is one of the main performance issues in Pig. Knowing when we > spill do the disk is useful for understanding query performance and also to > see how certain changes in Pig effect that. > Other interesting stats to collect would be average CPU usage and max mem > usage but I am not sure if this information is easily retrievable. > Using Hadoop counters for this would make sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.