[ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471919#comment-13471919 ]
Prasanth J commented on PIG-2831:
---------------------------------

Updated the patch with the following changes:
1) The partition factor algorithm is tweaked to better distribute the reducer workload.
2) The partition factor in the PartitionLargeGroups UDF is now initialized to 0 (earlier it was 1), which generates many smaller bags (how many depends on the cardinality of the algebraic attribute). The earlier initialization to 1 generated a few large bags.

The above changes also reduced the number of records/bags spilled during the full cube materialization job. In a test experiment with 3M tuples and a rollup on 3 dimensions, the following improvements were observed:
PROACTIVE_SPILL_COUNT_RECS improved by ~34% (from 5206793 to 3440694)
PROACTIVE_SPILL_COUNT_BAGS improved by ~54% (from 22 to 10)

> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
>                 Key: PIG-2831
>                 URL: https://issues.apache.org/jira/browse/PIG-2831
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>         Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch,
> PIG-2831.3.git.patch, PIG-2831.4.git.patch, PIG-2831.5.git.patch,
> PIG-2831.6.git.patch, PIG-2831.7.git.patch, PIG-2831.8.git.patch,
> PIG-2831.9.git.patch
>
>
> Implementing distributed cube materialization on holistic measures based on
> the MR-Cube approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify whether the measure is holistic or not
> 2) Determine the algebraic attribute (can be detected automatically for a few
> cases; if automatic detection fails, the user should hint the algebraic
> attribute)
> 3) Modify MRPlan to insert a sampling job which executes the naive cube
> algorithm and generates an annotated cube lattice (contains large group
> partitioning information)
> 4) Modify the plan to distribute the annotated cube lattice to all mappers
> using the distributed cache
> 5) Execute the actual cube materialization on the full dataset
> 6) Modify MRPlan to insert a post-process job for combining the results of
> the actual cube materialization job
> 7) OOM exception handling
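
The idea behind steps 2, 5, and 6 can be illustrated with a small in-memory sketch. It is not the code from the patch: the function names, the `num_partitions` parameter, the use of a distinct count as the holistic measure, and the hash-based split are all assumptions made for illustration. Each rollup region is split on the algebraic attribute (here a user id), every partition computes a local distinct count, and a post-process step sums the partial results. The sum is exact because a given id always lands in the same partition of a region.

```python
from collections import defaultdict


def rollup_regions(dims):
    """Return the rollup regions for one tuple of dimension values.

    For (a, b, c) this yields (a, b, c), (a, b, None), (a, None, None)
    and (None, None, None), where None marks a rolled-up dimension.
    """
    n = len(dims)
    return [tuple(dims[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


def partitioned_distinct_count(rows, num_partitions):
    """Distinct count per rollup region, computed via value partitioning.

    rows           -- iterable of (dims_tuple, user_id) pairs
    num_partitions -- hypothetical number of partitions per group (>= 1)
    """
    # "Materialization" phase: group by (region, partition id), where the
    # partition id is derived from the algebraic attribute (the user id).
    partial = defaultdict(set)
    for dims, uid in rows:
        part = hash(uid) % num_partitions
        for region in rollup_regions(dims):
            partial[(region, part)].add(uid)

    # "Post-process" phase: combine the partial results of each region.
    totals = defaultdict(int)
    for (region, _part), ids in partial.items():
        totals[region] += len(ids)
    return dict(totals)


rows = [
    (("us", "ca", "sf"), 1),
    (("us", "ca", "sf"), 2),
    (("us", "ca", "la"), 1),
    (("us", "ny", "nyc"), 3),
    (("us", "ca", "sf"), 1),
]
result = partitioned_distinct_count(rows, num_partitions=2)
print(result[(None, None, None)])  # distinct users over the whole dataset
```

In this sketch a larger `num_partitions` produces more, smaller sets per region, which loosely mirrors the "many smaller bags" trade-off described in the comment; how the patch actually derives its partition factor is not shown here.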