[
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447460#comment-13447460
]
Prasanth J commented on PIG-2831:
---------------------------------
Hi Dmitriy,
I have implemented the new intermediate storage with statistics gathering and
the new sample loader, as per your idea on RB. Attached is the new patch,
containing the following changes:
1) Added a new RichInterStorage, which implements the StoreMetadata and
LoadMetadata interfaces for storing and loading the statistics of intermediate
data. RichInterStorage uses RichRecordReader and RichInputFormat for reading
intermediate data, and RichRecordWriter and RichOutputFormat for storing
intermediate data. RichRecordWriter and RichOutputFormat are the same as
InterRecordWriter and InterOutputFormat. The main difference is in
RichRecordReader and RichInputFormat: RichInputFormat wraps all the splits
into one logical split so that only one mapper is used for loading the sample
dataset.
2) CubeSampleLoader uses the underlying RichRecordReader to get random samples
of data. RichRecordReader opens at most 100 inner splits and chooses a random
split each time it reads a tuple.
3) Changes to PigOutputCommitter for storing statistics. Statistics are stored
at the end of every commitTask(), one set per output partition.
RichInterStorage takes care of loading the statistics corresponding to the
different partitions and aggregating them together. The statistics store
numberOfRows and avgInMemTupleSize for each partition (only these two values
are required for holistic cubing).
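To make points 2) and 3) a bit more concrete, here is a minimal sketch in
Python. It is not the Java code from the patch. The names PartitionStats,
aggregate_stats and sample_tuples are made up for illustration, the
row-weighted average used to combine avgInMemTupleSize is an assumption about
how the aggregation could work, and splits beyond the first 100 are simply
ignored here, which is also an assumption.

```python
import random
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")

# The comment above says at most 100 inner splits are opened.
MAX_OPEN_SPLITS = 100


@dataclass
class PartitionStats:
    """Statistics for one output partition (hypothetical container)."""
    number_of_rows: int
    avg_in_mem_tuple_size: float


def aggregate_stats(parts: Sequence[PartitionStats]) -> PartitionStats:
    """Combine per-partition statistics into one value.

    Assumption: the average tuple size is weighted by each partition's
    row count, so large partitions count for more than small ones.
    """
    total_rows = sum(p.number_of_rows for p in parts)
    if total_rows == 0:
        return PartitionStats(0, 0.0)
    total_size = sum(p.number_of_rows * p.avg_in_mem_tuple_size for p in parts)
    return PartitionStats(total_rows, total_size / total_rows)


def sample_tuples(splits: Sequence[Iterable[T]], rng: random.Random) -> Iterator[T]:
    """Yield tuples by choosing a random open split for every read."""
    # Assumption: only the first MAX_OPEN_SPLITS splits are opened.
    open_splits = [iter(s) for s in splits[:MAX_OPEN_SPLITS]]
    while open_splits:
        idx = rng.randrange(len(open_splits))
        try:
            yield next(open_splits[idx])
        except StopIteration:
            # This split is exhausted; drop it and keep reading the rest.
            open_splits.pop(idx)
```

For example, aggregating PartitionStats(10, 100.0) and PartitionStats(30, 200.0)
gives 40 rows with an average in-memory tuple size of 175.0.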
This patch is quite large, mainly because most of the changes at the logical
layer come from an old formatting issue which I fixed in this patch. Sorry
about that.
I have also updated the patch in RB. Please review it and let me know your
feedback. I have also left open some of the issues from your earlier review
comments that need your input.
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
> Key: PIG-2831
> URL: https://issues.apache.org/jira/browse/PIG-2831
> Project: Pig
> Issue Type: Sub-task
> Reporter: Prasanth J
> Assignee: Prasanth J
> Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch,
> PIG-2831.3.git.patch, PIG-2831.4.git.patch, PIG-2831.5.git.patch
>
>
> Implementing distributed cube materialization for holistic measures based on
> the MR-Cube approach described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (it can be detected automatically in a
> few cases; if automatic detection fails, the user should hint the algebraic
> attribute)
> 3) Modify MRPlan to insert a sampling job which executes the naive cube
> algorithm and generates an annotated cube lattice (containing large group
> partitioning information)
> 4) Modify the plan to distribute the annotated cube lattice to all mappers
> using the distributed cache
> 5) Execute the actual cube materialization on the full dataset
> 6) Modify MRPlan to insert a post-process job for combining the results of
> the actual cube materialization job
> 7) OOM exception handling