[
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437077#comment-13437077
]
Prasanth J commented on PIG-2831:
---------------------------------
Yes. It's true that skewed join and order by force the data to be written to
disk in a map-only job and then use PoissonSampleLoader and RandomSampleLoader,
respectively. PoissonSampleLoader loads n tuples from the dataset based on the
join key distribution and appends a special tuple at the end recording the
number of tuples loaded, whereas RandomSampleLoader simply loads 100 tuples
from each mapper. PoissonSampleLoader is definitely not applicable to our case.
RandomSampleLoader can be used, but we would need to specify how many samples
to load per mapper based on the overall data size. I think this method would
also be unreliable because it may lead to oversampling or undersampling, and we
would need to know the number of mappers before specifying the number of
samples per mapper. One more disadvantage of this approach is the cost of one
map-only job, which will be very expensive if the dataset is large. I also
noticed that after the dataset is forcibly copied to disk, its overall size
increases because of the InterStorage format.
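To make the oversampling/undersampling concern concrete, here is a minimal
sketch of the arithmetic. The function names and the target of 100,000 samples
are hypothetical and chosen only for illustration; this is not Pig code.

```python
import math


def samples_per_mapper(target_samples: int, num_mappers: int) -> int:
    """Per-mapper quota needed to reach roughly target_samples overall.

    Hypothetical helper: note that it cannot be evaluated until the
    number of mappers is known.
    """
    if num_mappers <= 0:
        raise ValueError("num_mappers must be positive")
    return math.ceil(target_samples / num_mappers)


def total_with_fixed_quota(quota: int, num_mappers: int) -> int:
    """Upper bound on the sample size when every mapper loads `quota` tuples."""
    return quota * num_mappers


TARGET = 100_000  # hypothetical desired sample size

for mappers in (10, 1_000, 50_000):
    total = total_with_fixed_quota(100, mappers)  # fixed 100 tuples per mapper
    if total < TARGET:
        verdict = "undersampled"
    elif total > TARGET:
        verdict = "oversampled"
    else:
        verdict = "on target"
    print(f"{mappers} mappers: {total} samples ({verdict}), "
          f"quota needed: {samples_per_mapper(TARGET, mappers)}")
```

With a fixed quota the total sample size tracks the number of mappers rather
than the size of the data, which is why the quota would have to be recomputed
for every job.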
Performance-wise, I found the current approach of using the SAMPLE operator to
be much faster. The entire sample extraction completes within a few minutes
(1 min 23 s for ~100K samples from 100M tuples). It also avoids the cost of an
additional map job and saves space.
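For comparison, fraction-based sampling needs neither the mapper count nor an
extra pass over the data. The sketch below assumes each tuple is kept
independently with a fixed probability, which is how I understand SAMPLE to
behave; the function name and the in-memory input are hypothetical.

```python
import random


def bernoulli_sample(tuples, fraction, seed=None):
    """Keep each tuple independently with probability `fraction`.

    Illustrative only: the result size is approximate, not exact.
    """
    rng = random.Random(seed)
    return [t for t in tuples if rng.random() < fraction]


# ~100K samples from 100M tuples corresponds to a fraction of 0.001.
FRACTION = 100_000 / 100_000_000

data = range(1_000_000)  # small stand-in dataset
sample = bernoulli_sample(data, FRACTION, seed=42)
print(f"fraction={FRACTION}, kept {len(sample)} of {len(data)} tuples")
```

The decision for each tuple is local, so the same fraction can be applied
inside whatever map tasks already read the data.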
I like the idea of using the LoadMetadata approach, but until the HCatalog work
is integrated we may not be able to use it.
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
> Key: PIG-2831
> URL: https://issues.apache.org/jira/browse/PIG-2831
> Project: Pig
> Issue Type: Sub-task
> Reporter: Prasanth J
> Assignee: Prasanth J
> Attachments: PIG-2831.1.git.patch, PIG-2831.2.git.patch,
> PIG-2831.3.git.patch
>
>
> Implementing distributed cube materialization for holistic measures based on
> the MR-Cube approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (can be detected automatically in a few
> cases; if automatic detection fails, the user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes the naive cube algorithm
> and generates an annotated cube lattice (contains large group partitioning
> information)
> 4) Modify plan to distribute the annotated cube lattice to all mappers using
> the distributed cache
> 5) Execute actual cube materialization on the full dataset
> 6) Modify MRPlan to insert a post-process job for combining the results of
> the actual cube materialization job
> 7) OOM exception handling
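As a rough illustration of the cube lattice and the naive cube algorithm
mentioned in step 3, the sketch below enumerates every group-by region over a
set of dimensions and the keys one row would contribute. The function names,
the sample row, and the use of None to mark a rolled-up dimension are
assumptions made for this example, not the code in the attached patches.

```python
from itertools import combinations


def cube_lattice(dimensions):
    """All group-by regions of a cube: every subset of the dimensions,
    ordered from the finest region down to the empty (grand total) region."""
    return [
        combo
        for size in range(len(dimensions), -1, -1)
        for combo in combinations(dimensions, size)
    ]


def naive_cube_keys(row, dimensions):
    """Keys a naive cube mapper would emit for one row: one per region,
    with None standing in for each rolled-up dimension."""
    return [
        tuple(row[d] if d in region else None for d in dimensions)
        for region in cube_lattice(dimensions)
    ]


dims = ["country", "state"]
row = {"country": "US", "state": "CA"}  # hypothetical input row

for region in cube_lattice(dims):
    print(region)
print(naive_cube_keys(row, dims))
```

A cube over n dimensions has 2^n regions, so every input row is emitted 2^n
times by the naive algorithm.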