[
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419009#comment-13419009
]
Prasanth J commented on PIG-2831:
---------------------------------
Hello everyone,
With reference to the description of this issue, I am working on step 3, which
involves creating a sampling job and executing the naive cube computation
algorithm over the sample dataset. The requirement for this sampling job is
that the sample size should be proportional to the size of the input data.
This sampling job is needed to determine the large group sizes and to partition
the large groups so that no single reducer gets overloaded with large groups.
One thing I am stuck with is dynamically choosing the sample size. In the
current implementation I am using the sample operator to load a fixed-size
sample (10% of the data). Since the sample size is not chosen dynamically, this
fixed sampling will result in oversampling for large datasets. To choose the
sample size dynamically, we need to know the total number of tuples in the
input dataset, but finding that number is not trivial. One way to estimate it
is to divide the total input size by the size of one tuple in memory. The
problem with this approach is that, because a tuple is a List<Object>, the
reported in-memory size of a tuple is much larger than the actual size of a row
in bytes. To verify this I tested with a simple dataset:
Input file size: 319 bytes
Actual number of rows: 13
Number of dimensions: 5
Schema: int, chararray, chararray, chararray, int
Actual row size: 319/13 ~= 25 bytes
In-memory tuple size reported: 264 bytes (~10x greater than the actual row size)
Since the in-memory tuple size is so much higher, we cannot make a good
estimate of the total number of rows in the dataset, and hence of the sample
size.
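The arithmetic above can be sketched as follows. This is just a toy illustration of the estimate (the class and method names are mine, not Pig's), using the numbers from my test dataset:

```java
// Sketch of the row-count estimate discussed above. The constants mirror the
// test dataset: a 319-byte input file with 13 rows, and a reported 264-byte
// in-memory tuple size.
public class RowEstimate {
    // Estimated rows = total input bytes / average bytes per row.
    static long estimateRows(long inputBytes, long bytesPerRow) {
        return inputBytes / bytesPerRow;
    }

    public static void main(String[] args) {
        long inputBytes = 319;
        long onDiskRow = 319 / 13;      // ~24 bytes per serialized row
        long inMemoryTuple = 264;       // size reported for the List<Object> tuple

        // Using the on-disk row size recovers roughly the true row count...
        System.out.println(estimateRows(inputBytes, onDiskRow));     // 13
        // ...but using the reported in-memory size is off by about 10x.
        System.out.println(estimateRows(inputBytes, inMemoryTuple)); // 1
    }
}
```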
Other approaches:
I looked into how PoissonSampleLoader and RandomSampleLoader work. Each takes
a different approach to loading a sample dataset. PoissonSampleLoader uses the
distribution of the skewed keys to generate sample rows that best represent the
underlying data. This loader inserts a special marker tuple at the end carrying
the number of rows in the dataset. Since this loader is specifically meant for
handling skewed keys, I cannot use it in my case for generating the sample
dataset.
To use RandomSampleLoader, we need to specify the number of samples to be
loaded beforehand so that the loader stops after loading that many tuples.
Since the sample size must be fixed before loading, we have no means to
dynamically size the sample for datasets of varying size.
Also, to use either of these two loaders we need to copy the entire dataset to
a temp file and then load the sample from that file, which costs an additional
map job. I don't know why the dataset needs to be copied to a temp file and
read back again. My understanding (from what I can see in the source) is that
these loader classes can only read the InterStorage format.
I have listed below a few pros and cons of the different approaches.
1) +Using sample operator+
*Pros:*
One fewer map job compared to the other loaders
*Cons:*
Reads the entire dataset to generate the sample, because the sample operator is
implemented as a filter with a RANDOM udf and a less-than expression (the
sample fraction) after projecting the input columns.
May result in oversampling for larger datasets
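As a minimal sketch of that behaviour (the class and method names are mine, not Pig's): every row is still read, and a row survives when a uniform random draw falls below the fixed fraction, which is why a fixed 10% fraction oversamples large inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of how the sample operator behaves: scan every row, draw RANDOM,
// and keep the row when the draw is under the fixed fraction.
public class SampleOperatorSketch {
    static <T> List<T> sample(List<T> rows, double fraction, Random rng) {
        List<T> out = new ArrayList<>();
        for (T row : rows) {                 // the whole dataset is scanned...
            if (rng.nextDouble() < fraction) {
                out.add(row);                // ...and ~fraction of it is kept
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) rows.add(i);
        // Roughly 10% of the 100,000 rows survive the filter; on 20B rows the
        // same fixed fraction would keep ~2B rows, far more than needed.
        int kept = sample(rows, 0.10, new Random(42)).size();
        System.out.println(kept);
    }
}
```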
2) +RandomSampleLoader+
*Pros:*
Fixed sample size (the paper referenced in the description mentions that a
sample of 2M tuples is good enough to represent 20B tuples, and 100K is good
enough for 2B tuples; please refer to page 6 of the paper.)
Stops reading after the sample size is reached (useful for large datasets) -
NOT sure about this!! Please correct me if I am wrong.
*Cons:*
1 additional map job required (including post-processing there will be 4 MR
jobs, with 2 map-only jobs)
Since a fixed sample size is used, this method is not scalable
3) +PoissonSampleLoader+
*Pros:*
Dynamically determines sample size
Can determine number of rows in dataset using special tuple
*Cons:*
1 additional map job required (including post-processing there will be 4 MR
jobs, with 2 map-only jobs)
Not suitable for my use case, since the sample size generated is not
proportional to the input size
I think what I need is a hybrid loader (combining ideas from the random and
poisson loaders) that dynamically sizes the sample based on the input dataset
size. Any thoughts on how I can make the sample size proportional to the input
data size? Or is there any way I can find the number of rows in a dataset? Am I
missing any other ideas for finding/estimating the number of rows in the
dataset?
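One direction I am considering for such a hybrid is reservoir sampling: it keeps a bounded, uniform sample in a single pass without knowing the total row count up front, and the same pass yields the exact row count as a by-product (similar in spirit to PoissonSampleLoader's special marker tuple). This is only a sketch of the idea, not anything Pig provides today, and the class is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of a reservoir sampler: a bounded uniform sample plus an exact row
// count, both obtained in one pass with no prior knowledge of the input size.
public class ReservoirSketch {
    final List<Long> reservoir = new ArrayList<>();
    long seen = 0;                      // exact row count once the pass ends
    final int capacity;
    final Random rng;

    ReservoirSketch(int capacity, Random rng) {
        this.capacity = capacity;
        this.rng = rng;
    }

    void offer(long row) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(row);         // fill the reservoir first
        } else {
            // Replace a random slot with probability capacity/seen, which
            // keeps every row equally likely to end up in the sample.
            long j = (long) (rng.nextDouble() * seen);
            if (j < capacity) {
                reservoir.set((int) j, row);
            }
        }
    }

    public static void main(String[] args) {
        ReservoirSketch r = new ReservoirSketch(100, new Random(7));
        for (long i = 0; i < 1_000_000; i++) r.offer(i);
        System.out.println(r.seen);             // 1000000: exact count
        System.out.println(r.reservoir.size()); // 100: bounded sample size
    }
}
```

The sample size here is fixed per pass rather than proportional to the input, so the proportionality question above still stands; the appeal is that the exact count from the first pass could drive the sample size for the real sampling job.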
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
> Key: PIG-2831
> URL: https://issues.apache.org/jira/browse/PIG-2831
> Project: Pig
> Issue Type: Sub-task
> Reporter: Prasanth J
>
> Implementing distributed cube materialization on holistic measures based on
> the MR-Cube approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (it can be detected automatically in a
> few cases; if automatic detection fails, the user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm
> and generates annotated cube lattice (contains large group partitioning
> information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of
> actual cube materialization job
> 7) OOM exception handling