[
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419009#comment-13419009
]
Prasanth J commented on PIG-2831:
---------------------------------
Hello everyone,
With reference to the description of this issue, I am working on step 3, which
involves creating a sampling job and executing the naive cube computation
algorithm over the sample dataset. The requirement for this sampling job is
that the sample size should be proportional to the size of the input data.
This sampling job is needed to determine the large group sizes and to partition
the large groups so that no single reducer gets overloaded with large groups.
One thing I am stuck with is dynamically choosing the sample size. In the
current implementation I am using the sample operator to load a fixed-size
sample (10% of the data). Since the sample size is not chosen dynamically, this
fixed sampling will result in oversampling for large datasets. To choose the
sample size dynamically, we need to know the total number of tuples in the
input dataset, but finding that number is not trivial. One way to estimate it
is to divide the total input size by the size of one tuple in memory. The
problem with this approach is that, because a tuple is a List<Object>, the
reported in-memory size of a tuple is much larger than the actual size of a row
in bytes. To verify this I tested with a simple dataset:
Input file size: 319 bytes
Actual number of rows: 13
Number of dimensions: 5
Schema: int, chararray, chararray, chararray, int
Actual row size: 319/13 ~= 25 bytes
In-memory tuple size reported: 264 bytes (~10x greater than the actual row size)
Since the in-memory tuple size is so much higher, we cannot make a good
estimate of the total number of rows in the dataset, and hence of the sample
size.
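The arithmetic above can be sketched as follows. This is just a toy illustration of the estimate (the class and method names are mine, not Pig's), using the numbers from my test dataset:

```java
// Sketch of the row-count estimate discussed above. The constants mirror the
// test dataset: a 319-byte input file with 13 rows, and a reported 264-byte
// in-memory tuple size.
public class RowEstimate {
    // Estimated rows = total input bytes / average bytes per row.
    static long estimateRows(long inputBytes, long bytesPerRow) {
        return inputBytes / bytesPerRow;
    }

    public static void main(String[] args) {
        long inputBytes = 319;
        long onDiskRow = 319 / 13;      // ~24 bytes per serialized row
        long inMemoryTuple = 264;       // size reported for the List<Object> tuple

        // Using the on-disk row size recovers roughly the true row count...
        System.out.println(estimateRows(inputBytes, onDiskRow));     // 13
        // ...but using the reported in-memory size is off by about 10x.
        System.out.println(estimateRows(inputBytes, inMemoryTuple)); // 1
    }
}
```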
Other approaches:
I looked into how PoissonSampleLoader and RandomSampleLoader work. Each takes
a different approach to loading a sample dataset. PoissonSampleLoader uses the
distribution of the skewed keys to generate sample rows that best represent the
underlying data. This loader inserts a special marker tuple at the end carrying
the number of rows in the dataset. Since this loader is specifically meant for
handling skewed keys, I cannot use it in my case for generating the sample
dataset.
To use RandomSampleLoader, we need to specify the number of samples to be
loaded beforehand so that the loader stops after loading that many tuples.
Since the sample size must be fixed before loading, we have no means to
dynamically size the sample for datasets of varying size.
Also, to use either of these two loaders we need to copy the entire dataset to
a temp file and then load the sample from that file, which costs an additional
map job. I don't know why the dataset needs to be copied to a temp file and
read back again. My understanding (from what I can see in the source) is that
these loader classes can only read the InterStorage format.
I have listed below a few pros and cons of the different approaches.
1) +Using sample operator+
*Pros:*
One fewer map job compared to the other loaders
*Cons:*
Reads the entire dataset to generate the sample, because the sample operator is
implemented as a filter with a RANDOM udf and a less-than expression (the
sample fraction) after projecting the input columns.
May result in oversampling for larger datasets
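As a minimal sketch of that behaviour (the class and method names are mine, not Pig's): every row is still read, and a row survives when a uniform random draw falls below the fixed fraction, which is why a fixed 10% fraction oversamples large inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of how the sample operator behaves: scan every row, draw RANDOM,
// and keep the row when the draw is under the fixed fraction.
public class SampleOperatorSketch {
    static <T> List<T> sample(List<T> rows, double fraction, Random rng) {
        List<T> out = new ArrayList<>();
        for (T row : rows) {                 // the whole dataset is scanned...
            if (rng.nextDouble() < fraction) {
                out.add(row);                // ...and ~fraction of it is kept
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) rows.add(i);
        // Roughly 10% of the 100,000 rows survive the filter; on 20B rows the
        // same fixed fraction would keep ~2B rows, far more than needed.
        int kept = sample(rows, 0.10, new Random(42)).size();
        System.out.println(kept);
    }
}
```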
2) +RandomSampleLoader+
*Pros:*
Fixed sample size (the paper referenced in the description mentions that a
sample of 2M tuples is good enough to represent 20B tuples, and 100K is good
enough for 2B tuples; please refer to page 6 of the paper.)
Stops reading after the sample size is reached (useful for large datasets) -
NOT sure about this!! Please correct me if I am wrong.
*Cons:*
1 additional map job required (including post-processing there will be 4 MR
jobs, with 2 map-only jobs)
Since a fixed sample size is used, this method is not scalable
3) +PoissonSampleLoader+
*Pros:*
Dynamically determines sample size
Can determine number of rows in dataset using special tuple
*Cons:*
1 additional map job required (including post-processing there will be 4 MR
jobs, with 2 map-only jobs)
Not suitable for my use case, since the sample size generated is not
proportional to the input size
I think what I need is a hybrid loader (combining ideas from the random and
poisson loaders) that dynamically sizes the sample based on the input dataset
size. Any thoughts on how I can make the sample size proportional to the input
data size? Or is there any way I can find the number of rows in a dataset? Am I
missing any other ideas for finding/estimating the number of rows in the
dataset?
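One direction I am considering for such a hybrid is reservoir sampling: it keeps a bounded, uniform sample in a single pass without knowing the total row count up front, and the same pass yields the exact row count as a by-product (similar in spirit to PoissonSampleLoader's special marker tuple). This is only a sketch of the idea, not anything Pig provides today, and the class is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of a reservoir sampler: a bounded uniform sample plus an exact row
// count, both obtained in one pass with no prior knowledge of the input size.
public class ReservoirSketch {
    final List<Long> reservoir = new ArrayList<>();
    long seen = 0;                      // exact row count once the pass ends
    final int capacity;
    final Random rng;

    ReservoirSketch(int capacity, Random rng) {
        this.capacity = capacity;
        this.rng = rng;
    }

    void offer(long row) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(row);         // fill the reservoir first
        } else {
            // Replace a random slot with probability capacity/seen, which
            // keeps every row equally likely to end up in the sample.
            long j = (long) (rng.nextDouble() * seen);
            if (j < capacity) {
                reservoir.set((int) j, row);
            }
        }
    }

    public static void main(String[] args) {
        ReservoirSketch r = new ReservoirSketch(100, new Random(7));
        for (long i = 0; i < 1_000_000; i++) r.offer(i);
        System.out.println(r.seen);             // 1000000: exact count
        System.out.println(r.reservoir.size()); // 100: bounded sample size
    }
}
```

The sample size here is fixed per pass rather than proportional to the input, so the proportionality question above still stands; the appeal is that the exact count from the first pass could drive the sample size for the real sampling job.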
> MR-Cube implementation (Distributed cubing for holistic measures)
> -----------------------------------------------------------------
>
> Key: PIG-2831
> URL: https://issues.apache.org/jira/browse/PIG-2831
> Project: Pig
> Issue Type: Sub-task
> Reporter: Prasanth J
>
> Implementing distributed cube materialization on holistic measures based on
> the MR-Cube approach as described in http://arnab.org/files/mrcube.pdf.
> Primary steps involved:
> 1) Identify if the measure is holistic or not
> 2) Determine the algebraic attribute (it can be detected automatically in a
> few cases; if automatic detection fails, the user should hint the algebraic attribute)
> 3) Modify MRPlan to insert a sampling job which executes naive cube algorithm
> and generates annotated cube lattice (contains large group partitioning
> information)
> 4) Modify plan to distribute annotated cube lattice to all mappers using
> distributed cache
> 5) Execute actual cube materialization on full dataset
> 6) Modify MRPlan to insert a post process job for combining the results of
> actual cube materialization job
> 7) OOM exception handling