Thanks Alan.
My requirement is to load N samples based on the input file size and perform a 
naive cube computation to determine the large groups that will not fit in a 
reducer's memory. I need to know the exact number of samples in order to 
calculate the partition factor for the large groups. 
Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. 
Without knowing the number of mappers, I cannot determine the exact number of 
samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple 
(as PoissonSampleLoader does) indicating the number of samples loaded. 
Is there any other way to know the exact number of samples loaded? 
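To make the partition-factor calculation concrete, here is a rough sketch in plain Java of what I have in mind (not Pig's actual API; the class, method names, and numbers are all made up): scale each group's sampled count up to an estimated full size, then split any group that exceeds a reducer's capacity across enough partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: estimate per-group sizes from a sample and derive a
// partition factor for groups too large to fit in a single reducer.
public class PartitionFactorSketch {

    // Count sample tuples per group key.
    static Map<String, Integer> groupCounts(List<String> sampleKeys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String k : sampleKeys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    // Scale a group's sampled count up to an estimated full size, then split it
    // across enough partitions that each stays under reducerCapacity tuples.
    // This is exactly why the total sample count must be known precisely.
    static int partitionFactor(int sampledCount, long totalRows,
                               long totalSamples, long reducerCapacity) {
        long estimatedGroupSize = sampledCount * totalRows / totalSamples;
        // Ceiling division: number of reducer-sized partitions needed.
        return (int) Math.max(1,
                (estimatedGroupSize + reducerCapacity - 1) / reducerCapacity);
    }

    public static void main(String[] args) {
        // Fabricated skewed sample: 90 tuples for "hot", 10 for "cold".
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < 90; i++) sample.add("hot");
        for (int i = 0; i < 10; i++) sample.add("cold");

        Map<String, Integer> counts = groupCounts(sample);
        long totalRows = 1_000_000;      // total input size (assumed known)
        long totalSamples = sample.size(); // this is the number I need exactly
        long reducerCapacity = 200_000;  // tuples a reducer can hold (assumed)

        // "hot": estimated 900,000 tuples -> needs 5 partitions.
        System.out.println(partitionFactor(counts.get("hot"),
                totalRows, totalSamples, reducerCapacity));
        // "cold": estimated 100,000 tuples -> fits in 1 partition.
        System.out.println(partitionFactor(counts.get("cold"),
                totalRows, totalSamples, reducerCapacity));
    }
}
```

Note how an error in totalSamples shifts every estimated group size, which is why a marker tuple carrying the sample count (as in PoissonSampleLoader) would help.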

By analyzing the MR plans of order-by and skewed-join, it appears that the 
entire dataset is copied to a temp file and the SampleLoaders then load samples 
from that temp file. Is there any specific reason for this redundant copy? Is 
it because SampleLoaders can only read Pig's internal I/O format? 

Thanks
-- Prasanth

On Jul 25, 2012, at 6:49 PM, Alan Gates <ga...@hortonworks.com> wrote:

> No.  The number of mappers is determined by the InputFormat used by your load 
> function (TextInputFormat if you're using the default PigStorage loader) when 
> the Hadoop job is submitted.  Pig doesn't have access to that info until it 
> hands the jobs off to MapReduce.
> 
> Alan.
> 
> On Jul 25, 2012, at 3:47 PM, Prasanth J wrote:
> 
>> Hello everyone
>> 
>> I would like to know if there is a way to determine the number of mappers 
>> while compiling the physical plan to the MR plan. 
>> 
>> Thanks
>> -- Prasanth
>> 
> 