I see. Thanks, Alan, for your reply.
Also, one more question I posted earlier:

I used RandomSampleLoader and specified a sample size of 100. The number of map 
tasks executed is 110, so I expect the total number of samples received at the 
reducer to be 110 * 100 = 11000, but it is always more than that: the actual 
number of tuples received is between 14000 and 15000. I am not sure whether 
this is a bug or I am missing something. Is this expected behavior?
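
To make my expectation concrete: I am assuming each mapper contributes at most 
sampleSize tuples, along the lines of the reservoir sampler sketched below. 
This is a hypothetical simplification for illustration, not Pig's actual 
RandomSampleLoader code.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CappedSampleSketch {
    // Keep at most sampleSize items from the tuples one mapper sees.
    static <T> List<T> sample(Iterable<T> stream, int sampleSize, Random rnd) {
        List<T> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        for (T item : stream) {
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(item);
            } else {
                // Replace a kept item with probability sampleSize / seen.
                long j = (long) (rnd.nextDouble() * seen);
                if (j < sampleSize) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir; // exactly min(sampleSize, seen) tuples
    }

    public static void main(String[] args) {
        List<Integer> split = new ArrayList<>();
        for (int i = 0; i < 10000; i++) split.add(i);
        // Under this assumption, 110 mappers * 100 samples = 11000 tuples
        // should arrive at the reducer.
        System.out.println(sample(split, 100, new Random()).size()); // 100
    }
}

If each mapper really is capped like this, I don't see where the extra 
3000-4000 tuples could come from.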

Thanks
-- Prasanth

On Aug 23, 2012, at 6:20 PM, Alan Gates <ga...@hortonworks.com> wrote:

> Sorry for the very slow response, but here it is, hopefully better late than 
> never.
> 
> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
> 
>> Thanks Alan.
>> My requirement is to load N samples, where N is based on the input file 
>> size, and then perform a naive cube computation to determine the large 
>> groups that will not fit in the reducer's memory. I need to know the exact 
>> number of samples in order to calculate the partition factor for the large 
>> groups. Currently I am using RandomSampleLoader to load 1000 tuples from 
>> each mapper, but without knowing the number of mappers I cannot determine 
>> the exact number of samples loaded. Also, RandomSampleLoader doesn't attach 
>> a special marker tuple (as PoissonSampleLoader does) that reports the number 
>> of samples loaded. 
>> Is there any other way to know the exact number of samples loaded? 
> Not that I know of.
> 
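For context, the marker-tuple mechanism I was referring to works roughly like 
the sketch below: each mapper appends one special tuple carrying its sample 
count, so the reducer can recover the exact total without knowing the number 
of mappers. Class and field names here are made up for illustration; this is 
not Pig's actual PoissonSampleLoader code.

import java.util.ArrayList;
import java.util.List;

public class MarkerTupleSketch {
    static final String MARKER = "__num_samples__";

    // Mapper side: emit the samples, then one marker tuple with the count.
    static List<Object[]> emitWithMarker(List<Object[]> samples) {
        List<Object[]> out = new ArrayList<>(samples);
        out.add(new Object[] { MARKER, (long) samples.size() });
        return out;
    }

    // Reducer side: sum the marker counts to get the exact sample total.
    static long totalSamples(Iterable<Object[]> tuples) {
        long total = 0;
        for (Object[] t : tuples) {
            if (t.length == 2 && MARKER.equals(t[0])) {
                total += (Long) t[1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object[]> samples = new ArrayList<>();
        samples.add(new Object[] { "a", 1 });
        samples.add(new Object[] { "b", 2 });
        System.out.println(totalSamples(emitWithMarker(samples))); // prints 2
    }
}
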
>> 
>> From the MR plans of order-by and skewed-join, it seems that the entire 
>> dataset is copied to a temp file and the SampleLoaders then load their 
>> samples from that temp file. Is there any specific reason for this redundant 
>> copy? Is it because SampleLoaders can only use Pig's internal I/O format? 
> Partly, but also because it allows any operators that need to run before the 
> sample (like project or filter) to be placed in the pipeline.
> 
> Alan.
> 
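
That makes sense; so for a script that filters and then orders, for example, 
the filter runs while the temp file is being written, and the sampler only 
ever sees already-filtered tuples.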
