I think we decided instead to stub in a special loader that reads a few records from each underlying split in a single mapper (by using a single wrapping split), right?
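For context, the wrapping-split idea might look roughly like the sketch below (a rough sketch only, against the Hadoop new-API classes; SampledRecordReader, recordsPerSplit, and the LongWritable/Text key-value types are my own assumptions, and split serialization and error handling are omitted). The point is that one mapper walks every underlying split and takes a bounded number of records from each:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical reader for a single wrapping split: it visits each
// underlying split in turn and emits at most recordsPerSplit records
// from it, so a single mapper produces the whole sample.
public class SampledRecordReader extends RecordReader<LongWritable, Text> {
    private final InputFormat<LongWritable, Text> underlying;
    private final List<InputSplit> wrappedSplits;
    private final int recordsPerSplit;

    private RecordReader<LongWritable, Text> current;
    private TaskAttemptContext context;
    private int splitIndex = 0;
    private int emittedFromCurrent = 0;

    public SampledRecordReader(InputFormat<LongWritable, Text> underlying,
                               List<InputSplit> wrappedSplits,
                               int recordsPerSplit) {
        this.underlying = underlying;
        this.wrappedSplits = wrappedSplits;
        this.recordsPerSplit = recordsPerSplit;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.context = context;
        advance();  // open the first underlying split
    }

    // Close the current underlying reader and open the next split,
    // or leave current null once all splits have been visited.
    private void advance() throws IOException, InterruptedException {
        if (current != null) {
            current.close();
        }
        current = null;
        if (splitIndex < wrappedSplits.size()) {
            InputSplit next = wrappedSplits.get(splitIndex++);
            current = underlying.createRecordReader(next, context);
            current.initialize(next, context);
            emittedFromCurrent = 0;
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        while (current != null) {
            if (emittedFromCurrent < recordsPerSplit && current.nextKeyValue()) {
                emittedFromCurrent++;
                return true;
            }
            advance();  // quota reached or split exhausted: move on
        }
        return false;  // every wrapped split has been visited
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return current.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return current.getCurrentValue();
    }

    @Override
    public float getProgress() {
        return wrappedSplits.isEmpty()
                ? 1.0f : (float) splitIndex / wrappedSplits.size();
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
        }
    }
}

A nice side effect for the question below: with a single wrapping split, the total sample size is exactly wrappedSplits.size() * recordsPerSplit, so the count arriving at the reducer is known up front, without any marker tuples.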
On Thu, Aug 23, 2012 at 7:55 PM, Prasanth J <[email protected]> wrote:
> I see. Thanks, Alan, for your reply.
> One more question that I posted earlier:
>
> I used RandomSampleLoader and specified a sample size of 100. The number of
> map tasks executed is 110, so I expect the total number of samples received
> on the reducer to be 110*100 = 11000, but it is always more than that. The
> actual number of tuples received is between 14000 and 15000. I am not sure
> whether this is a bug or whether I am missing something. Is this expected
> behavior?
>
> Thanks
> -- Prasanth
>
> On Aug 23, 2012, at 6:20 PM, Alan Gates <[email protected]> wrote:
>
>> Sorry for the very slow response, but here it is, hopefully better late
>> than never.
>>
>> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
>>
>>> Thanks, Alan.
>>> My requirement is to load N samples, where N is based on the input file
>>> size, and perform a naive cube computation to determine the large groups
>>> that will not fit in a reducer's memory. I need to know the exact number
>>> of samples in order to calculate the partition factor for the large
>>> groups.
>>> Currently I am using RandomSampleLoader to load 1000 tuples from each
>>> mapper. Without knowing the number of mappers, I cannot find the exact
>>> number of samples loaded. Also, RandomSampleLoader does not attach any
>>> special marker tuples (as PoissonSampleLoader does) that report the
>>> number of samples loaded.
>>> Is there any other way to know the exact number of samples loaded?
>> Not that I know of.
>>
>>> By analyzing the MR plans of order-by and skewed-join, it seems that the
>>> entire dataset is copied to a temp file and the SampleLoaders then use
>>> the temp file to load samples. Is there any specific reason for this
>>> redundant copy? Is it because SampleLoaders can only use Pig's internal
>>> I/O format?
>> Partly, but also because it allows any operators that need to run before
>> the sample (such as project or filter) to be placed in the pipeline.
>>
>> Alan.
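On the marker-tuple point above, the technique is simple enough to sketch: each mapper appends one extra tuple carrying its own sample count, and the reducer sums the markers to recover the exact total regardless of how many mappers ran. This is only illustrative, in the spirit of PoissonSampleLoader rather than its actual code; the class name and sentinel value are made up:

import java.io.IOException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Illustrative helper, not PoissonSampleLoader itself: count the
// samples a mapper emits, then append one extra tuple carrying that
// count so the reducer can compute the exact number of samples.
public class SampleCountMarker {
    private static final String MARKER_FIELD = "sample.count.marker"; // made-up sentinel
    private static final TupleFactory tf = TupleFactory.getInstance();

    private long emitted = 0;
    private boolean markerSent = false;

    // Call this for every sampled tuple the loader returns.
    public Tuple countAndPass(Tuple sample) {
        emitted++;
        return sample;
    }

    // Call this once when the mapper's input is exhausted.
    public Tuple endOfInputMarker() throws IOException {
        if (markerSent) {
            return null;
        }
        markerSent = true;
        Tuple marker = tf.newTuple(2);
        marker.set(0, MARKER_FIELD);   // lets the reducer recognize marker tuples
        marker.set(1, emitted);        // this mapper's exact sample count
        return marker;
    }
}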
