[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776526#action_12776526 ]
Thejas M Nair commented on PIG-1062:
------------------------------------

Proposal for sampling in RandomSampleLoader (as well as the SampleLoader class), which is used for order-by queries:

Problem: With the new interface, we cannot use the old approach of dividing the size of the file by the number of samples required and skipping that many bytes to get a new sample.

Proposal: Use the approach proposed by Dmitriy for sampling:
bq. In getNext(), we can now allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, we generate a random number r such that 0<=r<i, and if r<T we insert the new tuple into our buffer at position r. This gives us a nicely random sample of the tuples in the partition.

To avoid parsing all tuples, RecordReader.nextKeyValue() will be called (instead of loader.getNext()) if the current tuple is to be skipped.

bq. It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?
Arun C Murthy (MapReduce committer) has agreed to elaborate on his recommendation on this in the JIRA.

> load-store-redesign branch: change SampleLoader and subclasses to work with
> new LoadFunc interface
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and subclasses (RandomSampleLoader, PoissonSampleLoader) need to
> be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join.
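
For illustration only, the buffer-based sampling described in the proposal above could be sketched roughly as follows. This is not the actual RandomSampleLoader code: the class name ReservoirSampleSketch, the sample() method, and its parameters are hypothetical. A plain java.util.Iterator stands in for the RecordReader, and the parse function stands in for the costly loader.getNext() call, so that skipped records are advanced past without being parsed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical sketch; names are assumptions, not Pig or Hadoop APIs.
final class ReservoirSampleSketch {

    /**
     * Returns up to `capacity` (T in the proposal) parsed records chosen
     * uniformly at random from `records`.
     * Assumes capacity >= 1 and fewer than Integer.MAX_VALUE records.
     */
    static <R, T> List<T> sample(Iterator<R> records, int capacity,
                                 Function<R, T> parse, Random rng) {
        List<T> buffer = new ArrayList<>(capacity);
        int i = 0; // number of records seen so far, including the current one
        while (records.hasNext()) {
            R raw = records.next(); // cheap advance, like nextKeyValue()
            i++;
            if (buffer.size() < capacity) {
                // The first T records always populate the buffer.
                buffer.add(parse.apply(raw));
                continue;
            }
            // Draw r with 0 <= r < i; keep the record only if r < T.
            int r = rng.nextInt(i);
            if (r < capacity) {
                // Parse only records that are kept, like loader.getNext().
                buffer.set(r, parse.apply(raw));
            }
            // Otherwise the record is skipped without being parsed.
        }
        return buffer;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int k = 0; k < 1000; k++) {
            lines.add("row-" + k);
        }
        List<String> picked =
            sample(lines.iterator(), 5, String::toUpperCase, new Random());
        System.out.println(picked);
    }
}
```

Drawing r before deciding whether to parse is what lets the skipped records bypass parsing; with T much smaller than the partition size, most records past the first T fall into the skip branch.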