[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771979#action_12771979
 ] 

Ying He commented on PIG-1062:
------------------------------

I would suggest adding the total number of tuples in a split to the last 
sample as a field. All other sample tuples can have this field set to NULL. Then 
PartitionSkewedKey.calculateReducers can add up this field across all the 
samples to get the total number of tuples in the input.
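
A minimal sketch of this idea in plain Java, not the actual Pig code. The
class name, the method names, and the [key, count] sample layout are
assumptions made for illustration only; the real change would live in the
sample loader and in PartitionSkewedKey.calculateReducers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCountSketch {
    // Hypothetical sample layout: [key, splitTupleCount].
    // The count is null on every sample except the last one of a split.
    public static List<Object[]> samplesForSplit(List<String> sampledKeys, long tuplesInSplit) {
        List<Object[]> out = new ArrayList<>();
        for (int i = 0; i < sampledKeys.size(); i++) {
            boolean last = (i == sampledKeys.size() - 1);
            out.add(new Object[] { sampledKeys.get(i), last ? Long.valueOf(tuplesInSplit) : null });
        }
        return out;
    }

    // Adds up the trailing count field over all samples, skipping nulls.
    public static long totalInputTuples(List<Object[]> allSamples) {
        long total = 0;
        for (Object[] sample : allSamples) {
            Object count = sample[sample.length - 1];
            if (count != null) {
                total += (Long) count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object[]> all = new ArrayList<>();
        all.addAll(samplesForSplit(Arrays.asList("a", "b", "c"), 1000L));
        all.addAll(samplesForSplit(Arrays.asList("b", "d"), 250L));
        System.out.println(totalInputTuples(all)); // prints 1250
    }
}
```

In this sketch a split that produces no samples contributes nothing to the
total, so a real implementation would need to decide how to handle that case.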

If we use a separate tuple with a different format to represent the total number of 
tuples, that would involve a bigger change. The sampling job currently adds an 
"all" to all samples to group them into one bag, and then sorts the tuples by 
key. If the tuples are of different formats, the execution plan has to be made 
more complex to deal with these special tuples.
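
To illustrate why a uniform format keeps the plan simple, here is a rough
plain-Java imitation of the "group all, then sort by key" step. It is not
Pig's execution plan; the class name, the method name, and the [key, count]
layout are hypothetical. Because every sample has the same shape, one
comparator handles the whole bag, including the sample that carries the count.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupAllSortSketch {
    // Puts every sample into a single bag under the constant group key "all",
    // then sorts that bag by the sample key (field 0). The trailing count
    // field is ignored by the comparator, so null and non-null counts mix freely.
    public static Map<String, List<Object[]>> groupAllAndSort(List<Object[]> samples) {
        List<Object[]> bag = new ArrayList<>(samples);
        bag.sort(Comparator.comparing((Object[] t) -> (String) t[0]));
        Map<String, List<Object[]>> grouped = new HashMap<>();
        grouped.put("all", bag);
        return grouped;
    }

    public static void main(String[] args) {
        List<Object[]> samples = new ArrayList<>();
        samples.add(new Object[] { "c", null });
        samples.add(new Object[] { "a", null });
        samples.add(new Object[] { "b", 3L }); // last sample of a split carries the count
        for (Object[] t : groupAllAndSort(samples).get("all")) {
            System.out.println(t[0] + "\t" + t[1]);
        }
    }
}
```

A separate tuple with a different shape would need its own branch in both the
grouping and the sorting steps, which is the extra plan complexity described
above.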

> load-store-redesign branch: change SampleLoader and subclasses to work with 
> new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out 
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need 
> to be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.