[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772704#action_12772704
 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

As indicated in the previous comment, I am planning to go ahead with the [earlier 
proposal|https://issues.apache.org/jira/browse/PIG-1062?focusedCommentId=12772197&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12772197].
The current sampling frequency would be one tuple every ( (H/s) * (1/17) ) 
tuples.

In PartitionSkewedKey.exec(), the number of reducers for join key k1 can be 
computed as (no_of_samples(k1) / 17). But the accuracy of this calculation 
depends on how accurately the average tuple size (s in (H/s) * (1/17)) is 
computed. Sending a special tuple with the number of rows in the split will 
likely lead to a more accurate estimate of the number of reducers required.
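
To make the arithmetic above concrete, here is a rough sketch in Python (not Pig's actual code). It assumes H is the memory available for holding tuples and s is the average tuple size, both in bytes; the comment does not define H here, so that reading is an assumption. The function names and the rounding up to at least one reducer are also hypothetical choices for illustration.

```python
import math

# The constant 17 from the comment: samples taken per memory-full of tuples.
SAMPLES_PER_MEMORY_UNIT = 17


def sample_interval(heap_bytes, avg_tuple_bytes):
    """Tuples between consecutive samples: (H/s) * (1/17).

    heap_bytes (H) and avg_tuple_bytes (s) are assumed to be in bytes.
    """
    tuples_in_memory = heap_bytes / avg_tuple_bytes  # H/s
    return tuples_in_memory / SAMPLES_PER_MEMORY_UNIT


def reducers_for_key(sample_count):
    """Reducers for one join key: no_of_samples(k1) / 17.

    Rounding up, with a minimum of one reducer, is an assumption made
    for this sketch; the comment only gives the ratio.
    """
    return max(1, math.ceil(sample_count / SAMPLES_PER_MEMORY_UNIT))


# Hypothetical numbers: H = 170,000,000 bytes, s = 100 bytes.
interval = sample_interval(170_000_000, 100)   # one sample every 100,000 tuples
samples_k1 = int(5_000_000 // interval)        # a key with 5,000,000 rows -> 50 samples
print(interval, samples_k1, reducers_for_key(samples_k1))
```

With these made-up numbers, 50 samples for k1 give 3 reducers. If the true average tuple size were 200 bytes rather than the estimated 100, the same rows would need roughly twice the memory, which is why the estimate is sensitive to s.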

> load-store-redesign branch: change SampleLoader and subclasses to work with 
> new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out 
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) need to 
> be changed to work with the new LoadFunc interface.  
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.