[ https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2014:
-----------------------------------

    Attachment: PIG-2014.patch

Implemented the suggested approach to fixing this. Please review.

> SAMPLE shouldn't be pushed up
> -----------------------------
>
>                 Key: PIG-2014
>                 URL: https://issues.apache.org/jira/browse/PIG-2014
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jacob Perkins
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> grouped = GROUP tfidf_all BY doc_id;
> vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should be exactly the number of documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce output records should be much closer to 22 or 23 (i.e. 0.0012 * 18,863). Evidently, Pig rewrites SAMPLE into a FILTER and then pushes that filter in front of the GROUP. It shouldn't push that filter, since the sampling UDF is non-deterministic.
> Quick fix: add "-t PushUpFilter" to the command line when invoking Pig and this won't happen.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
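For reviewers, a sketch of what goes wrong. Pig implements SAMPLE as a FILTER on a random draw, so the script above is conceptually equivalent to the following (the exact internal rewrite may differ; this is illustrative only):

{code:none}
-- Intended plan: one random draw per *group* (vector), after the GROUP.
sampled = FILTER vectors BY RANDOM() <= 0.0012;

-- What PushUpFilter effectively produces: one random draw per *input record*,
-- before the GROUP. Any group that keeps at least one of its records survives.
tfidf_sampled = FILTER tfidf_all BY RANDOM() <= 0.0012;
grouped = GROUP tfidf_sampled BY doc_id;
{code}

With roughly 76 records per group on average (1,428,280 / 18,863), the chance a group loses all its records to a per-record 0.0012 filter is small, so far more than 0.12% of groups survive, which is consistent with the observed 1,513 reduce output records instead of ~23. This is why pushing a filter containing a non-deterministic UDF above a GROUP changes the query's semantics.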