[ 
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031730#comment-13031730
 ] 

Dmitriy V. Ryaboy commented on PIG-2014:
----------------------------------------

Daniel,
So this is interesting. I took my fix out, left the test in, and the test still 
passed -- because, as you correctly pointed out, TestNewPlanFilterAboveForeach 
only invokes a few of the rules. If I add PushUpFilter to MyPlanOptimizer 
within that test, my new test starts failing if the fix is not present, and 
passes if the fix is present. So the PushUpFilter is definitely at least part 
of what's causing the movement of Filter in this case.

So I need to fix up PushDownForEachFlatten and FilterAboveForeach, *and* I need 
to fix my test :).

> SAMPLE shouldn't be pushed up
> -----------------------------
>
>                 Key: PIG-2014
>                 URL: https://issues.apache.org/jira/browse/PIG-2014
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.0, 0.10
>            Reporter: Jacob Perkins
>            Assignee: Dmitriy V. Ryaboy
>             Fix For: 0.9.0
>
>         Attachments: PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> grouped   = GROUP tfidf_all BY doc_id;
> vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 
> 1,428,280 records. The number of reduce output records should be exactly the 
> number of documents, which turns out to be 18,863 in this case. All well and 
> good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce 
> output records should be much closer to 22 or 23 (e.g. 0.0012 * 18,863).
> Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in 
> front of the group. It shouldn't push that filter, since the UDF is 
> non-deterministic.
> Quick fix: if you add "-t PushUpFilter" to your command line when invoking 
> pig, this won't happen.
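
For anyone following the arithmetic in the description, here is a rough Python sketch (not Pig code, and not how the optimizer works internally) of why moving the sampling filter in front of the GROUP changes the output size. The function names and the toy data sizes are made up for illustration. The only numbers taken from the issue are 0.0012, 18,863 and 1,428,280.

```python
import random


def sample_after_group(doc_ids, rate, rng):
    """Intended semantics: group first, then keep each group with probability `rate`."""
    groups = set(doc_ids)
    return sum(1 for _ in groups if rng.random() < rate)


def sample_before_group(doc_ids, rate, rng):
    """Pushed-up filter: sample individual records, then group the survivors."""
    kept = [d for d in doc_ids if rng.random() < rate]
    return len(set(kept))


# Toy data (hypothetical sizes): 2,000 documents with 100 records each.
rng = random.Random(0)
n_docs, records_per_doc, rate = 2000, 100, 0.01
doc_ids = [d for d in range(n_docs) for _ in range(records_per_doc)]

correct = sample_after_group(doc_ids, rate, rng)
pushed = sample_before_group(doc_ids, rate, rng)
print("groups when sampling after GROUP: ", correct)
print("groups when sampling before GROUP:", pushed)

# Back-of-the-envelope numbers for the case in the issue.
expected_groups = 0.0012 * 18863      # sampling the grouped output
expected_records = 0.0012 * 1428280   # sampling the raw records instead
print(expected_groups, expected_records)
```

Sampling the raw records keeps roughly 0.0012 * 1,428,280 (about 1,714) of them, and the number of groups left afterwards can be no larger than that, which is consistent with the 1,513 reported rather than the expected 22 or 23.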

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
