[ https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitriy V. Ryaboy updated PIG-2014:
-----------------------------------

    Attachment: PIG-2014.patch

Implemented the suggested approach to fixing this. Please review.

> SAMPLE shouldn't be pushed up
> -----------------------------
>
>                 Key: PIG-2014
>                 URL: https://issues.apache.org/jira/browse/PIG-2014
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Jacob Perkins
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> grouped = GROUP tfidf_all BY doc_id;
> vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The number of reduce output records should be exactly the number of documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce output records should be much closer to 22 or 23 (i.e. 0.0012 * 18,863). Evidently, Pig rewrites SAMPLE into a FILTER and then pushes that filter in front of the GROUP. It shouldn't push that filter, since the sampling UDF is non-deterministic.
> Quick fix: add "-t PushUpFilter" to the command line when invoking Pig and this won't happen.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
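For reviewers, a sketch of what goes wrong. Pig implements SAMPLE as a FILTER on a random draw, so the script above is conceptually equivalent to the following (the exact internal rewrite may differ; this is illustrative only):

{code:none}
-- Intended plan: one random draw per *group* (vector), after the GROUP.
sampled = FILTER vectors BY RANDOM() <= 0.0012;

-- What PushUpFilter effectively produces: one random draw per *input record*,
-- before the GROUP. Any group that keeps at least one of its records survives.
tfidf_sampled = FILTER tfidf_all BY RANDOM() <= 0.0012;
grouped = GROUP tfidf_sampled BY doc_id;
{code}

With roughly 76 records per group on average (1,428,280 / 18,863), the chance a group loses all its records to a per-record 0.0012 filter is small, so far more than 0.12% of groups survive, which is consistent with the observed 1,513 reduce output records instead of ~23. This is why pushing a filter containing a non-deterministic UDF above a GROUP changes the query's semantics.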