[ https://issues.apache.org/jira/browse/PIG-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Derek Wollenstein updated PIG-2178:
-----------------------------------
    Description:
Pig is generating a plan that eliminates half of input data when using FILTER BY

To better illustrate, I created a small test case.
1. Create a file in HDFS called "/testinput"
   The contents of the file should be:
   "1\ta\taline\n1\tb\tbline"
2. Run the following pig script:
   ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id:chararray, child_id:chararray, value:chararray);
   -- Split into two inputs based on the value of child_id
   A = FILTER ORIG BY child_id == 'a';
   B = FILTER ORIG BY child_id == 'b';
   -- Project out the column which chooses the correct data set
   APROJ = FOREACH A GENERATE parent_id, value;
   BPROJ = FOREACH B GENERATE parent_id, value;
   -- Merge both datasets by parent id
   ABMERGE = JOIN APROJ BY parent_id FULL OUTER, BPROJ BY parent_id;
   -- Project the result
   ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value, BPROJ::value;
   DUMP ABPROJ;
3. The resulting tuple will be
   (1,aline,aline)

> Filtering a source and then merging the filtered rows only generates data from one half of the filtering
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2178
>                 URL: https://issues.apache.org/jira/browse/PIG-2178
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.1
>            Reporter: Derek Wollenstein
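
The report states the observed tuple but not the expected one. As a rough cross-check, the test case can be modeled in plain Python, assuming ordinary filter, projection, and full outer join semantics. This is only a sketch of what the script is meant to compute, not Pig's implementation; the helper names (`load`, `full_outer_join`, `run`) are hypothetical. Under these assumptions the result is (1,aline,bline) rather than the reported (1,aline,aline).

```python
from collections import defaultdict

# Contents of /testinput from the report: two tab-separated rows.
RAW = "1\ta\taline\n1\tb\tbline"


def load(raw):
    """Model of LOAD ... USING PigStorage(): rows split on newline, fields on tab."""
    return [tuple(line.split("\t")) for line in raw.split("\n") if line]


def full_outer_join(left, right):
    """Full outer join on field 0; a missing side is padded with None (null)."""
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for row in left:
        left_by_key[row[0]].append(row)
    for row in right:
        right_by_key[row[0]].append(row)

    joined = []
    for key in sorted(set(left_by_key) | set(right_by_key)):
        l_rows = left_by_key.get(key) or [(None, None)]
        r_rows = right_by_key.get(key) or [(None, None)]
        for l_row in l_rows:
            for r_row in r_rows:
                joined.append(l_row + r_row)
    return joined


def run(raw):
    orig = load(raw)                              # (parent_id, child_id, value)
    a = [r for r in orig if r[1] == "a"]          # A = FILTER ORIG BY child_id == 'a'
    b = [r for r in orig if r[1] == "b"]          # B = FILTER ORIG BY child_id == 'b'
    aproj = [(r[0], r[2]) for r in a]             # APROJ: parent_id, value
    bproj = [(r[0], r[2]) for r in b]             # BPROJ: parent_id, value
    abmerge = full_outer_join(aproj, bproj)       # ABMERGE: join on parent_id
    # ABPROJ: APROJ::parent_id, APROJ::value, BPROJ::value
    return [(r[0], r[1], r[3]) for r in abmerge]


print(run(RAW))  # [('1', 'aline', 'bline')]
```

The two filters are kept as separate lists on purpose: each branch of the join should only ever see the rows selected by its own predicate, which is the property the reported output violates.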