[ https://issues.apache.org/jira/browse/PIG-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Derek Wollenstein updated PIG-2178:
-----------------------------------

    Description: 
Pig generates a plan that eliminates half of the input data when using FILTER BY.

To better illustrate, I created a small test case.
1. Create a file in HDFS called "/testinput"
   The contents of the file should be:
"1\ta\taline\n1\tb\tbline"
2. Run the following pig script:
ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, 
child_id:chararray, value:chararray);
-- Split into two inputs based on the value of child_id
A = FILTER ORIG BY child_id =='a';
B = FILTER ORIG BY child_id =='b';
-- Project out the column which chooses the correct data set
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
-- Merge both datasets by parent id
ABMERGE = JOIN APROJ by parent_id FULL OUTER, BPROJ by parent_id;
-- Project the result
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, 
APROJ::value,BPROJ::value;
DUMP ABPROJ;
3. The resulting tuple will be
(1,aline,aline)
   The expected tuple is
(1,aline,bline)


  was:
Pig is generating a plan that eliminates half of input data when using FILTER BY

To better illustarte, I created a small test case.
1. Create a file in HDFS called "/testinput"
   The contents of the file should be:
"1\ta\taline\n1\tb\tbline"
2. Run the following pig script:
ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, 
child_id:chararray, value:chararray);
-- Split into two inputs based on the value of child_id
A = FILTER ORIG BY child_id =='a';
B = FILTER ORIG BY child_id =='b';
-- Project out the column which chooses the correct data set
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
-- Merge both datasets by parent id
ABMERGE = JOIN APROJ by parent_id FULL OUTER, BPROJ by parent_id;
-- Project the result
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, 
APROJ::value,BPROJ::value;
DUMP ABPROJ;
3. The resulting tuple will be
(1,aline,aline)



> Filtering a source and then merging the filtered rows only generates data 
> from one half of the filtering
> --------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-2178
>                 URL: https://issues.apache.org/jira/browse/PIG-2178
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.1
>            Reporter: Derek Wollenstein
>
> Pig generates a plan that eliminates half of the input data when using FILTER 
> BY.
>
> To better illustrate, I created a small test case.
> 1. Create a file in HDFS called "/testinput"
>    The contents of the file should be:
> "1\ta\taline\n1\tb\tbline"
> 2. Run the following pig script:
> ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, 
> child_id:chararray, value:chararray);
> -- Split into two inputs based on the value of child_id
> A = FILTER ORIG BY child_id =='a';
> B = FILTER ORIG BY child_id =='b';
> -- Project out the column which chooses the correct data set
> APROJ = FOREACH A GENERATE parent_id, value;
> BPROJ = FOREACH B GENERATE parent_id, value;
> -- Merge both datasets by parent id
> ABMERGE = JOIN APROJ by parent_id FULL OUTER, BPROJ by parent_id;
> -- Project the result
> ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, 
> APROJ::value,BPROJ::value;
> DUMP ABPROJ;
> 3. The resulting tuple will be
> (1,aline,aline)
>    The expected tuple is
> (1,aline,bline)
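
For reference, the result the script is meant to produce can be checked with a small model in plain Python. This is only a sketch of the intended semantics (filter, project, full outer join on parent_id), not of Pig's planner; the helper name full_outer_join and the in-memory rows are assumptions made for illustration.

```python
# Plain-Python model of the script's intended semantics (not Pig itself).
# The input mirrors the two-line test file from step 1.
rows = [line.split("\t") for line in "1\ta\taline\n1\tb\tbline".split("\n")]

# A / B: FILTER ORIG BY child_id; APROJ / BPROJ: keep (parent_id, value).
aproj = [(p, v) for p, c, v in rows if c == "a"]
bproj = [(p, v) for p, c, v in rows if c == "b"]


def full_outer_join(left, right):
    """Hypothetical helper: full outer join on the first field.

    Returns (left parent_id, left value, right value), matching the
    final projection in the script; an unmatched side yields None.
    """
    joined = []
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    for key in keys:
        lrows = [r for r in left if r[0] == key] or [(None, None)]
        rrows = [r for r in right if r[0] == key] or [(None, None)]
        joined.extend((lp, lv, rv) for lp, lv in lrows for _, rv in rrows)
    return joined


abproj = full_outer_join(aproj, bproj)
print(abproj)  # [('1', 'aline', 'bline')]
```

Under these semantics the single output tuple carries the value from each filtered branch, so the (1,aline,aline) reported above means the second filter branch is being replaced by the first.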

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
