Filtering bag in nested foreach does not produce expected results
-----------------------------------------------------------------

Key: PIG-710
URL: https://issues.apache.org/jira/browse/PIG-710
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

I have an idiom I used to use in older versions of pig (prior to types branch) which would group into a collection and then filter the output if any of the collection contained a particular string.

This relies on FILTER statements within a FOREACH ... { ... GENERATE ... } statement.

ORDER ... BY in the FOREACH ... { ... GENERATE ... } statement does not seem to have a problem so it seems to be something isolated to the FILTER.

{code}
A = load 'filterbug.data' using PigStorage() as ( id, str );

B = group A by ( id );
describe B;
dump B;

D = foreach B generate
        group,
        COUNT(A),
        A.str;
describe D;
dump D;

C = foreach B {
        D = order A by str;
        matchedcount = COUNT(D);
        generate
                group,
                matchedcount as matchedcount,
                D.str;
        };
describe C;
dump C;

Cfiltered = foreach B {
        D = filter A by ( str matches 'hello' );
        matchedcount = COUNT(D);
        generate
                group,
                matchedcount as matchedcount,
                A.str;
        };
describe Cfiltered;
dump Cfiltered;
{code}

Here's the output:

{code}
-bash-3.00$ pig -exectype local -latest filterbug.pig
USING: /grid/0/gs/pig/current
B: {group: bytearray,A: {id: bytearray,str: bytearray}}
2009-03-10 03:14:14,838 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:14,839 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,{(a,hello),(a,goodbye)})
(b,{(b,goodbye)})
(c,{(c,hello),(c,hello),(c,hello)})
(d,{(d,what)})
D: {group: bytearray,long,str: {str: bytearray}}
2009-03-10 03:14:14,920 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:14,920 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,2L,{(hello),(goodbye)})
(b,1L,{(goodbye)})
(c,3L,{(hello),(hello),(hello)})
(d,1L,{(what)})
C: {group: bytearray,matchedcount: long,str: {str: bytearray}}
2009-03-10 03:14:14,985 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:14,985 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,2L,{(goodbye),(hello)})
(b,1L,{(goodbye)})
(c,3L,{(hello),(hello),(hello)})
(d,1L,{(what)})
2009-03-10 03:14:15,018 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
Cfiltered: {group: bytearray,matchedcount: long,str: {str: bytearray}}
2009-03-10 03:14:15,044 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
2009-03-10 03:14:15,057 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:15,057 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,1L,{(hello),(goodbye)})
{code}

What I expect for the output of Cfiltered is actually:

(a,1L,{(hello),(goodbye)})
(b,0L,{(goodbye)})
(c,3L,{(hello),(hello),(hello)})
(d,0L,{(what)})

The data file is:

{code}
a	hello
a	goodbye
b	goodbye
c	hello
c	hello
c	hello
d	what
{code}
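
For reference, the expected Cfiltered result can be reproduced outside Pig. The Python sketch below is only an illustrative model of the intended semantics, not Pig's implementation: the names `ROWS` and `nested_filter_count` are hypothetical, and it assumes that `matches` requires the whole string to match the pattern. It groups the rows by id, counts the strings in each group that match, and keeps every string of the group, including groups where the count is zero.

```python
import re
from collections import OrderedDict

# Rows of filterbug.data as (id, str) pairs, in file order.
ROWS = [
    ("a", "hello"), ("a", "goodbye"),
    ("b", "goodbye"),
    ("c", "hello"), ("c", "hello"), ("c", "hello"),
    ("d", "what"),
]


def nested_filter_count(rows, pattern):
    """Model of the Cfiltered statement (hypothetical helper, not a Pig API).

    Groups rows by id, then for each group returns
    (id, number of strs fully matching pattern, all strs in the group).
    """
    groups = OrderedDict()
    for key, value in rows:
        groups.setdefault(key, []).append(value)

    regex = re.compile(pattern)
    return [
        # Assumption: the match must cover the whole string.
        (key, sum(1 for v in values if regex.fullmatch(v)), values)
        for key, values in groups.items()
    ]


if __name__ == "__main__":
    for key, matched, values in nested_filter_count(ROWS, "hello"):
        print(key, matched, values)
```

In this model, groups b and d are still emitted with a count of 0, which corresponds to the rows missing from the actual Cfiltered output.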