Hi Team,

We are facing an issue when we use the IsEmpty UDF with FILTER.

Scenario:
We have two input files:

Input File 1 (first):
1|11|111|1111
2|22|222|2222
3|33|333|3333
4|44|444|4444
5|55|555|5555

Input File 2 (second):
1|a|aa|aaa
2|22|bb|bbb
3|c|cc|ccc
6|d|dd|ddd


Our requirement is that, when these two input files are grouped on their first 
two fields, output is produced only for keys that have data in both files; for 
any other key, nothing should be printed.
From the above input files, only the key (2,22) should produce output, like 
below:

((2,22),{(2,22,222,2222)},{(2,22,bb,bbb)})
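
To make the expected result concrete outside Pig, here is a minimal Python sketch of the grouping and the intended filter. It only models the semantics we expect (group both inputs on the first two fields, keep keys whose two bags are both non-empty); the helper names parse, cogroup and both_present are hypothetical and are not part of Pig.

```python
from collections import defaultdict


def parse(lines):
    """Split each '|'-delimited line into a tuple of string fields."""
    return [tuple(line.split("|")) for line in lines]


# Sample data copied from the two input files above.
first = parse([
    "1|11|111|1111",
    "2|22|222|2222",
    "3|33|333|3333",
    "4|44|444|4444",
    "5|55|555|5555",
])
second = parse([
    "1|a|aa|aaa",
    "2|22|bb|bbb",
    "3|c|cc|ccc",
    "6|d|dd|ddd",
])


def cogroup(left, right):
    """Group both inputs on their first two fields.

    Each key maps to a pair of bags: (rows from left, rows from right).
    A key seen in only one input gets an empty bag for the other.
    """
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[:2]][0].append(row)
    for row in right:
        groups[row[:2]][1].append(row)
    return groups


def both_present(groups):
    """Keep only the keys whose two bags are both non-empty."""
    return {key: bags for key, bags in groups.items() if bags[0] and bags[1]}


result = both_present(cogroup(first, second))
for key, (left_bag, right_bag) in result.items():
    print(key, left_bag, right_bag)
```

With the sample data, this leaves only the group for key (2,22), which is the single line of output we expect from the Pig script.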

To achieve this, we wrote the following code:

first = LOAD 'first' USING PigStorage('|') AS (a:chararray, b:chararray, c:chararray, d:chararray);

second = LOAD 'second' USING PigStorage('|') AS (aa:chararray, bb:chararray, cc:chararray, dd:chararray);

cogroup_join = COGROUP first BY (a,b), second BY (aa,bb);

cogroup_join_filter = FILTER cogroup_join BY NOT IsEmpty(second) AND NOT IsEmpty(first);

dump cogroup_join_filter;

However, the output of cogroup_join_filter is:
((1,a),{},{(1,a,aa,aaa)})
((2,22),{(2,22,222,2222)},{(2,22,bb,bbb)})
((3,c),{},{(3,c,cc,ccc)})
((6,d),{},{(6,d,dd,ddd)})

In my opinion, the filter should have removed every group whose key is not 
present in both input files, leaving only (2,22).
That is not happening: the groups with an empty bag for first are still in the 
output.
Please have a look and share your view on this.

Thanks & Regards,
Pankaj Ojha
