[ 
https://issues.apache.org/jira/browse/PIG-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644485#action_12644485
 ] 

Pradeep Kamath commented on PIG-514:
------------------------------------

The issue is that, for each group in the input data, one of the filters always 
filters out all data, so the POFilter returns a POStatus.STATUS_EOP. The 
POUserFunc sees this EOP, does not call the actual UDF (COUNT() or SUM()), 
and just sends the EOP to POForeach. The POForeach sees this EOP and 
finishes processing that group without outputting any results.
Ideally, for COUNT() and SUM(), POUserFunc should send an empty bag as input so 
that COUNT() can be 0 and SUM() can be null. However, this issue is also present 
in the following code:
{code}
a = load 'bla';
-- this is just an illustration of an aggressive filter which filters every tuple
b = filter a by 2 == 1;
c = foreach b generate myudf($0);
{code}

In the above case, too, myudf() is never called. Is it OK to not call the UDF 
when there is no input to give it (the EOP case)? This causes queries like the one 
in the description to not give the correct COUNT of 0 and SUM of null in cases 
where the input to them is empty. We need to decide how we should handle this 
general case (both for aggregate functions like COUNT and non-aggregate 
functions like myudf()).
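
To make the EOP propagation concrete, here is a rough toy model in Python. It is not Pig's actual code: the function names only mirror POFilter, POUserFunc and POForeach for readability, the filters are simplified to record_type checks, and the empty_bag_on_eop flag is a hypothetical switch standing in for the "send an empty bag" behaviour suggested above.

```python
# Toy model of the operator chain; NOT Pig's real implementation.
# All names here are illustrative assumptions.

STATUS_OK = "STATUS_OK"
STATUS_EOP = "STATUS_EOP"


def po_filter(bag, predicate):
    """Return (status, bag); EOP when the filter removes every tuple."""
    kept = [t for t in bag if predicate(t)]
    return (STATUS_OK, kept) if kept else (STATUS_EOP, None)


def count(bag):
    """COUNT-like UDF: 0 for an empty bag."""
    return len(bag)


def sum_scores(bag):
    """SUM-like UDF: None (null) for an empty bag."""
    return sum(t["scores"] for t in bag) if bag else None


def po_user_func(udf, status, bag, empty_bag_on_eop=False):
    """Call the UDF, or pass EOP through without calling it.

    With empty_bag_on_eop=True (hypothetical), an EOP input is replaced
    by an empty bag so the UDF still runs.
    """
    if status == STATUS_EOP:
        if not empty_bag_on_eop:
            return STATUS_EOP, None
        bag = []
    return STATUS_OK, udf(bag)


def po_foreach(group_key, bag, empty_bag_on_eop=False):
    """Emit one row per group, or nothing if any UDF input reports EOP."""
    filter1 = po_filter(bag, lambda t: t["record_type"] == 1)
    filter2 = po_filter(bag, lambda t: t["record_type"] == 0)
    row = [group_key]
    for udf, (status, filtered) in ((count, filter1), (sum_scores, filter2)):
        status, value = po_user_func(udf, status, filtered, empty_bag_on_eop)
        if status == STATUS_EOP:
            return []  # the group finishes without producing output
        row.append(value)
    return [tuple(row)]
```

In this model, a group whose tuples all have record_type 0 produces no row at all with the default behaviour, because the first filter reports EOP. With empty_bag_on_eop=True the same group produces a row with a count of 0, and a group with only record_type 1 tuples produces a row with a null sum.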

One other case of the COUNT problem is:
{code}
a = load 'emptyfile'; -- load an empty file
-- neither of the statements below ever actually gets executed
b = group a all;
c = foreach b generate COUNT(a);
{code}
When the input data is empty, neither map() nor reduce() gets executed and 
hence COUNT() never gets called.
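
The empty-input case can be pictured with a similar toy model in Python. This is only an analogy, with hypothetical helper names: in Pig the cause is that map() and reduce() are never invoked, whereas here it shows up as a per-group loop that has no groups to iterate over.

```python
# Toy analogy for "group a all" followed by COUNT; NOT Pig's real code.
from collections import defaultdict


def group_all(records):
    """Put every record into a single 'all' group; no records, no group."""
    groups = defaultdict(list)
    for record in records:
        groups["all"].append(record)
    return groups


def foreach_count(groups):
    """Run the COUNT-like step once per group."""
    return [(key, len(bag)) for key, bag in groups.items()]


# Empty input: no group exists, so the count step never runs.
print(foreach_count(group_all([])))
```

With an empty input the result has no rows, rather than a single row holding a count of 0.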


> COUNT returns no results as a result of two filter statements in FOREACH
> ------------------------------------------------------------------------
>
>                 Key: PIG-514
>                 URL: https://issues.apache.org/jira/browse/PIG-514
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Viraj Bhat
>             Fix For: types_branch
>
>         Attachments: mystudentfile.txt
>
>
> The following piece of sample code, a FOREACH which counts the filtered 
> student records based on record_type == 1 and scores, and also on record_type 
> == 0, does not seem to return any results.
> {code}
> mydata = LOAD 'mystudentfile.txt' AS  (record_type,name,age,scores,gpa);
> --keep only what we need
> mydata_filtered = FOREACH mydata GENERATE record_type, name, age, scores;
> --group
> mydata_grouped = GROUP mydata_filtered BY  (record_type,age);
> myfinaldata = FOREACH mydata_grouped {
>      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
>      myfilter2 = FILTER mydata_filtered BY record_type == 0;
>      GENERATE FLATTEN(group),
> -- Only this count causes the problem ??
>       COUNT(myfilter1) as col2,
>       SUM(myfilter2.scores) as col3,
>       COUNT(myfilter2) as col4;  };
> --these set of statements confirm that the count on the  filters returns 1
> --mycountdata = FOREACH mydata_grouped
> --{
> --      myfilter1 = FILTER mydata_filtered BY record_type == 1 AND age == scores;
> --      GENERATE
> --      COUNT(myfilter1) as colcount;
> --};
> --dump mycountdata;
> dump myfinaldata;
> {code}
> But if you comment out the {code} COUNT(myfilter1) as col2, {code} line, it seems 
> to work, with the following results:
> (0,22,45.0,2L)
> (0,24,133.0,6L)
> (0,25,22.0,1L)
> Also, I have tried to verify whether this is an issue with the {code} 
> COUNT(myfilter1) as col2, {code} returning zero. It does not seem to be the 
> case.
> If {code}  dump mycountdata; {code} is uncommented it returns:
> (1L)
> (1L)
> I am attaching the tab-separated 'mystudentfile.txt' file used in this Pig 
> script. Is this an issue with 2 filters in the FOREACH followed by a COUNT on 
> these filters?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
