[jira] Created: (PIG-1724) Multiquery optimization miscalculates the parallelism and results in extra 0 bytes files (Pig 0.7 and 0.8)

Viraj Bhat (JIRA) Fri, 12 Nov 2010 18:36:40 -0800

Multiquery optimization miscalculates the parallelism and results in extra 0 
bytes files (Pig 0.7 and 0.8)
----------------------------------------------------------------------------------------------------------


                 Key: PIG-1724
                 URL: https://issues.apache.org/jira/browse/PIG-1724
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.7.0
            Reporter: Viraj Bhat
             Fix For: 0.8.0


We have found an issue with Pig 0.7 and Pig 0.8 when using multiquery 
optimization: it produces more part files than required. In the script below, 
group01 is declared with PARALLEL 1, yet its output directory contains 40 part 
files. Note that the GROUP ALL statements are dummies in this case.


{code}
record002 = LOAD 'samplepig001.in' AS (id:chararray,num:int);
f_records002 = FILTER record002 BY num != 50000;
group01 = GROUP f_records002 ALL PARALLEL 1;
STORE group01 INTO 'pig_out_direc_SET1';


set2 = FILTER f_records002 BY num!=200002;
set2_Group = GROUP set2 ALL PARALLEL 1;
STORE set2 INTO 'pig_out_direc_SET2';

set3 = FILTER f_records002 BY num!=100001;
set3_Group = GROUP set3 BY id PARALLEL 40;
--set3_Rec4= FILTER set3_Group by num!=5000000;
STORE set3_Group INTO 'pig_out_direc_SET3';
{code}


When run on Pig 0.8, the script produces the following output.

{quote}
$ hadoop fs -ls /user/viraj/pig_out_direc_SET1
...
Found 40 items
-rw-------   3 viraj users          0 2010-11-13 02:09 
/user/viraj/pig_out_direc_SET1/part-r-00000
...
...
-rw-------   3 viraj users          0 2010-11-13 02:09 
/user/viraj/pig_out_direc_SET1/part-r-00039

$ hadoop fs -ls /user/viraj/pig_out_direc_SET2
Found 1 items
-rw-------   3 viraj users        110 2010-11-13 02:08 
/user/viraj/pig_out_direc_SET2/part-m-00000


$ hadoop fs -ls /user/viraj/pig_out_direc_SET3
Found 40 items
-rw-------   3 viraj users          0 2010-11-13 02:09 
/user/viraj/pig_out_direc_SET3/part-r-00000
...
...
-rw-------   3 viraj users          0 2010-11-13 02:09 
/user/viraj/pig_out_direc_SET3/part-r-00039

{quote}
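To check an output directory for this symptom without reading the listing by eye, the zero-byte part files can be counted from the `hadoop fs -ls` output. The following is a minimal sketch, not part of Pig or Hadoop: `count_empty_parts` is a hypothetical helper name, and it assumes the listing has one file per line with the size in column 5 and the path in column 8, as in the output above.

```shell
# Hypothetical helper (not part of Pig or Hadoop): counts zero-byte part files
# in `hadoop fs -ls` output read from stdin.
# Assumption: one file per line, size in column 5, path in column 8.
count_empty_parts() {
  awk '$5 == 0 && $8 ~ /\/part-[mr]-[0-9]+$/ { n++ } END { print n + 0 }'
}

# Example usage against a directory from this report:
# hadoop fs -ls /user/viraj/pig_out_direc_SET1 | count_empty_parts
```

A non-zero count for a directory written by a GROUP ... PARALLEL 1 statement would suggest the extra reducers described in this issue.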


Viraj

