Multiquery optimization miscalculates the parallelism and results in extra 0
bytes files (Pig 0.7 and 0.8)
----------------------------------------------------------------------------------------------------------
Key: PIG-1724
URL: https://issues.apache.org/jira/browse/PIG-1724
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Fix For: 0.8.0
We have found an issue in Pig 0.7 and Pig 0.8 when multiquery optimization
is used: the script below produces more part files than required. Note that
the GROUP ALL is a dummy in this case.
{code}
record002 = LOAD 'samplepig001.in' AS (id:chararray,num:int);
f_records002 = FILTER record002 BY num != 50000;
group01 = GROUP f_records002 ALL PARALLEL 1;
STORE group01 INTO 'pig_out_direc_SET1';
set2 = FILTER f_records002 BY num!=200002;
set2_Group = GROUP set2 ALL PARALLEL 1;
STORE set2 INTO 'pig_out_direc_SET2';
set3 = FILTER f_records002 BY num!=100001;
set3_Group = GROUP set3 BY id PARALLEL 40;
--set3_Rec4= FILTER set3_Group by num!=5000000;
STORE set3_Group INTO 'pig_out_direc_SET3';
{code}
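When run in Pig 0.8, the script produces the following output. Note that
pig_out_direc_SET1 contains 40 part files even though group01 was declared
with PARALLEL 1.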
{quote}
$ hadoop fs -ls /user/viraj/pig_out_direc_SET1
...
Found 40 items
-rw------- 3 viraj users 0 2010-11-13 02:09
/user/viraj/pig_out_direc_SET1/part-r-00000
...
...
-rw------- 3 viraj users 0 2010-11-13 02:09
/user/viraj/pig_out_direc_SET1/part-r-00039
$ hadoop fs -ls /user/viraj/pig_out_direc_SET2
Found 1 items
-rw------- 3 viraj users 110 2010-11-13 02:08
/user/viraj/pig_out_direc_SET2/part-m-00000
$ hadoop fs -ls /user/viraj/pig_out_direc_SET3
Found 40 items
-rw------- 3 viraj users 0 2010-11-13 02:09
/user/viraj/pig_out_direc_SET3/part-r-00000
...
...
-rw------- 3 viraj users 0 2010-11-13 02:09
/user/viraj/pig_out_direc_SET3/part-r-00039
{quote}
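To make the check above repeatable, here is a minimal sketch that counts part files and zero-byte part files in captured `hadoop fs -ls` output. It is not part of Pig or Hadoop: the function name `summarize_parts` is hypothetical, and it assumes the usual eight-column listing with one file per line (the listings above are wrapped by the mail client). The `sample` text is illustrative, built from two lines shown in this report.

```python
import re

# Matches Hadoop part files such as part-r-00000 (reduce) or part-m-00000 (map).
PART_FILE = re.compile(r"/part-[mr]-\d+$")


def summarize_parts(ls_output):
    """Return (total_part_files, zero_byte_part_files) for `hadoop fs -ls` text.

    Hypothetical helper: assumes each file entry has eight whitespace-separated
    fields (perms, replication, owner, group, size, date, time, path).
    """
    total = 0
    empty = 0
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip "Found N items" headers and anything that is not a part file.
        if len(fields) != 8 or not PART_FILE.search(fields[7]):
            continue
        total += 1
        if int(fields[4]) == 0:
            empty += 1
    return total, empty


# Illustrative input only: two entries copied from the listings in this report.
sample = """Found 2 items
-rw-------   3 viraj users          0 2010-11-13 02:09 /user/viraj/pig_out_direc_SET1/part-r-00000
-rw-------   3 viraj users        110 2010-11-13 02:08 /user/viraj/pig_out_direc_SET2/part-m-00000
"""

print(summarize_parts(sample))
```

Running this over the listing of an output directory whose GROUP used PARALLEL 1 should report a single part file; a larger total with many zero-byte entries matches the symptom described here.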
Viraj