[ 
https://issues.apache.org/jira/browse/PIG-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-425:
-----------------------------------------------

    Status: Patch Available  (was: Open)

The MRCompiler currently tries to pack as many operators as possible into a 
single phase. So when we have two cogroups one after the other, the LR (local 
rearrange) in the second cogroup gets pushed into the reducer. Since the store 
just writes out the LR's output, loading it and passing it to the GR (global 
rearrange) should work fine.

However, since IndexedTuple isn't implemented as a new kind of Tuple with a 
Factory, the Load on the other side tries to load a DefaultTuple from an 
IndexedTuple and incidentally succeeds due to the way IndexedTuple is 
serialized. However, this can't be carried any further and when the mapper 
tries to collect the IndexedTuple, it fails.

The fix I have is threefold. First, I have modified IndexedTuple's 
serialization to suit the solution. Second, I have made IndexedTuple a distinct 
type of tuple by writing a different value to the marker byte to indicate an 
IndexedTuple (just as we distinguish null and non-null tuples). Third, I have 
modified DataReaderWriter's readDatum method to check whether we have an 
IndexedTuple and, if so, process it according to IndexedTuple's serialization 
format.
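
The marker-byte idea can be sketched roughly as follows. This is only an 
illustration, not the code in 425.patch: SketchTuple, SketchIndexedTuple, 
MarkerByteSketch, the marker values, and the field layout are all hypothetical 
stand-ins, and Pig's real IndexedTuple and DataReaderWriter differ in detail. 
The point is that the writer tags each record with its concrete type, so the 
reader can rebuild an IndexedTuple rather than a plain tuple.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT Pig's real classes.
class SketchTuple {
    final List<String> fields = new ArrayList<>();
}

class SketchIndexedTuple extends SketchTuple {
    int index; // which cogroup input the tuple came from
}

class MarkerByteSketch {
    // Assumed marker values, chosen only for this sketch.
    static final byte PLAIN_TUPLE = 1;
    static final byte INDEXED_TUPLE = 2;

    // Write a marker byte first so the reader knows which type follows.
    static void writeDatum(DataOutput out, SketchTuple t) throws IOException {
        if (t instanceof SketchIndexedTuple) {
            out.writeByte(INDEXED_TUPLE);
            out.writeInt(((SketchIndexedTuple) t).index);
        } else {
            out.writeByte(PLAIN_TUPLE);
        }
        out.writeInt(t.fields.size());
        for (String f : t.fields) {
            out.writeUTF(f);
        }
    }

    // Dispatch on the marker byte and rebuild the matching tuple type.
    static SketchTuple readDatum(DataInput in) throws IOException {
        byte marker = in.readByte();
        SketchTuple t;
        if (marker == INDEXED_TUPLE) {
            SketchIndexedTuple it = new SketchIndexedTuple();
            it.index = in.readInt();
            t = it;
        } else if (marker == PLAIN_TUPLE) {
            t = new SketchTuple();
        } else {
            throw new IOException("Unknown marker byte: " + marker);
        }
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            t.fields.add(in.readUTF());
        }
        return t;
    }

    // Serialize to an in-memory buffer and read it back.
    static SketchTuple roundTrip(SketchTuple t) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeDatum(new DataOutputStream(buf), t);
            return readDatum(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a scheme like this, a SketchIndexedTuple passed through roundTrip comes 
back as a SketchIndexedTuple with its index intact, so a later cast to the 
indexed type would not fail the way the DefaultTuple cast does in the reported 
stack trace.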

To test this hypothesis, I removed the RearrangeAdjuster from MRCompiler. The 
unit tests passed except for the MRCompiler tests, which failed due to 
GoldenPlan issues. We need to run all the end-to-end tests against this patch 
to confirm that it works.

> Split -> distinct or order -> cogroup fails
> -------------------------------------------
>
>                 Key: PIG-425
>                 URL: https://issues.apache.org/jira/browse/PIG-425
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Shravan Matthur Narayanamurthy
>            Priority: Critical
>             Fix For: types_branch
>
>         Attachments: 425.patch
>
>
> A script like:
> {code}
> a = load 'myfile' as (name:chararray, age:int, gpa:double);
> split a into a1 if age > 50, a2 if name < 'm';
> b2 = distinct a2;
> b1 = order a1 by name;
> c = cogroup b2 by name, b1 by name;
> d = foreach c generate flatten(group), COUNT($1), COUNT($2);
> store d into 'OUTPATH';
> {code}
> Will abort with the error:
> {code}
> 08/09/09 11:46:50 ERROR mapReduceLayer.Launcher: Error message from task 
> (map) tip_200809080906_0185_m_000000java.lang.ClassCastException: 
> org.apache.pig.data.DefaultTuple cannot be cast to 
> org.apache.pig.data.IndexedTuple
>     at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:81)
>     at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:135)
>     at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:75)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> {code}
> The issue is that the RearrangeAdjuster in MRCompiler is not properly seeing 
> this as a cogroup and moving the local rearrange out of the reduce and into 
> the map.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
