[
https://issues.apache.org/jira/browse/PIG-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shravan Matthur Narayanamurthy updated PIG-425:
-----------------------------------------------
Status: Patch Available (was: Open)
The MRCompiler currently tries to pack as many operators as possible into a
single phase. So when we have two cogroups one after the other, the LR (local
rearrange) in the second cogroup gets pushed into the reducer. Since the store
just stores away the LR's output, if we load it and pass it to the GR (global
rearrange) we should be just fine.
However, since IndexedTuple isn't implemented as a new kind of Tuple with a
Factory, the Load on the other side tries to load a DefaultTuple from an
IndexedTuple and incidentally succeeds due to the way IndexedTuple is
serialized. However, this can't be carried any further and when the mapper
tries to collect the IndexedTuple, it fails.
The fix I have is threefold. First, I have modified IndexedTuple's
serialization to suit the solution. Second, I have made IndexedTuple a type of
tuple by writing a different value to the marker byte, indicating that this is
an IndexedTuple (like we identify null and non-null tuples). Third, I have
modified DataReaderWriter's readDatum method to check if we have an
IndexedTuple and process it according to IndexedTuple's serialization format.
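The marker-byte idea can be sketched roughly as below. This is only an illustration, not the code in 425.patch: the class names (MarkerByteSketch, PlainTuple, IndexedTupleSketch), the marker values and the field layout are all made up for the example and do not match Pig's real classes or constants. The point is that the writer tags each record with its kind, and the reader dispatches on that tag, so an indexed tuple comes back as an indexed tuple instead of a plain one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MarkerByteSketch {
    // Hypothetical marker values; Pig's actual constants may differ.
    static final byte NULL_TUPLE = 0;
    static final byte PLAIN_TUPLE = 1;
    static final byte INDEXED_TUPLE = 2;

    // Stand-in for a default tuple: just a list of string fields.
    static class PlainTuple {
        final List<String> fields = new ArrayList<>();
    }

    // Stand-in for IndexedTuple: a tuple plus the index of its input.
    static class IndexedTupleSketch extends PlainTuple {
        int index;
    }

    // Writer side: the first byte says which kind of tuple follows.
    static void write(DataOutput out, PlainTuple t) throws IOException {
        if (t == null) {
            out.writeByte(NULL_TUPLE);
            return;
        }
        if (t instanceof IndexedTupleSketch) {
            out.writeByte(INDEXED_TUPLE);
            out.writeInt(((IndexedTupleSketch) t).index);
        } else {
            out.writeByte(PLAIN_TUPLE);
        }
        out.writeInt(t.fields.size());
        for (String f : t.fields) {
            out.writeUTF(f);
        }
    }

    // Reader side: dispatch on the marker byte before reading the fields.
    static PlainTuple read(DataInput in) throws IOException {
        byte marker = in.readByte();
        PlainTuple t;
        switch (marker) {
            case NULL_TUPLE:
                return null;
            case INDEXED_TUPLE: {
                IndexedTupleSketch it = new IndexedTupleSketch();
                it.index = in.readInt();
                t = it;
                break;
            }
            case PLAIN_TUPLE:
                t = new PlainTuple();
                break;
            default:
                throw new IOException("Unknown marker byte: " + marker);
        }
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            t.fields.add(in.readUTF());
        }
        return t;
    }

    // Serializes a tuple to bytes and reads it back.
    static PlainTuple roundTrip(PlainTuple t) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), t);
        return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    }

    public static void main(String[] args) throws IOException {
        IndexedTupleSketch t = new IndexedTupleSketch();
        t.index = 1;
        t.fields.add("alice");
        PlainTuple back = roundTrip(t);
        // The concrete type survives the round trip, so a later cast is safe.
        System.out.println(back instanceof IndexedTupleSketch);
    }
}
```

Without the marker dispatch, the reader in this sketch would have no way to tell the two kinds apart and would have to hand back a plain tuple, which is the situation that leads to the ClassCastException in the mapper's collect.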
To try this out, I have removed the RearrangeAdjuster from MRCompiler to see
if my hypothesis is correct. The unit tests passed, except for the MRCompiler
tests, which fail due to GoldenPlan issues. We need to run all the end-to-end
tests against this patch and confirm that it works.
> Split -> distinct or order -> cogroup fails
> -------------------------------------------
>
> Key: PIG-425
> URL: https://issues.apache.org/jira/browse/PIG-425
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Shravan Matthur Narayanamurthy
> Priority: Critical
> Fix For: types_branch
>
> Attachments: 425.patch
>
>
> A script like:
> {code}
> a = load 'myfile' as (name:chararray, age:int, gpa:double);
> split a into a1 if age > 50, a2 if name < 'm';
> b2 = distinct a2;
> b1 = order a1 by name;
> c = cogroup b2 by name, b1 by name;
> d = foreach c generate flatten(group), COUNT($1), COUNT($2);
> store d into 'OUTPATH';
> {code}
> Will abort with the error:
> {code}
> 08/09/09 11:46:50 ERROR mapReduceLayer.Launcher: Error message from task
> (map) tip_200809080906_0185_m_000000java.lang.ClassCastException:
> org.apache.pig.data.DefaultTuple cannot be cast to
> org.apache.pig.data.IndexedTuple
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:81)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:135)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:75)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> {code}
> The issue is that the RearrangeAdjuster in MRCompiler is not properly seeing
> this as a cogroup and moving the localrearrange out of the reduce and into the
> map.