William Watson created PIG-5208:
-----------------------------------
Summary: Two HBase Loads Followed By a Merge Join Fails in Mapreduce or Tez Mode
Key: PIG-5208
URL: https://issues.apache.org/jira/browse/PIG-5208
Project: Pig
Issue Type: Bug
Reporter: William Watson
I posted this issue to the mailing list a while back and didn't get a response.
Today I picked it back up, tried Tez instead of MapReduce, and got the
same error. In local mode, the script works. I've been able to reproduce
this consistently enough that I believe it is a real bug in Pig.
Here's the original mailing list post, with all the details from when I
first documented this error:
https://www.mail-archive.com/[email protected]/msg10553.html
Here's the stack trace from my Tez run today: {code}
2084439 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2998:
Unhandled internal error. Vertex failed, vertexName=scope-1797,
vertexId=vertex_1490968035192_0008_1_01, diagnostics=[Task failed,
taskId=task_1490968035192_0008_1_01_000000, diagnostics=[TaskAttempt 0 failed,
info=[Error: Error while running task ( failure ) :
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
Error while doing final merge
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:318)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.hbase.TableSplitComparable cannot be cast to org.apache.hadoop.hbase.mapreduce.TableSplit
at org.apache.pig.backend.hadoop.hbase.TableSplitComparable.compareTo(TableSplitComparable.java:26)
at org.apache.pig.data.DataType.compare(DataType.java:566)
at org.apache.pig.data.DataType.compare(DataType.java:464)
at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareDatum(BinInterSedes.java:1106)
at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:1082)
at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinSedesTuple(BinInterSedes.java:787)
at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:728)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleSortComparator.compare(PigTupleSortComparator.java:100)
at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.lessThan(TezMerger.java:684)
at org.apache.hadoop.util.PriorityQueue.upHeap(PriorityQueue.java:128)
at org.apache.hadoop.util.PriorityQueue.put(PriorityQueue.java:55)
at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.merge(TezMerger.java:783)
at org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.merge(TezMerger.java:694)
at org.apache.tez.runtime.library.common.sort.impl.TezMerger.merge(TezMerger.java:150)
at org.apache.tez.runtime.library.common.sort.impl.TezMerger.merge(TezMerger.java:132)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.finalMerge(MergeManager.java:1124)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:583)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:314)
... 6 more
{code}
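One plausible reading of the "Caused by" frames above is that the shuffle sort compares two TableSplitComparable keys through a generic comparator, and compareTo then tries to treat its argument as a TableSplit. This has not been confirmed against Pig's source. The sketch below only illustrates how that kind of ClassCastException can arise in Java when a wrapper is declared comparable to the wrapped type rather than to itself. The classes Split, SplitComparable and CastDemo are hypothetical stand-ins, not the real HBase or Pig classes.

```java
// Hypothetical stand-ins for illustration only; NOT the real HBase/Pig classes.
final class CastDemo {

    /** Stand-in for a split type that can be ordered against its own kind. */
    static class Split implements Comparable<Split> {
        final String startRow;
        Split(String startRow) { this.startRow = startRow; }
        @Override
        public int compareTo(Split other) { return startRow.compareTo(other.startRow); }
    }

    /** Stand-in wrapper declared comparable to the wrapped type, not to itself. */
    static class SplitComparable implements Comparable<Split> {
        final Split split;
        SplitComparable(Split split) { this.split = split; }
        @Override
        public int compareTo(Split other) { return split.compareTo(other); }
    }

    /** Compares two values through the raw Comparable interface, as a generic comparator might. */
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String compareRaw(Object a, Object b) {
        try {
            return Integer.toString(Integer.signum(((Comparable) a).compareTo(b)));
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        SplitComparable left = new SplitComparable(new Split("a"));
        SplitComparable right = new SplitComparable(new Split("b"));
        // Wrapper vs. wrapper: the compiler-generated bridge method casts
        // 'right' to Split, which fails at runtime.
        System.out.println(compareRaw(left, right));
        // Wrapper vs. bare split: the cast succeeds and the comparison runs.
        System.out.println(compareRaw(left, right.split));
    }
}
```

If this is what is happening, it might also explain why local mode works: the failing comparison in the trace occurs during the shuffle merge, which a local run may not exercise in the same way. That is a guess, not something verified here.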
And here's the test script I was using with the names of tables and columns
changed: {code}
side_a = LOAD 'hbase://ads' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cf1:user_id cf1:ad_id',
'-minTimestamp=1470024000000 -maxTimestamp=1491019199000
-regex=\\\\|agds=(156)\\\\|'
) AS (user_id:chararray, ad_id:chararray);
side_a = FILTER side_a BY ad_id == '440';
side_b = LOAD 'hbase://ads' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'cf1:user_id cf1:ad_id',
'-minTimestamp=1470024000000 -maxTimestamp=1491019199000
-regex=\\\\|agds=(156)\\\\|'
) AS (user_id:chararray, ad_id:chararray);
side_b = FILTER side_b BY ad_id == '439';
side_b = JOIN
side_a BY user_id,
side_b BY user_id
USING 'merge';
after_merge_join = FOREACH side_b GENERATE
side_b::user_id;
STORE after_merge_join
INTO 'hbase://results'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('', '');
{code}