William Watson created PIG-5208:
-----------------------------------

             Summary: Two HBase Loads Followed By a Merge Join Fails in 
Mapreduce or Tez Mode
                 Key: PIG-5208
                 URL: https://issues.apache.org/jira/browse/PIG-5208
             Project: Pig
          Issue Type: Bug
            Reporter: William Watson


I posted this issue to the mailing list awhile back and didn't get a response. 
Today, I picked this back up, tried on Tez instead of Mapreduce and got the 
same error. In local mode, this works. As far as I can tell, I've been able to 
replicate this enough that I feel this is a real bug in pig.

Here's the original mailing list post with all the details I have from the 
original time I documented this error: 
https://www.mail-archive.com/user@pig.apache.org/msg10553.html

Here's the stack trace from my tez run today: {code}
2084439 [main] ERROR org.apache.pig.tools.grunt.GruntParser  - ERROR 2998: 
Unhandled internal error. Vertex failed, vertexName=scope-1797, 
vertexId=vertex_1490968035192_0008_1_01, diagnostics=[Task failed, 
taskId=task_1490968035192_0008_1_01_000000, diagnostics=[TaskAttempt 0 failed, 
info=[Error: Error while running task ( failure ) : 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 Error while doing final merge
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:318)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:285)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: 
org.apache.pig.backend.hadoop.hbase.TableSplitComparable cannot be cast to 
org.apache.hadoop.hbase.mapreduce.TableSplit
        at 
org.apache.pig.backend.hadoop.hbase.TableSplitComparable.compareTo(TableSplitComparable.java:26)
        at org.apache.pig.data.DataType.compare(DataType.java:566)
        at org.apache.pig.data.DataType.compare(DataType.java:464)
        at 
org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareDatum(BinInterSedes.java:1106)
        at 
org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:1082)
        at 
org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinSedesTuple(BinInterSedes.java:787)
        at 
org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:728)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleSortComparator.compare(PigTupleSortComparator.java:100)
        at 
org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.lessThan(TezMerger.java:684)
        at org.apache.hadoop.util.PriorityQueue.upHeap(PriorityQueue.java:128)
        at org.apache.hadoop.util.PriorityQueue.put(PriorityQueue.java:55)
        at 
org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.merge(TezMerger.java:783)
        at 
org.apache.tez.runtime.library.common.sort.impl.TezMerger$MergeQueue.merge(TezMerger.java:694)
        at 
org.apache.tez.runtime.library.common.sort.impl.TezMerger.merge(TezMerger.java:150)
        at 
org.apache.tez.runtime.library.common.sort.impl.TezMerger.merge(TezMerger.java:132)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.finalMerge(MergeManager.java:1124)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.close(MergeManager.java:583)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:314)
        ... 6 more
{code}

And here's the test script I was using with the names of tables and columns 
changed: {code}
side_a = LOAD 'hbase://ads' USING
          org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'cf1:user_id cf1:ad_id',
            '-minTimestamp=1470024000000 -maxTimestamp=1491019199000 
-regex=\\\\|agds=(156)\\\\|'
          ) AS (user_id:chararray, ad_id:chararray);
side_a = FILTER side_a BY ad_id == '440';

side_b = LOAD 'hbase://ads' USING
          org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'cf1:user_id cf1:ad_id',
            '-minTimestamp=1470024000000 -maxTimestamp=1491019199000 
-regex=\\\\|agds=(156)\\\\|'
          ) AS (user_id:chararray, ad_id:chararray);
side_b = FILTER side_b BY ad_id == '439';

side_b = JOIN
              side_a BY user_id,
              side_b BY user_id
               USING 'merge';
after_merge_join = FOREACH side_b GENERATE
                side_b::user_id;

STORE after_merge_join
  INTO 'hbase://results'
  USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('', '');

{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to