[
https://issues.apache.org/jira/browse/PIG-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789188#action_12789188
]
Jing Huang commented on PIG-1145:
---------------------------------
found another failure on merge join
This merge join script failed:
register $zebraJar;
--fs -rmr $outputDir
--a1 = LOAD '$inputDir/unsorted1' USING
org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--a2 = LOAD '$inputDir/unsorted2' USING
org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
--sort1 = order a1 by byte2;
--sort2 = order a2 by byte2;
--store sort1 into '$outputDir/100Msortedbyte21' using
org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
--store sort2 into '$outputDir/100Msortedbyte22' using
org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2];[byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using
org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using
org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using "merge" ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2,
$4 as byte2;
store E into '$outputDir/bad1' using
org.apache.hadoop.zebra.pig.TableStorer('');
=========
instead, this similiar script works with the previous patch:
register $zebraJar;
--fs -rmr $outputDir
a1 = LOAD '$inputDir/unsorted1' USING
org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
a2 = LOAD '$inputDir/unsorted2' USING
org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2,byte2');
sort1 = order a1 by byte2;
sort2 = order a2 by byte2;
store sort1 into '$outputDir/100Msortedbyte21' using
org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
store sort2 into '$outputDir/100Msortedbyte22' using
org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2,byte2]');
rec1 = load '$outputDir/100Msortedbyte21' using
org.apache.hadoop.zebra.pig.TableLoader('','sorted');
rec2 = load '$outputDir/100Msortedbyte22' using
org.apache.hadoop.zebra.pig.TableLoader('','sorted');
joina = join rec1 by byte2, rec2 by byte2 using "merge" ;
E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as str2,
$4 as byte2;
store E into '$outputDir/join3' using
org.apache.hadoop.zebra.pig.TableStorer('');
~
================
Here is stack trace:
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error
processing right input during merge join
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:159)
Caused by: java.io.EOFException: No key-value to read
at
org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
at
org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
at
org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
at
org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
at
org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1083)
at
org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
... 9 more
=============
This is how I run it (i disabled pruning to simply the possible problem)
java -cp
/grid/0/dev/hadoopqa/jing1234/conf:/grid/0/dev/hadoopqa/jars/pig.jar:/grid/0/dev/hadoopqa/jars/tfile.jar:/grid/0/dev/hadoopqa/jars/zebra.jar
org.apache.pig.Main -m config -M -t PruneColumns bad_join.pig
> [zebra] merge join on large table ( 100,000.000 rows zebra table) failed
> ------------------------------------------------------------------------
>
> Key: PIG-1145
> URL: https://issues.apache.org/jira/browse/PIG-1145
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.6.0, 0.7.0
> Reporter: Jing Huang
> Assignee: Yan Zhou
> Fix For: 0.6.0, 0.7.0
>
> Attachments: PIG-1145.patch, PIG-1145.patch
>
>
> Pig script :
> register $zebraJar;
> --fs -rmr $outputDir
> a1 = LOAD '$inputDir/unsorted1' USING
> org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
> a2 = LOAD '$inputDir/unsorted2' USING
> org.apache.hadoop.zebra.pig.TableLoader('count,seed,int1,str2');
> sort1 = order a1 by str2;
> sort2 = order a2 by str2;
> --store sort1 into '$outputDir/sorted11' using
> org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
> --store sort2 into '$outputDir/sorted21' using
> org.apache.hadoop.zebra.pig.TableStorer('[count,seed,int1,str2]');
> rec1 = load '$outputDir/sorted11' using
> org.apache.hadoop.zebra.pig.TableLoader();
> rec2 = load '$outputDir/sorted21' using
> org.apache.hadoop.zebra.pig.TableLoader();
> joina = join rec1 by str2, rec2 by str2 using "merge" ;
> --E = foreach joina generate $0 as count, $1 as seed, $2 as int1, $3 as
> str2;
> store joina into '$outputDir/join1' using
> org.apache.hadoop.zebra.pig.TableStorer('');
> ~
>
>
> ~
>
>
> ~
> ======
> stacktrace:
> org.apache.pig.backend.executionengine.ExecException: ERROR 2176: Error
> processing right input during merge join at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.throwProcessingException(POMergeJoin.java:453)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:443)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:337)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.close(PigMapBase.java:107)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at
> org.apache.hadoop.mapred.Child.main(Child.java:159) Caused by:
> java.io.EOFException: No key-value to read at
> org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.checkKey(TFile.java:1590)
> at org.apache.hadoop.zebra.tfile.TFile$Reader$Scanner.entry(TFile.java:1611)
> at
> org.apache.hadoop.zebra.io.ColumnGroup$Reader$TFileScanner.getKey(ColumnGroup.java:854)
> at
> org.apache.hadoop.zebra.io.ColumnGroup$Reader$CGScanner.getCGKey(ColumnGroup.java:1035)
> at
> org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getKey(BasicTable.java:1082)
> at
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableRecordReader.java:105)
> at org.apache.hadoop.zebra.pig.TableLoader.getNext(TableLoader.java:414) at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextRightInp(POMergeJoin.java:415)
> ... 7 more
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.