[ https://issues.apache.org/jira/browse/PIG-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-4345:
----------------------------------
    Attachment: PIG-4345_1.patch

Thanks for the comment, [~rohini]. I have uploaded PIG-4345_1.patch. The changes are:
{code}
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
-b = foreach (group a all) generate FLATTEN(myfuncs.AppendIndex(a));
+b = foreach (group a all){
+ a1 = order a by name,age,gpa;
+ generate FLATTEN(myfuncs.AppendIndex(a1));
+}
{code}
When the bag is ordered by the three fields name, age and gpa, the output of b is the same in the 'spark' and 'mapreduce' engines.

> e2e test "RubyUDFs_13" fails because of the different result of "group a all"
> in different engines like "spark", "mapreduce"
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-4345
>                 URL: https://issues.apache.org/jira/browse/PIG-4345
>             Project: Pig
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: PIG-4345.patch, PIG-4345_1.patch
>
>
> The RubyUDFs e2e script is at line 3818 of nightly.conf:
> {code}
> 'num' => 13,
> 'java_params' => ['-Dpig.accumulative.batchsize=5'],
> 'pig' => q\
> register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
> a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
> age:int, gpa:double);
> b = foreach (group a all) generate FLATTEN(myfuncs.AppendIndex(a));
> store b into ':OUTPATH:';\,
> 'verify_pig_script' => q\
> register :FUNCPATH:/testudf.jar;
> a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
> age:int, gpa:double);
> b = foreach (group a all) generate
> FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a));
> store b into ':OUTPATH:';\,
> },
> ]
> },
> {code}
> RubyUDFs_13.pig tests the ruby udf "AppendIndex" in "morerubyudfs.rb".
> The output is compared with that of the verify script, which uses the java udf
> "org.apache.pig.test.udf.evalfunc.AppendIndex". The output of
> "RubyUDFs_13.pig" looks like the following.
> If the test file “studenttab10k” is
> tom thompson 42 0.53
> nick johnson 34 0.47
> priscilla falkner 55 1.16
> the result in the spark engine will be:
> tom thompson 42 0.53 1
> nick johnson 34 0.47 2
> priscilla falkner 55 1.16 3
> the result in the mapreduce engine, which the verify script uses, will be:
> priscilla falkner 55 1.16 1
> nick johnson 34 0.47 2
> tom thompson 42 0.53 3
> The difference between the results in the spark and mapreduce engines causes
> the RubyUDFs_13 e2e test to fail.
> The root cause of the difference is that “group a all” returns the tuples
> in a different order in different engines.
> In the spark engine, “group a all”:
> all {(tom thompson 42 0.53),(nick johnson 34 0.47),(priscilla falkner 55 1.16)}
> In the mapreduce engine, “group a all”:
> all {(priscilla falkner 55 1.16),(nick johnson 34 0.47),(tom thompson 42 0.53)}
> With PIG-4345.patch, the RubyUDFs_13 e2e test passes.
> {code}
> {
> 'num' => 13,
> 'java_params' => ['-Dpig.accumulative.batchsize=5'],
> 'pig' => q\
> register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
> a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
> age:int, gpa:double);
> a1 = filter a by name == 'nick johnson';
> a2 = filter a1 by age == 34;
> b = foreach (group a2 all) generate FLATTEN(myfuncs.AppendIndex(a2));
> store b into ':OUTPATH:';\,
> 'verify_pig_script' => q\
> register :FUNCPATH:/testudf.jar;
> a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name,
> age:int, gpa:double);
> a1 = filter a by name == 'nick johnson';
> a2 = filter a1 by age == 34;
> b = foreach (group a2 all) generate
> FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a2));
> store b into ':OUTPATH:';\,
> },
> ]
> },
> {code}
> With PIG-4345.patch, the result in both the spark and mapreduce engines will be:
> nick johnson 34 0.47 1
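
To illustrate why ordering the bag before calling AppendIndex makes the test engine-independent, here is a minimal Python sketch. It is a simulation only, not Pig or the actual UDF: `append_index` is a hypothetical stand-in that assumes AppendIndex attaches a 1-based position to each tuple in iteration order, and the two lists simply mimic the bag orders reported above for the spark and mapreduce engines.

```python
# Hypothetical stand-in for the AppendIndex UDF (assumption: it appends a
# 1-based position to each tuple in the order the bag is iterated).
def append_index(bag):
    return [row + (i,) for i, row in enumerate(bag, start=1)]

rows = [
    ("tom thompson", 42, 0.53),
    ("nick johnson", 34, 0.47),
    ("priscilla falkner", 55, 1.16),
]

# Bag orders as described in the issue; the engines themselves are not run here.
spark_bag = list(rows)
mapreduce_bag = list(reversed(rows))

# Without an explicit order, the same rows receive different indexes.
unordered_match = append_index(spark_bag) == append_index(mapreduce_bag)

# Sorting by (name, age, gpa) first, analogous to "order a by name,age,gpa",
# makes the result independent of the incoming bag order.
ordered_spark = append_index(sorted(spark_bag))
ordered_mapreduce = append_index(sorted(mapreduce_bag))
ordered_match = ordered_spark == ordered_mapreduce

print(unordered_match)  # False
print(ordered_match)    # True
```

Sorting on all three fields gives a total order as long as no two rows are identical in name, age and gpa; fully duplicate rows would still be interchangeable, which does not change the indexed output.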