liyunzhang_intel created PIG-4345:
-------------------------------------

             Summary: e2e test "RubyUDFs_13" fails because of the different result of "group a all" in different engines like "spark", "mapreduce"
                 Key: PIG-4345
                 URL: https://issues.apache.org/jira/browse/PIG-4345
             Project: Pig
          Issue Type: Bug
            Reporter: liyunzhang_intel
            Assignee: liyunzhang_intel

The RubyUDFs e2e script is at line 3818 of nightly.conf:
{code}
                    'num' => 13,
                    'java_params' => ['-Dpig.accumulative.batchsize=5'],
                    'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(myfuncs.AppendIndex(a));
store b into ':OUTPATH:';\,
                    'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
b = foreach (group a all) generate FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a));
store b into ':OUTPATH:';\,
                    },
                ]
            },
{code}

RubyUDFs_13.pig tests the Ruby UDF "AppendIndex" in "morerubyudfs.rb". Its output is compared with the output of the verify script, which uses the Java UDF "org.apache.pig.test.udf.evalfunc.AppendIndex".

The output of "RubyUDFs_13.pig" depends on the engine. For example, if the test file "studenttab10k" contains:
  tom thompson 42 0.53
  nick johnson 34 0.47
  priscilla falkner 55 1.16

the result in the spark engine will be:
  tom thompson 42 0.53 1
  nick johnson 34 0.47 2
  priscilla falkner 55 1.16 3

and the result in the mapreduce engine, which the verify script uses, will be:
  priscilla falkner 55 1.16 1
  nick johnson 34 0.47 2
  tom thompson 42 0.53 3

This difference between the spark and mapreduce results causes the RubyUDFs_13 e2e test to fail. The root cause is that "group a all" returns the tuples of the bag in a different order in each engine, and AppendIndex numbers the tuples in the order it receives them.

In the spark engine, "group a all" gives:
  all {(tom thompson 42 0.53),(nick johnson 34 0.47),(priscilla falkner 55 1.16)}

In the mapreduce engine, "group a all" gives:
  all {(priscilla falkner 55 1.16),(nick johnson 34 0.47),(tom thompson 42 0.53)}

If the test script is modified as follows, the RubyUDFs_13 e2e test passes:

{code}
                    {
                    'num' => 13,
                    'java_params' => ['-Dpig.accumulative.batchsize=5'],
                    'pig' => q\
register ':SCRIPTHOMEPATH:/ruby/morerubyudfs.rb' using jruby as myfuncs;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b = foreach (group a2 all) generate FLATTEN(myfuncs.AppendIndex(a2));
store b into ':OUTPATH:';\,
                    'verify_pig_script' => q\
register :FUNCPATH:/testudf.jar;
a = load ':INPATH:/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double);
a1 = filter a by name == 'nick johnson';
a2 = filter a1 by age == 34;
b = foreach (group a2 all) generate FLATTEN(org.apache.pig.test.udf.evalfunc.AppendIndex(a2));
store b into ':OUTPATH:';\,
                    },
                ]
            },
{code}

With the modified test script, the result in both the spark and mapreduce engines is:
  nick johnson 34 0.47 1

The two filters leave a single tuple in the bag, so the order in which "group a2 all" returns its tuples no longer affects the output.
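
To make the ordering problem concrete, here is a minimal Python sketch of the behaviour described above. It is only an illustration: `append_index` is a hypothetical stand-in for the AppendIndex UDF (assumed to append a 1-based position to each tuple, as the sample output suggests), and the two bag orders are the ones shown in this report for the three sample rows. The last step, sorting the bag before indexing, is an alternative that is not proposed in this report and is shown only for comparison.

```python
# Hypothetical stand-in for AppendIndex: append a 1-based position to every
# tuple, in whatever order the bag delivers them.
def append_index(bag):
    return [t + (i,) for i, t in enumerate(bag, start=1)]

rows = [
    ("tom thompson", 42, 0.53),
    ("nick johnson", 34, 0.47),
    ("priscilla falkner", 55, 1.16),
]

# Bag order reported for each engine in this issue.
spark_bag = list(rows)
mapreduce_bag = list(reversed(rows))

spark_out = append_index(spark_bag)
mapreduce_out = append_index(mapreduce_bag)

# The index is attached to different tuples, so the outputs differ even if
# both result sets are sorted before being compared.
outputs_match = sorted(spark_out) == sorted(mapreduce_out)

# The modified test filters down to one tuple, so order cannot matter.
single = [r for r in rows if r[0] == "nick johnson" and r[1] == 34]
single_out = append_index(single)

# Alternative (an assumption, not from the report): impose an order on the
# bag before indexing, which also makes the result engine-independent.
sorted_match = append_index(sorted(spark_bag)) == append_index(sorted(mapreduce_bag))

print(outputs_match, single_out, sorted_match)
```

The filter-based change in the modified script corresponds to the `single` case: it sidesteps the ordering question rather than fixing it, which is enough for the test to compare the Ruby and Java UDFs.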