[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641377#comment-16641377 ]

Hyukjin Kwon commented on SPARK-25648:
--------------------------------------

{quote}
There is some results lost with the parameter spark.sql.orc.impl set to native, and the returned row count is less than the result count returned by HIVE.
{quote}

Can you make a self-contained reproducer? Also, it would be better to describe the symptoms in as much detail as possible, for instance, exactly which results were different. You can attach related logs and console output as well. Basically, we should file a concrete issue here in JIRA rather than ask for an investigation.

> Spark 2.3.1 reads orc format files with native and hive, and return different results
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-25648
>                 URL: https://issues.apache.org/jira/browse/SPARK-25648
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Jun Zheng
>            Priority: Major
>
> Hi All
> I am testing TPCx-BB (www.tpc.org/tpcx-bb/default.asp) with the code from
> https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench
> # The test data are loaded by spark-sql with spark.sql.orc.impl set to native.
> # During the engine validation power test, q22 returns different results depending on the read engine, that is, spark.sql.orc.impl = hive or spark.sql.orc.impl = native. With hive the result is right, but with native fewer results are returned. Can someone help find out why this happens?
> Thanks in advance

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
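One way to answer the request above for "exactly which results were different" is a row-level diff of the two result sets. The following is only a minimal sketch, not something taken from the thread: `diff_rows`, `compare_orc_readers`, and `run_query` are hypothetical names, rows are assumed to be collected as plain hashable tuples, and the hive reader is treated as the reference because the report says its result is the right one.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple  # one result row, as a plain (hashable) tuple of column values


def diff_rows(expected: Iterable[Row], actual: Iterable[Row]) -> Dict[str, List[Row]]:
    """Multiset diff of two result sets; duplicates count, order does not."""
    exp, act = Counter(expected), Counter(actual)
    # Counter subtraction keeps only positive counts, so each side lists
    # exactly the rows (with multiplicity) that the other side lacks.
    # Sorting by repr avoids comparing None with other column values.
    return {
        "missing": sorted((exp - act).elements(), key=repr),
        "extra": sorted((act - exp).elements(), key=repr),
    }


def compare_orc_readers(run_query: Callable[[str], Iterable[Row]]) -> Dict[str, object]:
    """Run the same query once per ORC reader and report the differences.

    `run_query(impl)` is a caller-supplied (hypothetical) function that runs
    the query with spark.sql.orc.impl set to `impl` and returns its rows.
    With PySpark it might look like this, assuming an existing SparkSession
    `spark` and the query text in `sql`:

        def run_query(impl):
            spark.conf.set("spark.sql.orc.impl", impl)
            return [tuple(r) for r in spark.sql(sql).collect()]
    """
    hive_rows = list(run_query("hive"))      # treated as the reference result
    native_rows = list(run_query("native"))
    report: Dict[str, object] = dict(diff_rows(hive_rows, native_rows))
    report["hive_count"] = len(hive_rows)
    report["native_count"] = len(native_rows)
    return report
```

The `missing` list (rows the hive reader returned but the native reader did not), together with the two counts, is the kind of detail that could be pasted into the issue alongside the query text.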
[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640460#comment-16640460 ]

Dongjoon Hyun commented on SPARK-25648:
---------------------------------------

Could you try Apache Spark 2.3.2, too?
[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639998#comment-16639998 ]

Dongjoon Hyun commented on SPARK-25648:
---------------------------------------

Thank you for reporting, [~justinnju].

How about the Parquet result? Since Spark's default data source is Parquet, we had better compare with Parquet.
[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639881#comment-16639881 ]

Jun Zheng commented on SPARK-25648:
-----------------------------------

Hi [~hyukjin.kwon]

Here are the brief steps:
# Generate the data by following the README in https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench
# Run VALIDATE_POWER_TEST by setting workload=ENGINE_VALIDATION_POWER_TEST in the file https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/conf/bigBench.properties
# Execute q22; the validation fails. The SQL is listed in https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/engines/spark/queries/q22/q22.sql. When I execute the same SQL in Hive, the validation is OK.

Some results are lost when spark.sql.orc.impl is set to native: the returned row count is less than the row count returned by Hive.

Thanks all.
[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639824#comment-16639824 ]

Hyukjin Kwon commented on SPARK-25648:
--------------------------------------

[~justinnju] What results were different? How can we reproduce this? Was this data loss or a correctness issue?
[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639723#comment-16639723 ]

Marco Gaido commented on SPARK-25648:
-------------------------------------

cc [~dongjoon]