[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641377#comment-16641377
 ] 

Hyukjin Kwon commented on SPARK-25648:
--

{quote}
There is some results lost with the parameter spark.sql.orc.impl set to native, 
and the returned row count is less then the result count returned by HIVE.
{quote}

Can you make a self-contained reproducer? Also, it might be better, at least, 
to describe symptoms as detailed as possible, for instance, exactly which 
results were different. You can leave some related logs and console output as 
well.

Basically we should file an issue here in JIRA rather than asking 
investigations.

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing TPCx-BB[link title|www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench,] 
>  # The test data are loaded by spark-sql, the parameter 
> _spark_.sql._orc_.impl sets to native;
>  # During the engine validation power test,  when use the different read 
> engines that is set _spark_.sql._orc_.impl = hive or _spark_.sql._orc_.impl = 
> native, the q22 return different results. When set to hive,  the result is 
> right, but set to native, less results are returned. Can someone help to find 
> why it happens.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640460#comment-16640460
 ] 

Dongjoon Hyun commented on SPARK-25648:
---

Could you try Apache Spark 2.3.2, too?

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing TPCx-BB[link title|www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench,] 
>  # The test data are loaded by spark-sql, the parameter 
> _spark_.sql._orc_.impl sets to native;
>  # During the engine validation power test,  when use the different read 
> engines that is set _spark_.sql._orc_.impl = hive or _spark_.sql._orc_.impl = 
> native, the q22 return different results. When set to hive,  the result is 
> right, but set to native, less results are returned. Can someone help to find 
> why it happens.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639998#comment-16639998
 ] 

Dongjoon Hyun commented on SPARK-25648:
---

Thank you for reporting, [~justinnju].

How about Parquet result? Since Spark's default data source is Parquet, we had 
better compare with Parquet.

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing TPCx-BB[link title|www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench,] 
>  # The test data are loaded by spark-sql, the parameter 
> _spark_.sql._orc_.impl sets to native;
>  # During the engine validation power test,  when use the different read 
> engines that is set _spark_.sql._orc_.impl = hive or _spark_.sql._orc_.impl = 
> native, the q22 return different results. When set to hive,  the result is 
> right, but set to native, less results are returned. Can someone help to find 
> why it happens.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Jun Zheng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639881#comment-16639881
 ] 

Jun Zheng commented on SPARK-25648:
---

Hi [~hyukjin.kwon] 

Here is brief steps:
 # Use the data generation followed by readme in  
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench],
 # then try to do VALIDATE_POWER_TEST, which set 
workload=ENGINE_VALIDATION_POWER_TEST in the file  
https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/conf/bigBench.properties
 # when execute the q22 , and  the valiation fails, the detailed the sql listed 
in 
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/engines/spark/queries/q22/q22.sql]
 , but i use the hive to execute the same sql in HIVE, the validation is OK. 
There is some results lost with the parameter _spark_.sql._orc_.impl set to 
native, and the returned row count is less then the result count returned by 
HIVE.

Thanks ALL.

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing TPCx-BB[link title|www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench,] 
>  # The test data are loaded by spark-sql, the parameter 
> _spark_.sql._orc_.impl sets to native;
>  # During the engine validation power test,  when use the different read 
> engines that is set _spark_.sql._orc_.impl = hive or _spark_.sql._orc_.impl = 
> native, the q22 return different results. When set to hive,  the result is 
> right, but set to native, less results are returned. Can someone help to find 
> why it happens.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639824#comment-16639824
 ] 

Hyukjin Kwon commented on SPARK-25648:
--

[~justinnju] What results were different? How can we reproduce this? Was this 
dataloss? or correctness issue?

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing TPCx-BB[link title|www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench,] 
>  # The test data are loaded by spark-sql, the parameter 
> _spark_.sql._orc_.impl sets to native;
>  # During the engine validation power test,  when use the different read 
> engines that is set _spark_.sql._orc_.impl = hive or _spark_.sql._orc_.impl = 
> native, the q02 return different results. When set to hive,  the result is 
> right, but set to native, less results are returned. Can someone help to find 
> why it happens.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639723#comment-16639723
 ] 

Marco Gaido commented on SPARK-25648:
-

cc [~dongjoon]

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing TPCx-BB[link title|www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench,] 
>  # The test data are loaded by spark-sql, the parameter 
> _spark_.sql._orc_.impl sets to native;
>  # During the engine validation power test,  when use the different read 
> engines that is set _spark_.sql._orc_.impl = hive or _spark_.sql._orc_.impl = 
> native, the q02 return different results. When set to hive,  the result is 
> right, but set to native, less results are returned. Can someone help to find 
> why it happens.
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org