[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Jun Zheng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639881#comment-16639881
 ] 

Jun Zheng commented on SPARK-25648:
---

Hi [~hyukjin.kwon] 

Here are the brief steps:
 # Generate the data by following the README in 
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench].
 # Then run the validation power test by setting 
workload=ENGINE_VALIDATION_POWER_TEST in the file 
https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/conf/bigBench.properties
 # When q22 executes, the validation fails. The detailed SQL is listed in 
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/engines/spark/queries/q22/q22.sql],
but when I execute the same SQL in Hive, the validation passes. Some results 
are lost with the parameter spark.sql.orc.impl set to native: the returned 
row count is less than the row count returned by Hive.
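A quick way to narrow this down is to run the same query file under both ORC readers and compare row counts (a sketch, assuming a local spark-sql installation with the benchmark tables already loaded; the query path follows the benchmark repo layout):

```shell
# Compare row counts for the q22 SQL under the two ORC readers.
# The hive reader is the reference; a smaller count under native
# suggests rows are being dropped by the native ORC reader.
spark-sql --conf spark.sql.orc.impl=hive \
  -f engines/spark/queries/q22/q22.sql | wc -l

spark-sql --conf spark.sql.orc.impl=native \
  -f engines/spark/queries/q22/q22.sql | wc -l
```

If the counts differ only under native, that isolates the regression to the native ORC reader rather than the query itself.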

Thanks ALL.

> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing [TPCx-BB|http://www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench].
>  # The test data are loaded by spark-sql with the parameter 
> spark.sql.orc.impl set to native;
>  # During the engine validation power test, q22 returns different results 
> depending on the read engine (spark.sql.orc.impl = hive vs. 
> spark.sql.orc.impl = native). When set to hive, the result is correct, but 
> when set to native, fewer rows are returned. Can someone help find out why 
> this happens?
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Jun Zheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Zheng updated SPARK-25648:
--
Description: 
Hi All

I am testing [TPCx-BB|http://www.tpc.org/tpcx-bb/default.asp] with the code 
from 
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]. 
 # The test data are loaded by spark-sql with the parameter spark.sql.orc.impl 
set to native;
 # During the engine validation power test, q22 returns different results 
depending on the read engine (spark.sql.orc.impl = hive vs. spark.sql.orc.impl 
= native). When set to hive, the result is correct, but when set to native, 
fewer rows are returned. Can someone help find out why this happens?

Thanks in advance

  was:
Hi All

I am testing [TPCx-BB|http://www.tpc.org/tpcx-bb/default.asp] with the code 
from 
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]. 
 # The test data are loaded by spark-sql with the parameter spark.sql.orc.impl 
set to native;
 # During the engine validation power test, q02 returns different results 
depending on the read engine (spark.sql.orc.impl = hive vs. spark.sql.orc.impl 
= native). When set to hive, the result is correct, but when set to native, 
fewer rows are returned. Can someone help find out why this happens?

Thanks in advance


> Spark 2.3.1 reads orc format  files with native and hive, and return 
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jun Zheng
>Priority: Major
>
> Hi All
> I am testing [TPCx-BB|http://www.tpc.org/tpcx-bb/default.asp] with the 
> code from 
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench].
>  # The test data are loaded by spark-sql with the parameter 
> spark.sql.orc.impl set to native;
>  # During the engine validation power test, q22 returns different results 
> depending on the read engine (spark.sql.orc.impl = hive vs. 
> spark.sql.orc.impl = native). When set to hive, the result is correct, but 
> when set to native, fewer rows are returned. Can someone help find out why 
> this happens?
> Thanks in advance






[jira] [Created] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results

2018-10-05 Thread Jun Zheng (JIRA)
Jun Zheng created SPARK-25648:
-

 Summary: Spark 2.3.1 reads orc format  files with native and hive, 
and return different results
 Key: SPARK-25648
 URL: https://issues.apache.org/jira/browse/SPARK-25648
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Jun Zheng


Hi All

I am testing [TPCx-BB|http://www.tpc.org/tpcx-bb/default.asp] with the code 
from 
[https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]. 
 # The test data are loaded by spark-sql with the parameter spark.sql.orc.impl 
set to native;
 # During the engine validation power test, q02 returns different results 
depending on the read engine (spark.sql.orc.impl = hive vs. spark.sql.orc.impl 
= native). When set to hive, the result is correct, but when set to native, 
fewer rows are returned. Can someone help find out why this happens?

Thanks in advance






[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-03-10 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189472#comment-15189472
 ] 

Jun Zheng commented on SPARK-12347:
---

Thanks, can you assign this task to me?

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?
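The design sketch above could look roughly like this (a sketch in plain Python; the directory layout and the special-case example names are illustrative assumptions, not the real script):

```python
from pathlib import Path

# Hypothetical: examples that need command-line arguments, mapped to the
# arguments to pass. The names here are illustrative placeholders.
SPECIAL_CASES = {
    "needs_args_example.py": ["--input", "data/sample.txt"],
}

def discover_examples(examples_dir):
    """Build the run list programmatically instead of from a fixed list."""
    runs = []
    for script in sorted(Path(examples_dir).glob("**/*.py")):
        args = SPECIAL_CASES.get(script.name, [])
        runs.append([str(script)] + args)
    return runs
```

Each entry is a ready-to-run command line; a driver could then invoke them one by one (e.g. via `subprocess.run`) and report failures, with the per-language variants sharing the same discovery logic.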






[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-01-03 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080564#comment-15080564
 ] 

Jun Zheng commented on SPARK-12347:
---

1. How can we programmatically detect whether a test requires input? I can see 
that a ".required()" keyword in OptionParser indicates a test needs input, but 
not all tests that need input have this keyword.

2. How can we set input file names other than by hard-coding them?

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?






[jira] [Commented] (SPARK-11665) Support other distance metrics for bisecting k-means

2015-11-16 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006793#comment-15006793
 ] 

Jun Zheng commented on SPARK-11665:
---

If no one else is interested, can you assign it to me?

> Support other distance metrics for bisecting k-means
> 
>
> Key: SPARK-11665
> URL: https://issues.apache.org/jira/browse/SPARK-11665
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> Some people have requested support for other distance metrics, such as 
> cosine distance and Tanimoto distance, in bisecting k-means. 
> We should
> - design the interfaces for distance metrics
> - support the distances






[jira] [Commented] (SPARK-11665) Support other distance metrics for bisecting k-means

2015-11-12 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003186#comment-15003186
 ] 

Jun Zheng commented on SPARK-11665:
---

In both bisecting k-means and regular k-means, the distance metric is always 
Euclidean. Should we support the new metric choices for both of them?
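One possible shape for a pluggable metric interface that both k-means variants could share (a sketch in plain Python, not MLlib's actual API; the class and method names are illustrative):

```python
import math

class DistanceMeasure:
    """Illustrative interface for a pluggable distance metric."""
    def distance(self, a, b):
        raise NotImplementedError

class EuclideanDistance(DistanceMeasure):
    def distance(self, a, b):
        # Straight-line distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class CosineDistance(DistanceMeasure):
    def distance(self, a, b):
        # 1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)
```

Both k-means and bisecting k-means could then accept a `DistanceMeasure`, keeping Euclidean as the default so existing behavior is unchanged.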

> Support other distance metrics for bisecting k-means
> 
>
> Key: SPARK-11665
> URL: https://issues.apache.org/jira/browse/SPARK-11665
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> Some people have requested support for other distance metrics, such as 
> cosine distance and Tanimoto distance, in bisecting k-means. 
> We should
> - design the interfaces for distance metrics
> - support the distances






[jira] [Commented] (SPARK-11560) Optimize KMeans implementation

2015-11-06 Thread Jun Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994556#comment-14994556
 ] 

Jun Zheng commented on SPARK-11560:
---

By "simplification", do you mean we assume the var "runs" in "initRandom" and 
"initKMeansParallel" is always 1?

> Optimize KMeans implementation
> --
>
> Key: SPARK-11560
> URL: https://issues.apache.org/jira/browse/SPARK-11560
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.7.0
>Reporter: Xiangrui Meng
>
> After we dropped `runs`, we can simplify and optimize the k-means 
> implementation.
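For illustration, with `runs` fixed at 1, random initialization collapses to a single sample of k points (a sketch in plain Python, not the MLlib code):

```python
import random

def init_random(points, k, seed=0):
    # With `runs` dropped, random initialization is just one sample of
    # k distinct points used as the initial cluster centers, instead of
    # a loop producing one candidate set per run.
    rng = random.Random(seed)
    return rng.sample(points, k)
```

The same collapse applies to the k-means|| path: the per-run bookkeeping disappears, which is what enables the broader simplification the issue describes.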


