[jira] [Commented] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639881#comment-16639881 ] Jun Zheng commented on SPARK-25648: ---
Hi [~hyukjin.kwon] Here are the brief steps:
# Generate the data by following the README in [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench];
# then run the VALIDATE_POWER_TEST, i.e. set workload=ENGINE_VALIDATION_POWER_TEST in https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/conf/bigBench.properties;
# when q22 is executed, the validation fails. The SQL in question is listed in [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench/blob/master/engines/spark/queries/q22/q22.sql], but when I execute the same SQL in Hive, the validation is OK.
Some results are lost with the parameter spark.sql.orc.impl set to native: the returned row count is less than the row count returned by Hive. Thanks all.
> Spark 2.3.1 reads orc format files with native and hive, and return
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Jun Zheng
> Priority: Major
>
> Hi All
> I am testing [TPCx-BB|www.tpc.org/tpcx-bb/default.asp] with the code from
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]
> # The test data are loaded by spark-sql with the parameter spark.sql.orc.impl set to native;
> # During the engine validation power test, when using different read engines, i.e. spark.sql.orc.impl = hive or spark.sql.orc.impl = native, q22 returns different results. When set to hive, the result is right, but when set to native, fewer results are returned. Can someone help find out why this happens?
> Thanks in advance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
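To make the reported comparison concrete, the reader implementation can be switched per session in spark-sql. This is only a sketch of the reproduction; {{store_sales}} stands in for whichever ORC-backed benchmark table q22 reads, and is an assumed name here:

```sql
-- Sketch: run the same count under both ORC reader implementations.
-- `store_sales` is a stand-in for any ORC-backed benchmark table.
SET spark.sql.orc.impl=hive;    -- Hive-based ORC reader (validation passes)
SELECT COUNT(*) FROM store_sales;

SET spark.sql.orc.impl=native;  -- native ORC reader (fewer rows reported)
SELECT COUNT(*) FROM store_sales;
```

Comparing the two counts on the same data isolates the reader implementation as the only variable.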
[jira] [Updated] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
[ https://issues.apache.org/jira/browse/SPARK-25648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Zheng updated SPARK-25648: --
Description:
Hi All
I am testing [TPCx-BB|www.tpc.org/tpcx-bb/default.asp] with the code from [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]
# The test data are loaded by spark-sql with the parameter spark.sql.orc.impl set to native;
# During the engine validation power test, when using different read engines, i.e. spark.sql.orc.impl = hive or spark.sql.orc.impl = native, q22 returns different results. When set to hive, the result is right, but when set to native, fewer results are returned. Can someone help find out why this happens?
Thanks in advance

was:
Hi All
I am testing [TPCx-BB|www.tpc.org/tpcx-bb/default.asp] with the code from [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]
# The test data are loaded by spark-sql, the parameter spark.sql.orc.impl set to native;
# During the engine validation power test, when using different read engines, i.e. spark.sql.orc.impl = hive or spark.sql.orc.impl = native, q02 returns different results. When set to hive, the result is right, but when set to native, fewer results are returned. Can someone help to find why it happens.
Thanks in advance
> Spark 2.3.1 reads orc format files with native and hive, and return
> different results
> --
>
> Key: SPARK-25648
> URL: https://issues.apache.org/jira/browse/SPARK-25648
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Jun Zheng
> Priority: Major
>
> Hi All
> I am testing [TPCx-BB|www.tpc.org/tpcx-bb/default.asp] with the code from
> [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]
> # The test data are loaded by spark-sql with the parameter spark.sql.orc.impl set to native;
> # During the engine validation power test, when using different read engines, i.e. spark.sql.orc.impl = hive or spark.sql.orc.impl = native, q22 returns different results. When set to hive, the result is right, but when set to native, fewer results are returned. Can someone help find out why this happens?
> Thanks in advance
[jira] [Created] (SPARK-25648) Spark 2.3.1 reads orc format files with native and hive, and return different results
Jun Zheng created SPARK-25648: - Summary: Spark 2.3.1 reads orc format files with native and hive, and return different results Key: SPARK-25648 URL: https://issues.apache.org/jira/browse/SPARK-25648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Jun Zheng
Hi All
I am testing [TPCx-BB|www.tpc.org/tpcx-bb/default.asp] with the code from [https://github.com/BigData-Lab-Frankfurt/Big-Data-Benchmark-for-Big-Bench]
# The test data are loaded by spark-sql with the parameter spark.sql.orc.impl set to native;
# During the engine validation power test, when using different read engines, i.e. spark.sql.orc.impl = hive or spark.sql.orc.impl = native, q02 returns different results. When set to hive, the result is right, but when set to native, fewer results are returned. Can someone help find out why this happens?
Thanks in advance
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189472#comment-15189472 ] Jun Zheng commented on SPARK-12347: --- Thanks, can you assign this task to me? > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}?
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080564#comment-15080564 ] Jun Zheng commented on SPARK-12347: ---
1. How can we programmatically detect whether an example requires input? I can see that a {{.required()}} call in OptionParser indicates the example needs input, but not all examples that need input use that keyword.
2. How should input file names be set other than hard-coding them?
> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
> Issue Type: Test
> Components: ML, MLlib, PySpark, SparkR, Tests
> Reporter: Joseph K. Bradley
>
> It would facilitate testing to have a script which runs all MLlib examples
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?
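The design sketch quoted above (programmatic discovery plus a special-case list for examples that need command-line arguments) could be prototyped roughly as follows in plain Python. The {{SPECIAL_EXAMPLES}} entries and the file layout are assumptions for illustration, not Spark's actual build tooling:

```python
import os

# Hypothetical set of examples that require command-line arguments and
# therefore need special handling rather than a plain invocation.
SPECIAL_EXAMPLES = {"dataframe_example.py"}


def find_examples(root):
    """Walk `root` and return example scripts to run directly,
    skipping the special-cased ones that need arguments."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".py") and name not in SPECIAL_EXAMPLES:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Generating the list by walking the examples directory, rather than maintaining a fixed list, means newly added examples are picked up automatically and only the exception list needs manual upkeep.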
[jira] [Commented] (SPARK-11665) Support other distance metrics for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006793#comment-15006793 ] Jun Zheng commented on SPARK-11665: --- If no one else is interested, can you assign it to me?
> Support other distance metrics for bisecting k-means
>
>
> Key: SPARK-11665
> URL: https://issues.apache.org/jira/browse/SPARK-11665
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Yu Ishikawa
> Priority: Minor
>
> Some people have requested support for other distance metrics, such as cosine
> distance and Tanimoto distance, in bisecting k-means.
> We should
> - design the interfaces for distance metrics
> - support the distances
[jira] [Commented] (SPARK-11665) Support other distance metrics for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003186#comment-15003186 ] Jun Zheng commented on SPARK-11665: --- In both bisecting k-means and regular k-means, the distance metric is always Euclidean. Should we support the new metric choices for both of them?
> Support other distance metrics for bisecting k-means
>
>
> Key: SPARK-11665
> URL: https://issues.apache.org/jira/browse/SPARK-11665
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Yu Ishikawa
> Priority: Minor
>
> Some people have requested support for other distance metrics, such as cosine
> distance and Tanimoto distance, in bisecting k-means.
> We should
> - design the interfaces for distance metrics
> - support the distances
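For reference, the metrics under discussion differ only in the pairwise distance function, which is why a pluggable interface is natural. A minimal plain-Python sketch (not Spark's actual MLlib API) of Euclidean distance alongside cosine distance:

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

Note the behavioral difference: cosine distance treats parallel vectors of different magnitudes as identical, while Euclidean distance does not, so the two metrics can split clusters very differently.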
[jira] [Commented] (SPARK-11560) Optimize KMeans implementation
[ https://issues.apache.org/jira/browse/SPARK-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994556#comment-14994556 ] Jun Zheng commented on SPARK-11560: --- By simplification, do you mean we assume the var {{runs}} in {{initRandom}} and {{initKMeansParallel}} is always 1?
> Optimize KMeans implementation
> --
>
> Key: SPARK-11560
> URL: https://issues.apache.org/jira/browse/SPARK-11560
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.7.0
> Reporter: Xiangrui Meng
>
> After we dropped `runs`, we can simplify and optimize the k-means
> implementation.