[GitHub] spark issue #22479: [MINOR][PYTHON][TEST] Use collect() instead of show() to...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22479 Thanks @HyukjinKwon. LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22479: [MINOR][PYTHON][TEST] Use collect() instead of sh...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22479#discussion_r219053623
--- Diff: python/pyspark/sql/tests.py ---
@@ -1168,7 +1168,7 @@ def test_simple_udt_in_df(self):
         df = self.spark.createDataFrame(
             [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)],
             schema=schema)
-        df.show()
+        df.collect()
--- End diff --
LGTM
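The diff above replaces show() with collect() in a test. A minimal pure-Python analogy (not Spark itself; the generator and helper names are illustrative) of why forcing full materialization in a test can catch per-row errors that a truncated preview misses:

```python
# Hedged analogy: show() only evaluates the rows it prints, while
# collect() materializes every row, so collect() in a test surfaces
# evaluation errors hiding past the preview.

def rows():
    for i in range(10):
        if i == 7:
            raise ValueError("bad row")  # error beyond the preview window
        yield i

def show(it, n=5):
    # Print only the first n rows, loosely like DataFrame.show().
    for _, row in zip(range(n), it):
        print(row)

def collect(it):
    # Materialize everything, loosely like DataFrame.collect().
    return list(it)

show(rows())          # succeeds: never reaches the bad row
try:
    collect(rows())   # full evaluation reaches row 7 and fails
except ValueError as e:
    print("caught:", e)
```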
[GitHub] spark pull request #22483: [MINOR][PYTHON] Use a helper in `PythonUtils` ins...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/22483 [MINOR][PYTHON] Use a helper in `PythonUtils` instead of direct accessing Scala package ## What changes were proposed in this pull request? This PR proposes to add a helper in `PythonUtils` instead of directly accessing the Scala package. ## How was this patch tested? Jenkins tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark minor-refactoring Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22483.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22483 commit cce9d4d1bb6c297e15dbec5b53f8ed3163e88d9c Author: hyukjinkwon Date: 2018-09-20T06:34:54Z Minor refactoring
[GitHub] spark issue #21596: [SPARK-24601] Update Jackson to 2.9.6
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21596 **[Test build #96330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96330/testReport)** for PR 21596 at commit [`44b8d1b`](https://github.com/apache/spark/commit/44b8d1b73cf2cc83b4ebfcc11ccf12951878f2d6).
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Merged build finished. Test FAILed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96327/ Test FAILed.
[GitHub] spark pull request #22479: [MINOR][PYTHON][TEST] Use collect() instead of sh...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22479#discussion_r219052270
--- Diff: python/pyspark/sql/tests.py ---
@@ -1168,7 +1168,7 @@ def test_simple_udt_in_df(self):
         df = self.spark.createDataFrame(
             [(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)],
             schema=schema)
-        df.show()
+        df.collect()
--- End diff --
cc @viirya since this is added [here](https://github.com/apache/spark/commit/146001a9ffefc7aaedd3d888d68c7a9b80bca545#diff-7c2fe8530271c0635fb99f7b49e0c4a4R583).
[GitHub] spark issue #22399: [SPARK-25408] Move to mode ideomatic Java8
Github user Fokko commented on the issue: https://github.com/apache/spark/pull/22399 @srowen Any incentive to move this forward? Or are PRs like these not appreciated? Let me know. Most of the changes are cosmetic, but having https://github.com/apache/spark/pull/22399/files#diff-6c2c45f79666e2e52eb9f9411fa8b4baR49 makes the codebase a bit nicer in my opinion: since the class already has a `.close()` method, it makes sense to also implement the `Closeable` interface.
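The point above about implementing `Closeable` when a class already exposes `close()` has a direct analogue in Python (a hedged illustration with a toy class, not the Java code under review): any object with a `close()` method can be adapted with `contextlib.closing` so callers get deterministic cleanup, much as implementing `Closeable` lets Java callers use try-with-resources.

```python
import contextlib

class Resource:
    """Toy resource with a close() method, standing in for a class
    that already defines .close() but no context-manager protocol."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# contextlib.closing wraps any object exposing close() as a context
# manager, guaranteeing close() runs when the with-block exits.
with contextlib.closing(Resource()) as r:
    assert not r.closed  # still open inside the block
print(r.closed)  # closed once the block exits
```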
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22443 Thank you, @gengliangwang !
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22408 LGTM
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22443 @dongjoon-hyun No problem. I was waiting for this PR to be merged.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96310/ Test FAILed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Merged build finished. Test FAILed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22460 **[Test build #96310 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96310/testReport)** for PR 22460 at commit [`09baf06`](https://github.com/apache/spark/commit/09baf06505f9da34cdcccdffcc1a4061ed825f44). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22460: DO NOT MERGE
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22460 **[Test build #4344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4344/testReport)** for PR 22460 at commit [`4106040`](https://github.com/apache/spark/commit/410604012cbd1c9e7c284a1e05f95b3827c728a5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22467: [SPARK-25465][TEST] Refactor Parquet test suites in proj...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22467 @xuanyuanking Thanks for the review!
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22408 **[Test build #96329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96329/testReport)** for PR 22408 at commit [`d79e9d4`](https://github.com/apache/spark/commit/d79e9d46bca28c721887625b89814e91e923e7ca).
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96328/testReport)** for PR 22482 at commit [`ad0b746`](https://github.com/apache/spark/commit/ad0b7466ef3f79354a99bd1b95c23e4c308502d5).
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3283/ Test PASSed.
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22408 Merged build finished. Test PASSed.
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22408 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3284/ Test PASSed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Merged build finished. Test PASSed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22460 **[Test build #96327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96327/testReport)** for PR 22460 at commit [`3bdb38a`](https://github.com/apache/spark/commit/3bdb38aec74b08b135aa5976982c20f74aae9736).
[GitHub] spark issue #22467: [SPARK-25465][TEST] Refactor Parquet test suites in proj...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22467 Merged build finished. Test PASSed.
[GitHub] spark issue #22467: [SPARK-25465][TEST] Refactor Parquet test suites in proj...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22467 **[Test build #96326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96326/testReport)** for PR 22467 at commit [`11d61a4`](https://github.com/apache/spark/commit/11d61a414ee41449feb2db744657696d79db5560).
[GitHub] spark issue #22467: [SPARK-25465][TEST] Refactor Parquet test suites in proj...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22467 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3282/ Test PASSed.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22462 LGTM
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219039607
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23))"),
--- End diff --
@cloud-fan Yes, it should :-) I think I had changed this test case to verify the fix to tightestCommonType and pushed it by mistake. Sorry about it.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22462 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3281/ Test PASSed.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22462 Merged build finished. Test PASSed.
[GitHub] spark issue #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not respec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22462 **[Test build #96325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96325/testReport)** for PR 22462 at commit [`897cf69`](https://github.com/apache/spark/commit/897cf69a4b3c6eb07eb321c23644167c1bed211b).
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219039187
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.3 to 2.4
+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |
--- End diff --
Ah then it's fine, we don't need to change anything here.
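The behavior change discussed in this thread can be sketched in plain Python (hypothetical helper names for illustration only, not Spark's actual implementation): Spark 2.3 cast the searched value down to the array's element type, which is lossy, while Spark 2.4 widens both sides to a common type before comparing.

```python
# Hypothetical sketch of the two promotion strategies under discussion;
# function names are illustrative, not Spark code.

def array_contains_lossy(arr, needle):
    # Spark <= 2.3 style: cast the needle to the array's element type.
    # int(1.34) truncates to 1, so array_contains(array(1), 1.34D)
    # wrongly returned true.
    elem_type = type(arr[0])
    return elem_type(needle) in arr

def array_contains_safe(arr, needle):
    # Spark 2.4 style: widen both sides to a common wider type
    # (double here) before comparing, so 1.34 no longer matches 1.
    return float(needle) in [float(x) for x in arr]

print(array_contains_lossy([1], 1.34))  # True  (lossy truncation)
print(array_contains_safe([1], 1.34))   # False (safe widening)
```

This mirrors the migration-guide table in the diff: the first query flips from true to false, and pairings with no loss-less common type (int vs. string) are rejected outright instead of being silently cast.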
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219038798
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.0D)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.0D), 1)"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1.23D), 1)"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.0D))"),
+      Seq(Row(true), Row(true))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23D))"),
+      Seq(Row(false), Row(false))
+    )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(array(1)), array(1.23))"),
--- End diff --
hmm? shouldn't this fail because of the bug?
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Merged build finished. Test FAILed.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96324/testReport)** for PR 22482 at commit [`7d8371c`](https://github.com/apache/spark/commit/7d8371c34fe275ba3186dc97d0844cfd90ba06ed). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96324/ Test FAILed.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22443 @gengliangwang, SPARK-25475 has been created as above. Could you revise https://github.com/apache/spark/pull/22451 in order to print the output as a separate file, like this PR does?
[GitHub] spark issue #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22475 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96309/ Test PASSed.
[GitHub] spark issue #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22475 Merged build finished. Test PASSed.
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219038350
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -735,6 +735,60 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
       df.selectExpr("array_contains(array(1, null), array(1, null)[0])"),
       Seq(Row(true), Row(true))
     )
+
+    checkAnswer(
+      df.selectExpr("array_contains(array(1), 1.23D)"),
--- End diff --
this query doesn't read any data from `df`, so the 2 result rows are always the same. Can we use `OneRowRelation` here?
[GitHub] spark issue #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22475 **[Test build #96309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96309/testReport)** for PR 22475 at commit [`5159883`](https://github.com/apache/spark/commit/5159883f5b4a65ac8ecec8b0368e172680aa6897). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96324/testReport)** for PR 22482 at commit [`7d8371c`](https://github.com/apache/spark/commit/7d8371c34fe275ba3186dc97d0844cfd90ba06ed).
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22443 https://issues.apache.org/jira/browse/SPARK-25475 is created.
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037732
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.3 to 2.4
+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |
--- End diff --
@cloud-fan Yeah, Presto gives an error. Please refer to my earlier comment showing the Presto output. Did you want anything changed in the description?
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037301
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.3 to 2.4
+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |
--- End diff --
If presto doesn't do it, we should follow it.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96323 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96323/testReport)** for PR 22482 at commit [`0072ebe`](https://github.com/apache/spark/commit/0072ebe1a46ff9d1230e18b33ca22c2f32cfb958). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Merged build finished. Test FAILed.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96323/ Test FAILed.
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037194
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.3 to 2.4
+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
+    | SELECT array_contains(array(1), '1'); | true | AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner. | |
--- End diff --
We can promote `int` to `string`, but I'm not sure that's a common behavior in other databases.
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219037086
--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,66 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.3 to 2.4
+  - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
+
+    | Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
+    | SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, both left and right parameters are promoted to array(double) and double type respectively. |
--- End diff --
remove `both`.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96323/testReport)** for PR 22482 at commit [`0072ebe`](https://github.com/apache/spark/commit/0072ebe1a46ff9d1230e18b33ca22c2f32cfb958).
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/22482 The patch is a bit huge, so I'm not sure whether we should squash the commits into one before reviewing. Two TODOs are left, hence marking the patch as WIP, but it is close to being a complete patch: 1. Optimal implementation of state for session window. It borrowed the state implementation from streaming join since it fits the necessary concept of state for session window, but it may not be the optimal one, so I'm going to see whether we can have a better implementation. 2. Javadoc (maybe the Structured Streaming guide doc too?). I didn't add javadoc yet to speed up the POC and actual development, but to complete the patch I guess I need to write javadoc for the new classes as well as methods (maybe).
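As background for the session-window semantics this PR implements natively (a generic sketch of the concept, not the PR's state-store code; the helper name is illustrative): unlike fixed windows, a session window grows as long as consecutive events arrive within the gap duration of each other, and a new session starts once the gap is exceeded.

```python
def sessionize(timestamps, gap):
    """Group sorted event times into session windows: a new session
    starts whenever the gap to the previous event exceeds `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # within gap: extend current session
        else:
            sessions.append([t])    # gap exceeded: open a new session
    return sessions

print(sessionize([1, 2, 3, 10, 11, 30], gap=5))
# [[1, 2, 3], [10, 11], [30]]
```

The hard part in streaming, which the TODO about state implementation addresses, is that sessions must be kept as mutable state across micro-batches and merged when late events bridge two existing sessions.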
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22408 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3280/ Test PASSed.
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22408 Merged build finished. Test PASSed.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Merged build finished. Test FAILed.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96321/ Test FAILed.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96321/testReport)** for PR 22482 at commit [`fb19879`](https://github.com/apache/spark/commit/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22482 **[Test build #96321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96321/testReport)** for PR 22482 at commit [`fb19879`](https://github.com/apache/spark/commit/fb19879ff2bbafdf7c844d1a8da9d30c07aefd76).
[GitHub] spark issue #22408: [SPARK-25417][SQL] ArrayContains function may return inc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22408 **[Test build #96322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96322/testReport)** for PR 22408 at commit [`df5ea47`](https://github.com/apache/spark/commit/df5ea4768781ac82927128b8dfeefb5ab421ee14).
[GitHub] spark issue #22482: WIP - [SPARK-10816][SS] Support session window natively
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22482 Can one of the admins verify this patch?
[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22365#discussion_r219034294 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala --- @@ -370,29 +370,76 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) { * @since 1.5.0 */ def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame = { +sampleBy(Column(col), fractions, seed) + } + + /** + * Returns a stratified sample without replacement based on the fraction given on each stratum. + * @param col column that defines strata + * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat + * its fraction as zero. + * @param seed random seed + * @tparam T stratum type + * @return a new `DataFrame` that represents the stratified sample + * + * @since 1.5.0 + */ + def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = { +sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed) + } + + /** + * Returns a stratified sample without replacement based on the fraction given on each stratum. + * @param col column that defines strata + * @param fractions sampling fraction for each stratum. If a stratum is not specified, we treat + * its fraction as zero. 
+ * @param seed random seed + * @tparam T stratum type + * @return a new `DataFrame` that represents the stratified sample + * + * The stratified sample can be performed over multiple columns: + * {{{ + *import org.apache.spark.sql.Row + *import org.apache.spark.sql.functions.struct + * + *val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), + * ("Alice", 10))).toDF("name", "age") + *val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0) + *df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show() + *+-----+---+ + *| name|age| + *+-----+---+ + *| Nico|  8| + *|Alice| 10| + *+-----+---+ + * }}} + * + * @since 3.0.0 --- End diff -- the next release is 2.5.0
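The semantics of `sampleBy` discussed in this hunk can be sketched in plain Python (an illustrative model, not Spark's implementation): each row is kept independently with the probability assigned to its stratum, and strata absent from `fractions` are treated as having fraction zero.

```python
import random

def sample_by(rows, key_fn, fractions, seed):
    # Stratified sample without replacement: keep each row with its
    # stratum's fraction; unlisted strata default to 0.0
    # ("we treat its fraction as zero").
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key_fn(row), 0.0)]

rows = [("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)]
fractions = {("Nico", 8): 1.0, ("Alice", 10): 0.3}
sampled = sample_by(rows, lambda row: row, fractions, seed=36)
# ("Nico", 8) is always kept (fraction 1.0); "Bob" rows are never kept
# (fraction 0.0 by default); "Alice" rows are kept about 30% of the time.
```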
[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22365 LGTM
[GitHub] spark pull request #22482: WIP - [SPARK-10816][SS] Support session window na...
GitHub user HeartSaVioR opened a pull request: https://github.com/apache/spark/pull/22482 WIP - [SPARK-10816][SS] Support session window natively ## What changes were proposed in this pull request? This patch proposes native support for session windows, like the support Spark already provides for time windows. Please refer to the doc attached to [SPARK-10816](https://issues.apache.org/jira/browse/SPARK-10816) for more details on the rationale, concepts, limitations, etc. From the end user's point of view, the only change is the addition of a "session" SQL function. End users can define a query with a session window by replacing the "window" function with the "session" function, and the "window" column with the "session" column. Beyond that, the patch provides the same experience as time windows. Internally, this patch changes the physical plan of aggregation a bit: if a session function is used in the query, it sorts the input rows by "grouping keys" + "session" and merges overlapping sessions into one while applying aggregations, so it is like a sort-based aggregation, but the unit of grouping is grouping keys + session. Because of late events, multiple session windows that are not yet evicted can co-exist per key. This patch handles that case by borrowing the state implementation from streaming join, which can handle multiple values for a given key. ## How was this patch tested? Many UTs are added to verify session window queries for both batch and streaming. Please review http://spark.apache.org/contributing.html before opening a pull request. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/HeartSaVioR/spark SPARK-10816 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22482.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22482 commit a1af74611df7dd5b979fc1a288de96e0b3d415da Author: Jungtaek Lim Date: 2018-09-04T23:10:47Z WIP nothing worked, just recording the progress commit be502485047283e203933a4d78e3b580a0c567df Author: Jungtaek Lim Date: 2018-09-06T04:36:11Z WIP not working yet... lots of implementations needed commit 7c60c0ad922ddacf025ad4762b85d06ab7cb258f Author: Jungtaek Lim Date: 2018-09-06T13:31:08Z WIP Finished implementing UpdatingSessionIterator commit 4e8c260a6e6b73b9bcd347ca242b8e77aedf8d1e Author: Jungtaek Lim Date: 2018-09-07T08:35:32Z WIP add verification on precondition "rows in iterator are sorted by key" commit 39069ded62dc5836b0b0f7c8ec7fb8ce869e5292 Author: Jungtaek Lim Date: 2018-09-08T04:36:46Z Rename SymmetricHashJoinStateManager to MultiValuesStateManager * This will be also used from session window state as well commit c2716340e008000e1fcc5e4d3fcf9befa419ff77 Author: Jungtaek Lim Date: 2018-09-08T04:41:37Z Move package of UpdatingSessionIterator commit df4cffd5fd1ea82be509f1cd97e5fc3a7ef8acb6 Author: Jungtaek Lim Date: 2018-09-10T05:52:28Z WIP add MergingSortWithMultiValuesStateIterator, now integrating with stateful operators (WIP...) commit 79e32b918c3db41c7d6c1c1d55276d3f696746d5 Author: Jungtaek Lim Date: 2018-09-13T06:54:37Z WIP the first version of working one! 
Still have lots of TODOs and FIXMEs to go commit fb7aa17488e5753c5460f383e1b0f4bedca6dee8 Author: Jungtaek Lim Date: 2018-09-13T08:13:45Z Add more explanations commit 9f41b9d6e7960031c52603bd1da9aeca747e1dfb Author: Jungtaek Lim Date: 2018-09-13T08:49:01Z Silly bugfix & block session window for batch query as of now We can enable it but there're lots of approaches on aggregations in batch side... * AggUtils.planAggregateWithoutDistinct * AggUtils.planAggregateWithOneDistinct * RewriteDistinctAggregates * AggregateInPandasExec So unless we are sure which things to support, just block them for now... commit 0a62b1f0c274859061c0f3ab2c63450052985ac7 Author: Jungtaek Lim Date: 2018-09-13T09:28:34Z More works: majorly split out updating session to individual physical node * we will leverage such node for batch case if we want commit acb5a0c42641041ca3adae2c9f2293b4dfa837cf Author: Jungtaek Lim Date: 2018-09-13T09:38:00Z Fix a silly bug and also add check for session window against batch query commit 1b6502c92231b7aaa9d0d6f620a5bcc624b862ec Author: Jungtaek Lim Date: 2018-09-13T11:30:15Z WIP Fixed eviction on update mode commit fec9a8ae5c1d421322738bd474fcb5508421f51a Author: Jungtaek Lim Date: 2018-09-13T12:48:07Z WIP found root reason of broken UT... fixed it commit c87e4eebcc53c81328d52e4d4ea270bcede8b26e Author: Jungtaek Lim Date: 2018-09-13T12:50:31Z WIP remove printing "explain" on UTs commit c0726d7447ce84440e46013d1cc392f1e397
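The merge step described in the PR summary — sort by grouping keys plus session, then fold overlapping sessions together while aggregating — can be sketched in plain Python (a conceptual model, not the patch's actual operator; the `gap` parameter is an assumed session timeout):

```python
def merge_sessions(events, gap):
    """Merge per-key events into session windows: an event extends the
    current session if it falls within `gap` of the session's end."""
    sessions = []  # (key, start, end, count), ordered by (key, start)
    for key, ts in sorted(events):
        if sessions and sessions[-1][0] == key and ts <= sessions[-1][2] + gap:
            k, start, end, count = sessions[-1]
            sessions[-1] = (k, start, max(end, ts), count + 1)  # extend session
        else:
            sessions.append((key, ts, ts, 1))  # open a new session
    return sessions

events = [("u1", 1), ("u1", 3), ("u1", 10), ("u2", 2)]
print(merge_sessions(events, gap=5))
# [('u1', 1, 3, 2), ('u1', 10, 10, 1), ('u2', 2, 2, 1)]
```

Here `count` stands in for an arbitrary aggregation over the rows merged into each session.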
[GitHub] spark issue #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests fail...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22480 +1 for adding the note
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22443 I see, @cloud-fan.
[GitHub] spark pull request #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchm...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22443
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22443 thanks, merging to master! @dongjoon-hyun Can you create an umbrella JIRA for updating all the benchmarks and take care of it? Thanks!
[GitHub] spark pull request #22462: [SPARK-25460][SS] DataSourceV2: SS sources do not...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22462#discussion_r219032068 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala --- @@ -143,15 +185,18 @@ class StreamingDataSourceV2Suite extends StreamTest { Trigger.ProcessingTime(1000), Trigger.Continuous(1000)) - private def testPositiveCase(readFormat: String, writeFormat: String, trigger: Trigger) = { --- End diff -- Yup
[GitHub] spark issue #21445: [SPARK-24404][SS] Increase currentEpoch when meet a Epoc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21445 Can one of the admins verify this patch?
[GitHub] spark issue #22481: Revert [SPARK-19355][SPARK-25352]
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22481 LGTM, thanks!
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21649
[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22227 long thread, are we all good with this?
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21649 merged to master, thx
[GitHub] spark pull request #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.n...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22475
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96313/ Test PASSed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22460 Merged build finished. Test PASSed.
[GitHub] spark issue #22460: DO NOT MERGE
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22460 **[Test build #96313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96313/testReport)** for PR 22460 at commit [`d930dc7`](https://github.com/apache/spark/commit/d930dc73a7c73d7ce6cab96025c30993af4ea8e7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22173: [SPARK-24355] Spark external shuffle server improvement ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22173 Merged build finished. Test PASSed.
[GitHub] spark issue #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSc...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22475 Thanks! Merged to master/2.4
[GitHub] spark issue #22173: [SPARK-24355] Spark external shuffle server improvement ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22173 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96307/ Test PASSed.
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030775 --- Diff: R/pkg/R/DataFrame.R --- @@ -244,11 +244,15 @@ setMethod("showDF", #' @note show(SparkDataFrame) since 1.4.0 setMethod("show", "SparkDataFrame", function(object) { -cols <- lapply(dtypes(object), function(l) { - paste(l, collapse = ":") -}) -s <- paste(cols, collapse = ", ") -cat(paste(class(object), "[", s, "]\n", sep = "")) +if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) { --- End diff -- also not sure if it's done for python, consider adding to the doc above (L229) how it behaves with eagerEval --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22173: [SPARK-24355] Spark external shuffle server improvement ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22173 **[Test build #96307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96307/testReport)** for PR 22173 at commit [`0348ec8`](https://github.com/apache/spark/commit/0348ec8d5570aab9d744043a3d6a88950f4aeb5c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030350 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) + +# Instead of displaying the SparkDataFrame class, displays the data returned +df + +##+---------+-------+ +##|eruptions|waiting| +##+---------+-------+ +##| 3.6| 79.0| +##| 1.8| 54.0| +##|3.333| 74.0| +##|2.283| 62.0| +##|4.533| 85.0| +##|2.883| 55.0| +##| 4.7| 88.0| +##| 3.6| 85.0| +##| 1.95| 51.0| +##| 4.35| 85.0| +##+---------+-------+ +##only showing top 10 rows + +{% endhighlight %} + + +Note that the `SparkSession` created by `sparkR` shell does not have eager execution enabled. You can stop the current session and start up a new session like above to enable. --- End diff -- actually I think the suggestion should be to set that in the `sparkR` command line as spark conf?
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030512 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,58 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) + } + + df <- suppressWarnings(createDataFrame(iris)) --- End diff -- use a different dataset that does not require `suppressWarnings` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030211 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) + +# Instead of displaying the SparkDataFrame class, displays the data returned --- End diff -- we could also start here by saying "similar to R data.frame`... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030277 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) + +# Instead of displaying the SparkDataFrame class, displays the data returned +df + +##+---------+-------+ +##|eruptions|waiting| +##+---------+-------+ +##| 3.6| 79.0| +##| 1.8| 54.0| +##|3.333| 74.0| +##|2.283| 62.0| +##|4.533| 85.0| +##|2.883| 55.0| +##| 4.7| 88.0| +##| 3.6| 85.0| +##| 1.95| 51.0| +##| 4.35| 85.0| +##+---------+-------+ +##only showing top 10 rows + +{% endhighlight %} + + +Note that the `SparkSession` created by `sparkR` shell does not have eager execution enabled. You can stop the current session and start up a new session like above to enable. --- End diff -- change to `Note that the `SparkSession` created by `sparkR` shell by default does not `
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219029847 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution --- End diff -- should be `` I think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030474 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,58 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { --- End diff -- I'm neutral, should these tests be in test_sparkSQL.R? it takes longer to run with many test files --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030085 --- Diff: docs/sparkr.md --- @@ -450,6 +450,42 @@ print(model.summaries) {% endhighlight %} +### Eager execution + +If the eager execution is enabled, the data will be returned to R client immediately when the `SparkDataFrame` is created. Eager execution can be enabled by setting the configuration property `spark.sql.repl.eagerEval.enabled` to `true` when the `SparkSession` is started up. + + +{% highlight r %} + +# Start up spark session with eager execution enabled +sparkR.session(master = "local[*]", sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + +df <- createDataFrame(faithful) --- End diff -- perhaps a more complete example - like `summarize(groupBy(df, df$waiting), count = n(df$waiting))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22455: [SPARK-24572][SPARKR] "eager execution" for R she...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22455#discussion_r219030537 --- Diff: R/pkg/tests/fulltests/test_eager_execution.R --- @@ -0,0 +1,58 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +library(testthat) + +context("Show SparkDataFrame when eager execution is enabled.") + +test_that("eager execution is not enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE) + } + + df <- suppressWarnings(createDataFrame(iris)) + expect_is(df, "SparkDataFrame") + expected <- "Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string" + expect_output(show(df), expected) + + # Stop Spark session + sparkR.session.stop() +}) + +test_that("eager execution is enabled", { + # Start Spark session without eager execution enabled + sparkSession <- if (windows_with_hadoop()) { +sparkR.session(master = sparkRTestMaster, + sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + } else { +sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, + 
sparkConfig = list(spark.sql.repl.eagerEval.enabled = "true")) + } + + df <- suppressWarnings(createDataFrame(iris)) --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22379 think maybe someone to review the SQL stuff more?
[GitHub] spark issue #22481: Revert [SPARK-19355][SPARK-25352]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22481 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3279/ Test PASSed.
[GitHub] spark issue #22481: Revert [SPARK-19355][SPARK-25352]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22481 **[Test build #96320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96320/testReport)** for PR 22481 at commit [`4532aaa`](https://github.com/apache/spark/commit/4532aaa2471c04c57f3b59bdcec26ad83627df68).
[GitHub] spark issue #22481: Revert [SPARK-19355][SPARK-25352]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22481 Merged build finished. Test PASSed.
[GitHub] spark pull request #22464: Revert [SPARK-19355][SPARK-25352]
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/22464
[GitHub] spark pull request #22481: Revert [SPARK-19355][SPARK-25352]
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/22481 Revert [SPARK-19355][SPARK-25352]

## What changes were proposed in this pull request?

This reverts a sequence of PRs, based on the discussion and comments at https://github.com/apache/spark/pull/16677#issuecomment-422650759: #22344 #22330 #22239 #16677

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 revert-SPARK-19355-1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22481.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22481

commit 7b81a95e9496ba953648880a70896085e2bfd043
Author: Liang-Chi Hsieh
Date: 2018-09-20T03:14:50Z
Revert "[SPARK-25352][SQL] Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold"
This reverts commit 2f422398b524eacc89ab58e423bb134ae3ca3941.

commit cea66899194f19812ac217cde2d8b6fe1fbe1328
Author: Liang-Chi Hsieh
Date: 2018-09-20T03:57:38Z
Revert "[SPARK-19355][SQL][FOLLOWUP][TEST] Properly recycle SparkSession on TakeOrderedAndProjectSuite finishes"
This reverts commit 3aa60282cc84d471ea32ef240ec84e5b6e3e231b.

commit 2dae33e5b897c0ec05f675ec565abee5f2c4ea34
Author: Liang-Chi Hsieh
Date: 2018-09-20T03:58:11Z
Revert "[SPARK-19355][SQL][FOLLOWUP] Remove the child.outputOrdering check in global limit"
This reverts commit 5c27b0d4f8d378bd7889d26fb358f478479b9996.

commit 4532aaa2471c04c57f3b59bdcec26ad83627df68
Author: Liang-Chi Hsieh
Date: 2018-09-20T04:00:46Z
Revert "[SPARK-19355][SQL] Use map output statistics to improve global limit's parallelism"
This reverts commit 4f175850985cfc4c64afb90d784bb292e81dc0b7.
[GitHub] spark issue #22481: Revert [SPARK-19355][SPARK-25352]
Github user viirya commented on the issue: https://github.com/apache/spark/pull/22481 cc @cloud-fan
[GitHub] spark issue #22480: [SPARK-25473][PYTHON][SS][TEST] ForeachWriter tests fail...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22480 **[Test build #96319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96319/testReport)** for PR 22480 at commit [`97e95af`](https://github.com/apache/spark/commit/97e95afeba368dd06f747665c41f96a50141305a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #22408: [SPARK-25417][SQL] ArrayContains function may ret...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22408#discussion_r219028470

--- Diff: docs/sql-programming-guide.md ---
@@ -1879,6 +1879,80 @@ working with timestamps in `pandas_udf`s to get the best performance, see

## Upgrading From Spark SQL 2.3 to 2.4

+ - In Spark version 2.3 and earlier, the second parameter to the `array_contains` function is implicitly promoted to the element type of the first, array-type parameter. This type promotion can be lossy and may cause `array_contains` to return a wrong result. This has been addressed in 2.4 by employing a safer type promotion mechanism. The resulting behavior changes are illustrated in the table below.

| Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
|---|---|---|---|
| SELECT array_contains(array(1), 1.34D); | true | false | In Spark 2.4, the left and right parameters are promoted to array(double) and double type respectively. |

+SELECT array_contains(array(1), 1.34);
--- End diff --

@cloud-fan OK.
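The type-promotion change discussed in this diff can be illustrated without Spark. The following plain-Python sketch (the function names `array_contains_v23` and `array_contains_v24` are hypothetical, chosen only to contrast the two behaviors) shows why casting the needle down to the array's element type, as Spark 2.3 did, is lossy, while widening both sides to the common type, as Spark 2.4 does, is not:

```python
def array_contains_v23(arr, needle):
    # Spark 2.3 style: the needle is cast down to the array's element type
    # (int here), which silently truncates 1.34 to 1 -- a lossy promotion.
    return int(needle) in arr


def array_contains_v24(arr, needle):
    # Spark 2.4 style: both sides are widened to the common type (double),
    # so no information is lost before the comparison.
    return float(needle) in [float(x) for x in arr]


print(array_contains_v23([1], 1.34))  # True  (wrong: 1.34 is not in the array)
print(array_contains_v24([1], 1.34))  # False (correct)
```

This is exactly the `SELECT array_contains(array(1), 1.34D)` row in the table: the 2.3 answer of `true` comes from comparing the truncated value `1` against the array element, while 2.4 compares `1.34` against `1.0` and correctly returns `false`.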