[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19492 Kindly ping @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19978: [SPARK-22784][CORE][WIP] Configure reading buffer size i...
Github user MikhailErofeev commented on the issue: https://github.com/apache/spark/pull/19978 @srowen, yes, the processing is no longer IO-bound after backporting SPARK-20923
[GitHub] spark pull request #19978: [SPARK-22784][CORE][WIP] Configure reading buffer...
Github user MikhailErofeev closed the pull request at: https://github.com/apache/spark/pull/19978
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r163472676

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -1313,6 +1331,9 @@ object CodeGenerator extends Logging {
   // This is the value of HugeMethodLimit in the OpenJDK JVM settings
   val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000

+  // The max valid length of method parameters in JVM.
+  val MAX_JVM_METHOD_PARAMS_LENGTH = 255
--- End diff --

Make it `final`? I think we can add `final` to all the constants here.
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20376 retest this please.
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20377 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86562/ Test PASSed.
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20377 Merged build finished. Test PASSed.
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20376 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/179/ Test PASSed.
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20376 Merged build finished. Test PASSed.
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20377

**[Test build #86562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86562/testReport)** for PR 20377 at commit [`0574ec7`](https://github.com/apache/spark/commit/0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20376

**[Test build #86561 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86561/testReport)** for PR 20376 at commit [`dfd828b`](https://github.com/apache/spark/commit/dfd828b718bcda85d2b822c8a96ae593c14850e8).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20376 **[Test build #86574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86574/testReport)** for PR 20376 at commit [`dfd828b`](https://github.com/apache/spark/commit/dfd828b718bcda85d2b822c8a96ae593c14850e8).
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20376 Merged build finished. Test FAILed.
[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20376 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86561/ Test FAILed.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/178/ Test PASSed.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Merged build finished. Test PASSed.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/17702 retest this please
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17702 **[Test build #86573 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86573/testReport)** for PR 17702 at commit [`dc373ae`](https://github.com/apache/spark/commit/dc373ae68040b92e7b67c99ce0c793c80b5ab507).
[GitHub] spark issue #20380: [SPARK-20906][SPARKR] Add API doc example for Constraine...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20380 **[Test build #86572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86572/testReport)** for PR 20380 at commit [`f56b00d`](https://github.com/apache/spark/commit/f56b00d777eb27ff675bf4e127c83e004a9e34ff).
[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20379 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/177/ Test PASSed.
[GitHub] spark issue #20380: [SPARK-20906][SPARKR] Add API doc example for Constraine...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20380 Merged build finished. Test PASSed.
[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20379 Merged build finished. Test PASSed.
[GitHub] spark issue #20380: [SPARK-20906][SPARKR] Add API doc example for Constraine...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20380 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/176/ Test PASSed.
[GitHub] spark pull request #20380: [SPARK-20906][SPARKR] Add API doc example for Con...
GitHub user felixcheung opened a pull request: https://github.com/apache/spark/pull/20380

[SPARK-20906][SPARKR] Add API doc example for Constrained Logistic Regression

## What changes were proposed in this pull request?

doc only changes

## How was this patch tested?

manual

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/spark rclrdoc

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20380.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20380

commit f56b00d777eb27ff675bf4e127c83e004a9e34ff
Author: Felix Cheung
Date: 2018-01-24T07:14:20Z

    api doc
[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20379 **[Test build #86571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86571/testReport)** for PR 20379 at commit [`a66b5e0`](https://github.com/apache/spark/commit/a66b5e0c4b81444974f02c7154111b47a1a5137c).
[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20379 cc @HyukjinKwon
[GitHub] spark pull request #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/20379

[SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-parameter UDFs from aggregate

## What changes were proposed in this pull request?

We extract Python UDFs in a logical aggregate which depend on an aggregate expression or grouping key in the ExtractPythonUDFFromAggregate rule. But Python UDFs which don't depend on the above expressions should also be extracted, to avoid the issue reported in the JIRA.

A small code snippet to reproduce the issue looks like:

```python
import pyspark.sql.functions as f
df = spark.createDataFrame([(1, 2), (3, 4)])
f_udf = f.udf(lambda: str("const_str"))
df2 = df.distinct().withColumn("a", f_udf())
df2.show()
```

The error exception is raised as:

```
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#50
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
```

This exception is raised because `HashAggregateExec` tries to bind the aliased Python UDF expression (e.g., `pythonUDF0#50 AS a#44`) to the grouping key.

## How was this patch tested?

Added test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23177-backport-2.3

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20379.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20379

commit a66b5e0c4b81444974f02c7154111b47a1a5137c
Author: Liang-Chi Hsieh
Date: 2018-01-24T06:50:53Z

    Extract parameter-less UDFs from aggregate.
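The core idea of the fix described above — pull any UDF-bearing output expression out of the aggregate and evaluate it in a projection on top of it — can be sketched in plain Python. All names here (`extract_udfs`, the `pythonUDF<i>` placeholder strings) are toy illustrations, not Spark's actual API:

```python
def extract_udfs(agg_outputs, is_udf):
    """Toy model of extracting Python UDFs from an aggregate's output list.

    Each UDF-bearing expression is replaced by a placeholder attribute in the
    aggregate, and the UDF itself is evaluated in a projection on top of the
    aggregate -- so the aggregate never has to bind the UDF to a grouping key.
    """
    new_agg_outputs, projection = [], []
    for i, expr in enumerate(agg_outputs):
        if is_udf(expr):
            placeholder = "pythonUDF{}".format(i)
            new_agg_outputs.append(placeholder)       # aggregate emits a plain attribute
            projection.append((placeholder, expr))    # UDF runs after the aggregate
        else:
            new_agg_outputs.append(expr)
            projection.append((expr, expr))           # pass-through column
    return new_agg_outputs, projection

# Zero-parameter UDF "f_udf()" depends on nothing, yet still gets extracted:
extract_udfs(["_1", "f_udf()"], is_udf=lambda e: e.endswith("()"))
```

Note that in this model the extraction criterion is simply "contains a UDF", which is the point of the patch: even a zero-parameter UDF that references no aggregate expression or grouping key must be lifted out.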
[GitHub] spark pull request #20338: [SPARK-23174][BUILD][PYTHON] python code style ch...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20338#discussion_r163467168

--- Diff: dev/tox.ini ---
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-[pep8]
-ignore=E402,E731,E241,W503,E226
+[pycodestyle]
+ignore=E402,E731,E241,W503,E226,E722,E741,E305
--- End diff --

Yeah, there are actually many instances of it. I manually ran it against this PR. I will maybe take this one and open a PR separately after double-checking each rule. I haven't taken a careful look yet.
[GitHub] spark pull request #20338: [SPARK-23174][BUILD][PYTHON] python code style ch...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20338#discussion_r163465854

--- Diff: dev/tox.ini ---
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-[pep8]
-ignore=E402,E731,E241,W503,E226
+[pycodestyle]
+ignore=E402,E731,E241,W503,E226,E722,E741,E305
--- End diff --

Should we fix the code that violates E722 ("do not use bare except"), E741 ("ambiguous variable name 'l'"), and E305 ("expected 2 blank lines after class or function definition")? Or is that another PR/JIRA?
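For reference, a minimal illustration of the E722 rule being debated above (the other codes are analogous style issues); the helper names are made up for the example:

```python
# E722: a bare `except:` catches *everything*, including SystemExit and
# KeyboardInterrupt, which is why pycodestyle flags it.
def parse_int_bad(s):
    try:
        return int(s)
    except:  # noqa: E722 -- this is the pattern pycodestyle complains about
        return None

# The fix is to name the exception class you actually expect.
def parse_int_good(s):
    try:
        return int(s)
    except ValueError:
        return None
```

Adding E722 to the `ignore=` list in dev/tox.ini silences the checker for existing occurrences; fixing the occurrences themselves (as suggested here) is the stricter alternative.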
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/20372

Please fix the Scala style checks --

```
Running Scala style checks
Scalastyle checks failed at following occurrences:
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:459: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:463: File line length exceeds 100 characters
[error] Total time: 14 s, completed Jan 23, 2018 10:44:36 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received return code 1
```

and verify locally with `./dev/lint-scala`
[GitHub] spark pull request #20367: [SPARK-23166][ML] Add maxDF Parameter to CountVec...
Github user ymazari commented on a diff in the pull request: https://github.com/apache/spark/pull/20367#discussion_r163465302

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---
@@ -119,6 +119,41 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
     }
   }

+  test("CountVectorizer maxDF") {
+    val df = Seq(
+      (0, split("a b c d"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+      (1, split("a b c"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0)))),
+      (2, split("a b"), Vectors.sparse(3, Seq((0, 1.0)))),
+      (3, split("a"), Vectors.sparse(3, Seq()))
+    ).toDF("id", "words", "expected")
+
+    // maxDF: ignore terms with count more than 3
+    val cvModel = new CountVectorizer()
+      .setInputCol("words")
+      .setOutputCol("features")
+      .setMaxDF(3)
+      .fit(df)
+    assert(cvModel.vocabulary === Array("b", "c", "d"))
+
+    cvModel.transform(df).select("features", "expected").collect().foreach {
+      case Row(features: Vector, expected: Vector) =>
+        assert(features ~== expected absTol 1e-14)
+    }
+
+    // maxDF: ignore terms with freq > 0.75
+    val cvModel2 = new CountVectorizer()
+      .setInputCol("words")
+      .setOutputCol("features")
+      .setMaxDF(0.75)
+      .fit(df)
+    assert(cvModel2.vocabulary === Array("b", "c", "d"))
+
+    cvModel2.transform(df).select("features", "expected").collect().foreach {
+      case Row(features: Vector, expected: Vector) =>
+        assert(features ~== expected absTol 1e-14)
--- End diff --

Done.
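The semantics this test exercises — drop terms whose document frequency exceeds `maxDF`, where a value of at least 1.0 is read as an absolute document count and a smaller value as a fraction of the corpus — can be sketched in plain Python. This is a toy model of the parameter's behavior, not the Spark implementation:

```python
from collections import Counter

def build_vocab(docs, max_df):
    """Keep terms whose document frequency is <= max_df.

    Assumption mirrored from the test above: max_df >= 1.0 is an absolute
    document count, max_df < 1.0 is a fraction of the number of documents.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    limit = max_df if max_df >= 1.0 else max_df * len(docs)
    return sorted(term for term, count in df.items() if count <= limit)

docs = [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "b"], ["a"]]
build_vocab(docs, 3)     # "a" appears in all 4 documents, so it is dropped
build_vocab(docs, 0.75)  # 0.75 * 4 docs = 3, so the same vocabulary results
```

Both calls yield the vocabulary `["b", "c", "d"]`, matching the `cvModel.vocabulary === Array("b", "c", "d")` assertions in the diff.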
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #86570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86570/testReport)** for PR 18931 at commit [`79d0106`](https://github.com/apache/spark/commit/79d010614845a2663353c5e84ff46e2784e3f7b2).
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Merged build finished. Test PASSed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20338 **[Test build #86569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86569/testReport)** for PR 20338 at commit [`34a8590`](https://github.com/apache/spark/commit/34a8590b7fa205be3d2b3f4913e252ab0e96d091).
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/175/ Test PASSed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20338 Merged build finished. Test PASSed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20338 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/174/ Test PASSed.
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20372 Merged build finished. Test FAILed.
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20372

**[Test build #86567 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86567/testReport)** for PR 20372 at commit [`ef04de9`](https://github.com/apache/spark/commit/ef04de9766584b0a8ab13f290c9850e44570).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20372 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86567/ Test FAILed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20338 retest this please
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20372 **[Test build #86567 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86567/testReport)** for PR 20372 at commit [`ef04de9`](https://github.com/apache/spark/commit/ef04de9766584b0a8ab13f290c9850e44570).
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #86568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86568/testReport)** for PR 18931 at commit [`9f0d1da`](https://github.com/apache/spark/commit/9f0d1da8e6f13e61411755d01e74e86cc399d2e8).
[GitHub] spark pull request #20372: Improved block merging logic for partitions
Github user ash211 commented on a diff in the pull request: https://github.com/apache/spark/pull/20372#discussion_r163464207

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -445,16 +445,25 @@ case class FileSourceScanExec(
       currentSize = 0
     }

-    // Assign files to partitions using "Next Fit Decreasing"
-    splitFiles.foreach { file =>
-      if (currentSize + file.length > maxSplitBytes) {
+    def addFile(file: PartitionedFile): Unit = {
+      currentFiles += file
+      currentSize += file.length + openCostInBytes
+    }
+
+    var frontIndex = 0
+    var backIndex = splitFiles.length - 1
+
+    while (frontIndex <= backIndex) {
+      while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(frontIndex))
+        frontIndex += 1
+      }
+      while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(backIndex))
+        backIndex -= 1
+      }
       closePartition()
-      }
-      // Add the given file to the current partition.
-      currentSize += file.length + openCostInBytes
--- End diff --

Saw you added a commit to handle the large non-splittable files case -- can you please add a test for that also? Want to make sure this while loop never becomes an infinite loop!
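The front/back packing loop in this diff can be sketched in plain Python. This is a simplified model (file sizes as integers, no splitting), and the forced-progress branch for a single oversized file is my addition to reflect the reviewer's infinite-loop concern, not code from the PR:

```python
def pack_partitions(sizes, max_split_bytes, open_cost=0):
    """Two-pointer packing over sizes sorted descending: fill each partition
    with large files from the front, then top it up with small files from
    the back, and close it when nothing more fits."""
    sizes = sorted(sizes, reverse=True)
    partitions, current, current_size = [], [], 0

    def close_partition():
        nonlocal current, current_size
        if current:
            partitions.append(current)
        current, current_size = [], 0

    front, back = 0, len(sizes) - 1
    while front <= back:
        while front <= back and current_size + sizes[front] <= max_split_bytes:
            current_size += sizes[front] + open_cost
            current.append(sizes[front])
            front += 1
        while back > front and current_size + sizes[back] <= max_split_bytes:
            current_size += sizes[back] + open_cost
            current.append(sizes[back])
            back -= 1
        if not current:
            # a file larger than max_split_bytes: take it alone so the loop
            # always makes progress (the infinite-loop case raised above)
            current.append(sizes[front])
            front += 1
        close_partition()
    return partitions

pack_partitions([10, 7, 3, 2, 1], max_split_bytes=10)
```

With `max_split_bytes=10` the sketch packs `[10]`, then `[7, 3]`, then `[2, 1]`, wasting less space than a pure next-fit pass over the descending list would.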
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Merged build finished. Test PASSed.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/173/ Test PASSed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20338 Merged build finished. Test FAILed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20338 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86560/ Test FAILed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20338 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86559/ Test FAILed.
[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20368 Thanks! revert it from master/2.3
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20338

**[Test build #86560 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86560/testReport)** for PR 20338 at commit [`34a8590`](https://github.com/apache/spark/commit/34a8590b7fa205be3d2b3f4913e252ab0e96d091).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20338 Merged build finished. Test FAILed.
[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20338

**[Test build #86559 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86559/testReport)** for PR 20338 at commit [`d10fbb4`](https://github.com/apache/spark/commit/d10fbb49cc6b062bfa7d463f0bae098c77b9a890).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20372: Improved block merging logic for partitions
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/20372 Jenkins, this is ok to test
[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/20037#discussion_r163463718

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -1271,7 +1271,7 @@ private[spark] object SparkSubmitUtils {
     // retrieve all resolved dependencies
     ivy.retrieve(rr.getModuleDescriptor.getModuleRevisionId,
       packagesDirectory.getAbsolutePath + File.separator +
-        "[organization]_[artifact]-[revision].[ext]",
+        "[organization]_[artifact]-[revision](-[classifier]).[ext]",
--- End diff --

I tried it today. Somehow, I only got the test jar downloaded. Have you guys seen this issue?
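The Ivy retrieve-pattern semantics this change relies on — a parenthesized group is emitted only when every token inside it resolves, and is dropped entirely otherwise — can be sketched in Python. `resolve_pattern` is a hypothetical helper written for illustration, not Ivy's actual API:

```python
import re

def resolve_pattern(pattern, tokens):
    """Toy model of Ivy retrieve-pattern expansion.

    '(...)' groups vanish when any [token] inside them is missing or empty
    (e.g. an artifact with no classifier); remaining [token]s are substituted.
    """
    def expand_optional(m):
        inner = m.group(1)
        names = re.findall(r"\[(\w+)\]", inner)
        # keep the group only if every token inside it has a value
        return inner if all(tokens.get(n) for n in names) else ""

    pattern = re.sub(r"\(([^()]*)\)", expand_optional, pattern)
    return re.sub(r"\[(\w+)\]", lambda m: tokens[m.group(1)], pattern)

pat = "[organization]_[artifact]-[revision](-[classifier]).[ext]"
base = {"organization": "org.example", "artifact": "lib", "revision": "1.0", "ext": "jar"}
resolve_pattern(pat, base)                                   # no classifier: group drops out
resolve_pattern(pat, dict(base, classifier="tests"))         # classifier present: group kept
```

Without the `(-[classifier])` group, a main jar and its `tests` classifier jar would expand to the same filename and overwrite each other in the retrieve directory, which motivates the one-token change in the diff.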
[GitHub] spark issue #20352: [SPARK-21727][R] Allow multi-element atomic vector as co...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20352 merged to master/2.3. we could revisit migration guide if necessary. thanks!
[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20368 Oh, thank you so much for the fast recovery, @gatorsmile.
[GitHub] spark pull request #20352: [SPARK-21727][R] Allow multi-element atomic vecto...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20352
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #86566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86566/testReport)** for PR 18931 at commit [`2f2d1fd`](https://github.com/apache/spark/commit/2f2d1fd6a5844465e3d75b61c0e8b19fa93ada3e).
[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20368 I am reverting this PR.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Merged build finished. Test PASSed.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/172/ Test PASSed.
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r163462731 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,13 +149,100 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. +val requireAllOutput = output.forall(parent.usedInputs.contains(_)) +val consumeFunc = + if (row == null && outputVars.nonEmpty && requireAllOutput && isValidParamLength(ctx)) { +constructDoConsumeFunction(ctx, inputVars) + } else { +parent.doConsume(ctx, inputVars, rowVar) + } s""" |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")} |$evaluated - |${parent.doConsume(ctx, inputVars, rowVar)} + |$consumeFunc """.stripMargin } + /** + * In Java, a method descriptor is valid only if it represents method parameters with a total + * length of 255 or less. `this` contributes one unit and a parameter of type long or double + * contributes two units. Besides, for nullable parameters, we also need to pass a boolean + * for the null status. + */ + private def isValidParamLength(ctx: CodegenContext): Boolean = { --- End diff -- Yes. Put it in `CodegenContext`. 
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r163462688 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,13 +149,100 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. +val requireAllOutput = output.forall(parent.usedInputs.contains(_)) +val consumeFunc = + if (row == null && outputVars.nonEmpty && requireAllOutput && isValidParamLength(ctx)) { +constructDoConsumeFunction(ctx, inputVars) + } else { +parent.doConsume(ctx, inputVars, rowVar) + } s""" |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")} |$evaluated - |${parent.doConsume(ctx, inputVars, rowVar)} + |$consumeFunc """.stripMargin } + /** + * In Java, a method descriptor is valid only if it represents method parameters with a total + * length of 255 or less. `this` contributes one unit and a parameter of type long or double + * contributes two units. Besides, for nullable parameters, we also need to pass a boolean + * for the null status. + */ + private def isValidParamLength(ctx: CodegenContext): Boolean = { +// Start value is 1 for `this`. 
+output.foldLeft(1) { case (curLength, attr) => + ctx.javaType(attr.dataType) match { +case (ctx.JAVA_LONG | ctx.JAVA_DOUBLE) if !attr.nullable => curLength + 2 +case ctx.JAVA_LONG | ctx.JAVA_DOUBLE => curLength + 3 +case _ if !attr.nullable => curLength + 1 +case _ => curLength + 2 + } +} <= 255 + } + + /** + * To prevent concatenated function growing too long to be optimized by JIT. We can separate the + * parent's `doConsume` codes of a `CodegenSupport` operator into a function to call. + */ + private def constructDoConsumeFunction( + ctx: CodegenContext, + inputVars: Seq[ExprCode]): String = { +val (callingParams, arguList, inputVarsInFunc) = + constructConsumeParameters(ctx, output, inputVars) +val rowVar = ExprCode("", "false", "unsafeRow") +val doConsume = ctx.freshName("doConsume") +val doConsumeFuncName = ctx.addNewFunction(doConsume, + s""" + | private void $doConsume($arguList) throws java.io.IOException { + | ${parent.doConsume(ctx, inputVarsInFunc, rowVar)} + | } + """.stripMargin) + +s""" + | $doConsumeFuncName($callingParams); + """.stripMargin + } + + /** + * Returns source code for calling consume function and the argument list of the consume function + * and also the `ExprCode` for the argument list. + */ + private def constructConsumeParameters( + ctx: CodegenContext, + attributes: Seq[Attribute], + variables: Seq[ExprCode]): (String, String, Seq[ExprCode]) = { +val params = variables.zipWithIndex.map { case (ev, i) => + val arguName = ctx.freshName(s"expr_$i") + val arguType = ctx.javaType(attributes(i).dataType) + + val (callingParam, funcParams, arguIsNull) = if (!attributes(i).nullable) { +// When the argument is not nullable, we don't need to pass in `isNull` param for it and +// simply give a `false`. 
+val arguIsNull = "false" +(ev.value, s"$arguType $arguName", arguIsNull) + } else { +val arguIsNull = ctx.freshName(s"exprIsNull_$i") +(ev.value + ", " + ev.isNull, s"$arguType $arguName, boolean $arguIsNull", arguIsNull) + } + (callingParam, funcParams, ExprCode("", arguIsNull, arguName)) +}.unzip3 +(params._1.mkString(", "), + params._2.mkString(", "), + params._3) --- End diff -- Done.
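The 255-unit rule discussed in this thread comes from the JVM method descriptor format: a `long` or `double` parameter occupies two units, every other type one, `this` takes one unit for an instance method, and the generated code here also passes an extra `boolean` null flag for each nullable column. A minimal sketch of that bookkeeping (names are illustrative, not Spark's actual API):

```java
import java.util.List;

public class ParamLengthSketch {
    static final int MAX_JVM_METHOD_PARAMS_LENGTH = 255;

    // Units one parameter contributes to the method descriptor.
    static int units(String javaType, boolean nullable) {
        int u = (javaType.equals("long") || javaType.equals("double")) ? 2 : 1;
        return nullable ? u + 1 : u; // nullable columns also pass a boolean isNull flag
    }

    static boolean isValidParamLength(List<String> types, List<Boolean> nullables) {
        int total = 1; // `this` counts as one unit for an instance method
        for (int i = 0; i < types.size(); i++) {
            total += units(types.get(i), nullables.get(i));
        }
        return total <= MAX_JVM_METHOD_PARAMS_LENGTH;
    }
}
```

So, for example, 127 nullable `long` columns already overflow the descriptor (1 + 127 × 3 = 382 units), which is why the generated consume function falls back to inline code in that case.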
[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18931#discussion_r163462698 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -149,13 +149,100 @@ trait CodegenSupport extends SparkPlan { ctx.freshNamePrefix = parent.variablePrefix val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs) + +// Under certain conditions, we can put the logic to consume the rows of this operator into +// another function. So we can prevent a generated function too long to be optimized by JIT. +// The conditions: +// 1. The parent uses all variables in output. we can't defer variable evaluation when consume +//in another function. +// 2. The output variables are not empty. If it's empty, we don't bother to do that. +// 3. We don't use row variable. The construction of row uses deferred variable evaluation. We +//can't do it. +// 4. The number of output variables must less than maximum number of parameters in Java method +//declaration. --- End diff -- Added a config for it so we can turn it off.
[GitHub] spark issue #18756: [SPARK-21548][SQL] "Support insert into serial columns o...
Github user lvdongr commented on the issue: https://github.com/apache/spark/pull/18756 Excuse me, has the concept of a default value been introduced to the schema in the master branch? @gatorsmile thank you.
[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...
Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20224 BTW, inspired by @cloud-fan 's comment, here's an example of the codegen stage IDs when scalar subqueries are involved: ```scala val sub = "(select sum(id) from range(5))" val df = spark.sql(s"select $sub as a, $sub as b") df.explain(true) ``` would give: ``` == Parsed Logical Plan == 'Project [scalar-subquery#0 [] AS a#1, scalar-subquery#2 [] AS b#3] : :- 'Project [unresolvedalias('sum('id), None)] : : +- 'UnresolvedTableValuedFunction range, [5] : +- 'Project [unresolvedalias('sum('id), None)] : +- 'UnresolvedTableValuedFunction range, [5] +- OneRowRelation == Analyzed Logical Plan == a: bigint, b: bigint Project [scalar-subquery#0 [] AS a#1L, scalar-subquery#2 [] AS b#3L] : :- Aggregate [sum(id#14L) AS sum(id)#16L] : : +- Range (0, 5, step=1, splits=None) : +- Aggregate [sum(id#17L) AS sum(id)#19L] : +- Range (0, 5, step=1, splits=None) +- OneRowRelation == Optimized Logical Plan == Project [scalar-subquery#0 [] AS a#1L, scalar-subquery#2 [] AS b#3L] : :- Aggregate [sum(id#14L) AS sum(id)#16L] : : +- Range (0, 5, step=1, splits=None) : +- Aggregate [sum(id#17L) AS sum(id)#19L] : +- Range (0, 5, step=1, splits=None) +- OneRowRelation == Physical Plan == *(1) Project [Subquery subquery0 AS a#1L, Subquery subquery2 AS b#3L] : :- Subquery subquery0 : : +- *(2) HashAggregate(keys=[], functions=[sum(id#14L)], output=[sum(id)#16L]) : : +- Exchange SinglePartition : :+- *(1) HashAggregate(keys=[], functions=[partial_sum(id#14L)], output=[sum#21L]) : : +- *(1) Range (0, 5, step=1, splits=8) : +- Subquery subquery2 : +- *(2) HashAggregate(keys=[], functions=[sum(id#17L)], output=[sum(id)#19L]) :+- Exchange SinglePartition : +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#17L)], output=[sum#23L]) : +- *(1) Range (0, 5, step=1, splits=8) +- Scan OneRowRelation[] ``` The reason why the IDs look a bit "odd" (that there are three separate codegen stages with ID 1) is because the 
main "spine" query and each individual subquery are "planned" separately, so they each run `CollapseCodegenStages` separately, counting up from 1 afresh. I would consider this behavior acceptable, but I wonder what others would think in this case. If this behavior for subqueries is not acceptable, I'll have to find alternative places to put the initialization and reset of the thread-local ID counter.
[GitHub] spark pull request #19285: [SPARK-22068][CORE]Reduce the duplicate code betw...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19285#discussion_r163462053 --- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala --- @@ -233,17 +235,13 @@ private[spark] class MemoryStore( } if (keepUnrolling) { - // We successfully unrolled the entirety of this block - val arrayValues = vector.toArray - vector = null - val entry = -new DeserializedMemoryEntry[T](arrayValues, SizeEstimator.estimate(arrayValues), classTag) - val size = entry.size + // We need more precise value + val size = valuesHolder.esitimatedSize(false) --- End diff -- I changed the code back to the original. For `DeserializedValuesHolder`, we could `buildEntry` and get the `size` from `MemorySize`. But for `SerializedValuesHolder`, this approach doesn't work correctly, because we need to call `bbos.toChunkedByteBuffer` to get the `MemoryEntry` object, and if the reserved memory is not enough to transfer the unroll memory to storage memory, the unroll fails and we need to call `bbos.toChunkedByteBuffer` again ([L802](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L802); this should be intentional and is related to #15043). So the problem is that we would call `bbos.toChunkedByteBuffer` twice, which can't be done.
[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/20224#discussion_r163461361 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala --- @@ -365,6 +388,26 @@ case class WholeStageCodegenExec(child: SparkPlan) extends UnaryExecNode with Co "pipelineTime" -> SQLMetrics.createTimingMetric(sparkContext, WholeStageCodegenExec.PIPELINE_DURATION_METRIC)) + /** + * ID for codegen stages within a query plan. + * It does not affect equality, nor does it participate in destructuring pattern matching. + * + * Within a query, a codegen stage in a plan starts counting from 1, in "insertion order". + * WholeStageCodegenExec operators are inserted into a plan in depth-first post-order. + * See CollapseCodegenStages.insertWholeStageCodegen for the definition of insertion order. + * + * 0 is reserved as a special ID value to indicate a temporary WholeStageCodegenExec object + * is created, e.g. for special fallback handling when an existing WholeStageCodegenExec + * failed to generate/compile code. + */ + val codegenStageId = WholeStageCodegenId.getNextStageId() --- End diff -- Thanks for your comment, @cloud-fan ! That's a nice catch that I hadn't really thought about. All the examples that I've run with are ones that wouldn't trigger changes to the plan after `CollapseCodegenStages`, which means for those examples `ReuseExchange` / `ReuseSubqueries` wouldn't have triggered. Yes, these two rules could potentially change the physical plan, which means when we `transformUp` in those rules (and any future rule after `CollapseCodegenStages`) it'd create new `WholeStageCodegenExec` objects outside of `CollapseCodegenStages`, and with my current implementation that'll result in the WSC copies having a codegen stage ID of 0. 
One way to work around this is to move `CollapseCodegenStages` to always be the last rule in `org.apache.spark.sql.execution.QueryExecution#preparations`, so that we're sure there's no other transformation on the physical plan that could change the structure of the plan, except for fallback handling that could happen in a couple of `doExecute()`s -- those exception cases are expected and to me they are acceptable. If we go down that route, I'll probably have to tweak `CollapseCodegenStages` a little bit so that it can cope with the physical query plan potentially becoming a DAG instead of a tree, as the `ReuseExchange` / `ReuseSubqueries` rules may do that kind of transformation. This tweak is easy to implement and low risk: simply bailing out of transforming a subtree when it sees a `WholeStageCodegenExec` already inserted into the plan would suffice. Let me actually update the PR with this tweak and see what happens in tests.
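The thread-local stage-ID counter being discussed can be pictured as follows: reset once per query planning pass so stage IDs count up from 1, and isolated per thread so concurrent planning sessions don't interfere, with 0 reserved for temporary fallback objects. A hedged sketch, not Spark's actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CodegenStageCounterSketch {
    // One counter per planning thread; 0 is reserved for temporary/fallback objects.
    private static final ThreadLocal<AtomicInteger> counter =
        ThreadLocal.withInitial(() -> new AtomicInteger(0));

    static void reset() { counter.get().set(0); }      // called before planning a query
    static int next() { return counter.get().incrementAndGet(); }
}
```

Using `ThreadLocal` rather than a global counter matches the observed behavior above: the main query and each subquery planned on the same thread share one sequence, while independent sessions start fresh.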
[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19285 **[Test build #86565 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86565/testReport)** for PR 19285 at commit [`f392217`](https://github.com/apache/spark/commit/f392217e77f3b7f29f3f0728c29a2e041adb0c0e).
[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20355 Please fix the PR title.
[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20368 As a result, this seems to block SparkPullRequestBuilder, too. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/86557/
[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20368 Hi, All. This seems to break both `master` and `branch-2.3`. It might be correlated to the sister PR, [SPARK-23192](https://github.com/apache/spark/commit/613c290336e382664c24319f66774b1f65a3). Could you take a look? - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4072/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4124/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4402/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4454/
[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20355 **[Test build #86564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86564/testReport)** for PR 20355 at commit [`51b0db5`](https://github.com/apache/spark/commit/51b0db5272858995a3b3e389a851f4e4aa04c651).
[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20355 Merged build finished. Test PASSed.
[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20355 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/171/ Test PASSed.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Merged build finished. Test FAILed.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17702 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86558/ Test FAILed.
[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17702 **[Test build #86558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86558/testReport)** for PR 17702 at commit [`dc373ae`](https://github.com/apache/spark/commit/dc373ae68040b92e7b67c99ce0c793c80b5ab507). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20169: [SPARK-17088][hive] Fix 'sharesHadoopClasses' option whe...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20169 To be clear, I personally do usually cc the related people. Yup, if I were fixing this code, I would have cc'ed you, @gatorsmile, because I agree that it's good to do. Sure, more eyes are better, and I am personally trying to support you and other committers in reviewing and landing changes. I left a comment only because it sounded like cc'ing a few specific committers was being made a requirement, which is not quite what I had in mind.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20378 Merged build finished. Test FAILed.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20378 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86563/ Test FAILed.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20378 **[Test build #86563 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86563/testReport)** for PR 20378 at commit [`e69827e`](https://github.com/apache/spark/commit/e69827e17349196382b24ef65b843d0fbc76cc76). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20378 **[Test build #86563 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86563/testReport)** for PR 20378 at commit [`e69827e`](https://github.com/apache/spark/commit/e69827e17349196382b24ef65b843d0fbc76cc76).
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20378 Merged build finished. Test PASSed.
[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20378 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/170/ Test PASSed.
[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20224#discussion_r163456017 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala --- @@ -228,4 +229,21 @@ class WholeStageCodegenSuite extends QueryTest with SharedSQLContext { } } } + + test("including codegen stage ID in generated class name should not regress codegen caching") { +import testImplicits._ + +withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> "true") { + val bytecodeSizeHisto = CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE + spark.range(3).select('id + 2).collect + val after1 = bytecodeSizeHisto.getCount + spark.range(3).select('id + 2).collect + val after2 = bytecodeSizeHisto.getCount // same query shape as above, deliberately + assert(after1 == after2, "the same query run twice should hit the codegen cache") + + spark.range(5).select('id * 2).collect + val after3 = bytecodeSizeHisto.getCount + assert(after3 >= after2, "a different query can result in codegen cache miss, that's okay") --- End diff -- makes sense
[GitHub] spark pull request #20378: [SPARK-11222][Build][Python] Python document styl...
GitHub user rekhajoshm opened a pull request: https://github.com/apache/spark/pull/20378 [SPARK-11222][Build][Python] Python document style checker added ## What changes were proposed in this pull request? Using pydocstyle for python document style checker https://github.com/PyCQA/pycodestyle/issues/723 ## How was this patch tested? ./dev/run-tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/rekhajoshm/spark SPARK-11222-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20378.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20378 commit e3677c9fa9697e0d34f9df52442085a6a481c9e9 Author: Rekha Joshi Date: 2015-05-05T23:10:08Z Merge pull request #1 from apache/master Pulling functionality from apache spark commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75 Author: Rekha Joshi Date: 2015-05-08T21:49:09Z Merge pull request #2 from apache/master pull latest from apache spark commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c Author: Rekha Joshi Date: 2015-06-22T00:08:08Z Merge pull request #3 from apache/master Pulling functionality from apache spark commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3 Author: Rekha Joshi Date: 2015-09-17T01:03:09Z Merge pull request #4 from apache/master Pulling functionality from apache spark commit b123c601e459d1ad17511fd91dd304032154882a Author: Rekha Joshi Date: 2015-11-25T18:50:32Z Merge pull request #5 from apache/master pull request from apache/master commit c73c32aadd6066e631956923725a48d98a18777e Author: Rekha Joshi Date: 2016-03-18T19:13:51Z Merge pull request #6 from apache/master pull latest from apache spark commit 7dbf7320057978526635bed09dabc8cf8657a28a Author: Rekha Joshi Date: 2016-04-05T20:26:40Z Merge pull request #8 from apache/master pull latest from apache spark commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1 Author: Rekha Joshi
Date: 2017-05-01T23:00:30Z Merge pull request #9 from apache/master Pull apache spark commit 63d99b3ce5f222d7126133170a373591f0ac67dd Author: Rekha Joshi Date: 2017-09-30T22:26:44Z Merge pull request #10 from apache/master pull latest apache spark commit a7fc787466b71784ff86f9694f617db0f1042da8 Author: Rekha Joshi Date: 2018-01-21T00:17:58Z Merge pull request #11 from apache/master Apache spark pull latest commit 88734413c5e33803b7d5f8c0f81899ed1d3577ff Author: rjoshi2 Date: 2018-01-21T03:40:59Z [SPARK-11222][BUILD][PYTHON] python code style checker update commit 871889113057f819337bd24379cf1f07516c3298 Author: rjoshi2 Date: 2018-01-22T01:36:42Z updated for review comment commit c386a9a9b130e3974dc756b0fa89b7cff93f09ac Author: rjoshi2 Date: 2018-01-22T02:26:56Z [SPARK-23174][BUILD][PYTHON] python code style checker update commit d10fbb49cc6b062bfa7d463f0bae098c77b9a890 Author: rjoshi2 Date: 2018-01-24T03:17:05Z updated nit commit 34a8590b7fa205be3d2b3f4913e252ab0e96d091 Author: rjoshi2 Date: 2018-01-24T03:27:42Z updated for review comment commit 428fa093a8385eb977535cc7e4848ffe113dc9f3 Author: rjoshi2 Date: 2018-01-24T05:16:03Z [SPARK-11222][Build][Python] Doc style checker commit aab86ab987c01388bcda0412162fb3cf50593ea7 Author: rjoshi2 Date: 2018-01-24T05:18:27Z [SPARK-11222][Build][Python] Doc style checker updates commit e69827e17349196382b24ef65b843d0fbc76cc76 Author: rjoshi2 Date: 2018-01-24T05:20:11Z [SPARK-11222][Build][Python] Doc style checker updates
[GitHub] spark pull request #20355: SPARK-23148: [SQL] Allow pathnames with special c...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/20355#discussion_r163455727 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -78,4 +78,20 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext { } } } + + // Separate test case for text-based formats that support multiLine as an option. + Seq("json", "csv", "text").foreach { format => --- End diff -- That sounds good - my main concern is to make sure that both `multiLine=true` _and_ `multiLine=false` have coverage with a space in the name, since they are such different paths. I'll keep the change that adds a space to `nameWithSpecialChars`, but otherwise have the tests as you suggest - let me know what you think of the next patch!
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20377 cc @vanzin @cloud-fan @felixcheung @HyukjinKwon
[GitHub] spark pull request #19285: [SPARK-22068][CORE]Reduce the duplicate code betw...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19285#discussion_r163455326

--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -702,6 +645,76 @@ private[spark] class MemoryStore(
     }
   }

+private trait ValuesHolder[T] {
+  def storeValue(value: T): Unit
+  def estimatedSize(): Long
+  def buildEntry(): MemoryEntry[T]
+}
+
+/**
+ * A holder for storing the deserialized values.
+ */
+private class DeserializedValuesHolder[T] (classTag: ClassTag[T]) extends ValuesHolder[T] {
+  // Underlying vector for unrolling the block
+  var vector = new SizeTrackingVector[T]()(classTag)
+  var arrayValues: Array[T] = null
+  var preciseSize: Long = -1
--- End diff --

it can be a local variable.
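A sketch of the suggested change: since `preciseSize` is only computed and read when the entry is built, it can become a local value instead of a field initialized to `-1`. This is a hypothetical rendering, assuming the surrounding `MemoryStore` types (`SizeTrackingVector`, `SizeEstimator`, `DeserializedMemoryEntry`):

```scala
private class DeserializedValuesHolder[T](classTag: ClassTag[T]) extends ValuesHolder[T] {
  // Underlying vector for unrolling the block
  var vector = new SizeTrackingVector[T]()(classTag)
  var arrayValues: Array[T] = null

  override def storeValue(value: T): Unit = vector += value

  override def estimatedSize(): Long = vector.estimateSize()

  override def buildEntry(): MemoryEntry[T] = {
    arrayValues = vector.toArray
    vector = null
    // Only needed here, so a local value suffices (previously a field set to -1).
    val preciseSize = SizeEstimator.estimate(arrayValues)
    DeserializedMemoryEntry[T](arrayValues, preciseSize, classTag)
  }
}
```

Keeping the value local also removes the sentinel `-1` state, so a half-initialized size can never leak out of the holder.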
[GitHub] spark issue #20169: [SPARK-17088][hive] Fix 'sharesHadoopClasses' option whe...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20169

Let me explain my point here. I am not trying to offend anybody. Spark is becoming more and more complex, and peer review and test cases are the major methods we use for quality control. Spark is infrastructure software, not application software, so we really need to be careful when reviewing PRs. Nobody's code is perfect. For example, if I made a change in SHS, I would ping @vanzin for sure, because @vanzin is the original author and the most active committer of SHS. Personally, I am doing my best to review the changes in SQL. It would be nice if you could ping me or @cloud-fan. Since the Spark 2.3 release is the near-term priority, the whole community is currently focused on testing and reviewing the changes for that release.

Regarding this PR, I think we do not need to introduce the new parameter `warehouseDir`, which is already contained in `hadoopConf`. Keeping the interface clean is always good, I believe. Please review my follow-up PR: https://github.com/apache/spark/pull/20377
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20377 **[Test build #86562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86562/testReport)** for PR 20377 at commit [`0574ec7`](https://github.com/apache/spark/commit/0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed).
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20377 Merged build finished. Test PASSed.
[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20377 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/169/ Test PASSed.
[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20377

[SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating client

## What changes were proposed in this pull request?

This PR removes the redundant `warehouseDir` parameter, which is already contained in `hadoopConf`.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark followUp17088

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20377.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20377

commit 0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed
Author: gatorsmile
Date: 2018-01-24T05:05:47Z
remove warehousePath
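The simplification behind the follow-up can be sketched as follows. This is a hypothetical illustration (the function name is invented; the real change lives in the Hive client setup code), but `hive.metastore.warehouse.dir` is the standard Hive configuration key, so the location can be read from the configuration that is already passed along instead of being threaded through as a separate argument:

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical sketch: derive the warehouse location from the Hadoop
// configuration rather than accepting an extra warehouseDir parameter.
def warehouseDirFrom(hadoopConf: Configuration): Option[String] =
  Option(hadoopConf.get("hive.metastore.warehouse.dir"))
```

Fewer parameters carrying the same information means fewer ways for the two copies to disagree, which is the "keeping the interface clean" point made above.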
[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19285 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86557/ Test FAILed.
[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19285 Merged build finished. Test FAILed.
[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19285 **[Test build #86557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86557/testReport)** for PR 19285 at commit [`c988762`](https://github.com/apache/spark/commit/c9887625e89fa284ddf5a1ca115fe012580d1da8).

 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/20224#discussion_r163453639

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala ---
@@ -228,4 +229,21 @@ class WholeStageCodegenSuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test("including codegen stage ID in generated class name should not regress codegen caching") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> "true") {
+      val bytecodeSizeHisto = CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+      spark.range(3).select('id + 2).collect
+      val after1 = bytecodeSizeHisto.getCount
+      spark.range(3).select('id + 2).collect
+      val after2 = bytecodeSizeHisto.getCount // same query shape as above, deliberately
+      assert(after1 == after2, "the same query run twice should hit the codegen cache")
+
+      spark.range(5).select('id * 2).collect
+      val after3 = bytecodeSizeHisto.getCount
+      assert(after3 >= after2, "a different query can result in codegen cache miss, that's okay")
--- End diff --

I actually deliberately wrote it this way. Note how I phrased the assertion message as "can result in codegen cache miss" instead of "will result in". That's because the code shape of this third query was deliberately chosen to be similar to the two queries before it: all three have `spark.range(some_const).select(some_expr).collect`, so if any future changes to codegen of the `Range` or `Project` operators affect how much specialized code (such as constant values) we directly embed into the generated code, it's actually possible for this third query to generate the same code as the first two, in which case it would hit the codegen cache instead of missing it. So I'm making this check a bit loose: it's just there to indicate that it's acceptable for a different query to encounter a codegen cache miss. WDYT?
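The caching behavior under test - compilation results keyed by the generated source text, so identical generated code compiles only once - can be illustrated outside Spark with a tiny sketch. This is hypothetical and is not Spark's actual `CodeGenerator` cache (which is a Guava cache over the generated source):

```scala
import scala.collection.mutable

// Hypothetical sketch of a source-keyed compilation cache.
object CodegenCacheSketch {
  private var compileCount = 0
  private val cache = mutable.Map.empty[String, String]

  // Compiles only on a cache miss; returns the cached artifact otherwise.
  def compile(source: String): String =
    cache.getOrElseUpdate(source, { compileCount += 1; s"<compiled: $source>" })

  def main(args: Array[String]): Unit = {
    compile("range(3).select(id + 2)") // first query: compiles
    compile("range(3).select(id + 2)") // identical source: cache hit
    assert(compileCount == 1)
    compile("range(5).select(id * 2)") // different source: cache miss, compiles again
    assert(compileCount == 2)
  }
}
```

This mirrors the test's structure: the first two queries share generated source and hit the cache, while the third may or may not, which is why the assertion above is the loose `after3 >= after2` rather than an equality.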