[GitHub] spark pull request #21092: [SPARK-23984][K8S] Initial Python Bindings for Py...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21092#discussion_r186329528

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile ---
@@ -0,0 +1,34 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+ARG base_img
+FROM $base_img
+WORKDIR /
+RUN mkdir ${SPARK_HOME}/python
+COPY python/lib ${SPARK_HOME}/python/lib
+RUN apk add --no-cache python && \
+    apk add --no-cache python3 && \
+    python -m ensurepip && \
+    python3 -m ensurepip && \
+    rm -r /usr/lib/python*/ensurepip && \
+    pip install --upgrade pip setuptools && \
--- End diff --

Correct. Would love recommendations on dependency management with regard to `pip`, as it's tricky to allow for both pip and pip3 installation, unless I use two separate virtual environments for dependency management.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
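One possible shape for the two-interpreter setup mentioned above (a sketch only, not part of the PR — the `/opt/venv*` paths and the shared `python-requirements.txt` file are hypothetical placeholders):

```dockerfile
# Sketch: give python2 and python3 each their own virtualenv so that
# `pip` and `pip3` never share a site-packages directory.
RUN apk add --no-cache python python3 && \
    python -m ensurepip && \
    python3 -m ensurepip && \
    pip install --upgrade pip setuptools virtualenv && \
    pip3 install --upgrade pip setuptools && \
    virtualenv -p python /opt/venv2 && \
    python3 -m venv /opt/venv3 && \
    /opt/venv2/bin/pip install -r /opt/spark/python-requirements.txt && \
    /opt/venv3/bin/pip install -r /opt/spark/python-requirements.txt
```

Selecting an interpreter at runtime would then be a matter of putting the desired venv's `bin` directory first on `PATH`.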
[GitHub] spark issue #21231: [SPARK-24119][SQL]Add interpreted execution to SortPrefi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90288/
[GitHub] spark issue #21231: [SPARK-24119][SQL]Add interpreted execution to SortPrefi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21231 Merged build finished. Test PASSed.
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2979/
[GitHub] spark issue #21231: [SPARK-24119][SQL]Add interpreted execution to SortPrefi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21231

**[Test build #90288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90288/testReport)** for PR 21231 at commit [`dc9a070`](https://github.com/apache/spark/commit/dc9a0702584c9fd16a1d25c675aaf2b424cd7623).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21252 Merged build finished. Test PASSed.
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/21252

> Instead of touching inside of TakeOrderedAndProjectExec, how about we don't replace Sort + Limit with TakeOrderedAndProjectExec when reaching the threshold?

Yes, the code will be much cleaner. I updated the change. Note that all data will still be sorted if above the threshold, and all data will be within one partition after the limit operator.
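For intuition about the trade-off being discussed (a toy Python sketch, not Spark code — `threshold` here plays the role of the proposed config): a top-k operator keeps a bounded in-memory structure, which is only attractive when the limit is small; above the threshold, a full sort (which an engine can spill to disk) followed by a prefix take is used instead.

```python
import heapq

def take_ordered(rows, limit, threshold=1000):
    """Toy model of the planner choice: top-k heap vs. full sort.

    Small limit: keep a bounded top-k heap in memory, which is roughly
    what TakeOrderedAndProjectExec does.
    Large limit: fall back to a full sort (spillable in a real engine),
    then take the first `limit` rows.
    """
    if limit < threshold:
        return heapq.nsmallest(limit, rows)   # bounded memory: O(limit)
    return sorted(rows)[:limit]               # full-sort path

print(take_ordered([5, 3, 9, 1], 2))  # -> [1, 3]
```

Either branch returns the same answer; the threshold only changes how much memory the operator needs while computing it.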
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21252 **[Test build #90296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90296/testReport)** for PR 21252 at commit [`0cbacc3`](https://github.com/apache/spark/commit/0cbacc36fcb2cf965f0697effe1bb6f02db5fe8f).
[GitHub] spark pull request #21198: [SPARK-24126][pyspark] Use build-specific temp di...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21198
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21198 Merged to master.
[GitHub] spark pull request #21092: [SPARK-23984][K8S] Initial Python Bindings for Py...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21092#discussion_r186326478

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile ---
@@ -0,0 +1,34 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+ARG base_img
+FROM $base_img
+WORKDIR /
+RUN mkdir ${SPARK_HOME}/python
+COPY python/lib ${SPARK_HOME}/python/lib
+RUN apk add --no-cache python && \
+    apk add --no-cache python3 && \
+    python -m ensurepip && \
+    python3 -m ensurepip && \
+    rm -r /usr/lib/python*/ensurepip && \
+    pip install --upgrade pip setuptools && \
--- End diff --

this goes to python2 only, I think?
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21198 yes, for R, during package check it can only write to the R tempdir() (which can be changed in a number of ways)
[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186325093

--- Diff: dev/deps/spark-deps-hadoop-2.7 ---
@@ -163,13 +163,13 @@ orc-mapreduce-1.4.3-nohive.jar
 oro-2.0.8.jar
 osgi-resource-locator-1.0.1.jar
 paranamer-2.8.jar
-parquet-column-1.8.2.jar
-parquet-common-1.8.2.jar
-parquet-encoding-1.8.2.jar
-parquet-format-2.3.1.jar
-parquet-hadoop-1.8.2.jar
+parquet-column-1.10.0.jar
+parquet-common-1.10.0.jar
+parquet-encoding-1.10.0.jar
+parquet-format-2.4.0.jar
+parquet-hadoop-1.10.0.jar
 parquet-hadoop-bundle-1.6.0.jar
-parquet-jackson-1.8.2.jar
+parquet-jackson-1.10.0.jar
--- End diff --

(btw, don't forget to fix https://github.com/apache/spark/blob/ce7ba2e98e0a3b038e881c271b5905058c43155b/dev/deps/spark-deps-hadoop-3.1#L184-L190 too)
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21240 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2978/
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21240 Merged build finished. Test PASSed.
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2895/
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2977/
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Merged build finished. Test PASSed.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2976/
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test PASSed.
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21240 **[Test build #90295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90295/testReport)** for PR 21240 at commit [`4ab3af0`](https://github.com/apache/spark/commit/4ab3af0c1abfd0ac078c968dbe589bf96091).
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21240 retest this please.
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21240 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90287/
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21240 Merged build finished. Test FAILed.
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21240

**[Test build #90287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90287/testReport)** for PR 21240 at commit [`4ab3af0`](https://github.com/apache/spark/commit/4ab3af0c1abfd0ac078c968dbe589bf96091).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2895/
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2975/
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Merged build finished. Test PASSed.
[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21252

Instead of touching inside of `TakeOrderedAndProjectExec`, how about we don't replace `Sort` + `Limit` with `TakeOrderedAndProjectExec` when reaching the threshold? A.k.a:

```scala
object SpecialLimits extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case ReturnAnswer(rootPlan) => rootPlan match {
      case Limit(IntegerLiteral(limit), Sort(order, true, child))
          if limit < conf.sortInMemForLimitThreshold =>
        TakeOrderedAndProjectExec(limit, order, child.output, planLater(child)) :: Nil
      ...
}
```
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2894/
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16478 **[Test build #90294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90294/testReport)** for PR 16478 at commit [`ae00de1`](https://github.com/apache/spark/commit/ae00de13dd779a2a09b142c54a2fcc144d7f8c23).
[GitHub] spark issue #21253: [SPARK-24158][SS] Enable no-data batches for streaming j...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/21253 @brkyvz Can you take a look?
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21092 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2894/
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Merged build finished. Test PASSed.
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21070

> sql("select * from parquetTable where value = '0'")
> sql("select * from parquetTable where value = 0")

Semantically, they are different. The results should also be different. The plan shows whether these predicates are pushed down or not. Without the upgrade to 1.10.0, we can't get the benefit from predicate pushdown for the above two queries. I also double-checked it in my local environment. Now, it sounds like a good reason to upgrade the version, since string predicates are pretty common.

@rdblue @maropu Thank you for your great efforts and patience! This upgrade is definitely good to have. We should try to make it into the next release. @maropu Could you also submit a separate PR for your micro-benchmark? Thanks!

cc @cloud-fan @hvanhovell @mswit-databricks Do you have any comments about the implementation of this PR?
[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21092 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2974/
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #90293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90293/testReport)** for PR 16677 at commit [`062b8fd`](https://github.com/apache/spark/commit/062b8fd58ae13f252b1e6f61c70b69ed05521715).
[GitHub] spark issue #21239: [SPARK-24040][SS] Support single partition aggregates in...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/21239

This looks pretty good to me. My only major concern is how we are enabling it. Adding such a wide enable-all flag to enable this is not a good idea. Rather, I would like it to be enabled surgically:
- The unsupported-operation checker allows the query when there is only one aggregate on streaming (this does not need to change when we add multi-partition aggregates).
- ContinuousExecution/UnsupportedOperationChecker adds one additional check to verify whether the shuffle partition count is 1 (a single-line check that can be deleted when we add multi-partition aggregates).
[GitHub] spark pull request #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLL...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21254#discussion_r186322250

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2313,6 +2313,25 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
     }
   }
 
+  test("SPARK-23723: write json in UTF-16/32 with multiline off") {
+    Seq("UTF-16", "UTF-32").foreach { encoding =>
+      withTempPath { path =>
+        val ds = spark.createDataset(Seq(
+          ("a", 1), ("b", 2), ("c", 3))
+        ).repartition(2)
--- End diff --

we don't have to repartition though.
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186322252

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousAggregationSuite.scala ---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming.continuous
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStream
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.streaming.OutputMode
+
+class ContinuousAggregationSuite extends ContinuousSuiteBase {
+  import testImplicits._
+
+  test("not enabled") {
+    val ex = intercept[AnalysisException] {
+      val input = ContinuousMemoryStream.singlePartition[Int]
+
+      testStream(input.toDF().agg(max('value)), OutputMode.Complete)(
+        AddData(input, 0, 1, 2),
+        CheckAnswer(2),
+        StopStream,
+        AddData(input, 3, 4, 5),
+        StartStream(),
+        CheckAnswer(5),
+        AddData(input, -1, -2, -3),
+        CheckAnswer(5))
+    }
+
+    assert(ex.getMessage.contains("Continuous processing does not support Aggregate operations"))
+  }
+
+  test("basic") {
+    withSQLConf(("spark.sql.streaming.continuous.allowAllOperators", "true")) {
+      val input = ContinuousMemoryStream.singlePartition[Int]
+
+      testStream(input.toDF().agg(max('value)), OutputMode.Complete)(
--- End diff --

Is only complete mode allowed?
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21254

Nope, I am quite sure that we don't have any kind of hidden behaviour change. Both `lineSep` and `encoding` options are new. Also, these are actually more restrictive now than is needed. Writing can actually work, and https://github.com/apache/spark/pull/21247 tries to allow it; however, I left a comment for him to just focus on getting rid of the restrictions, which is his final goal in 2.4.0 (or 3.0.0).
[GitHub] spark issue #21253: [SPARK-24158][SS] Enable no-data batches for streaming j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21253 Merged build finished. Test PASSed.
[GitHub] spark issue #21253: [SPARK-24158][SS] Enable no-data batches for streaming j...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21253 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2973/
[GitHub] spark pull request #21092: [SPARK-23984][K8S] Initial Python Bindings for Py...
Github user ifilonenko commented on a diff in the pull request: https://github.com/apache/spark/pull/21092#discussion_r186321782

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala ---
@@ -63,10 +67,17 @@ private[spark] case class KubernetesConf[T <: KubernetesRoleSpecificConf](
     .map(str => str.split(",").toSeq)
     .getOrElse(Seq.empty[String])
 
-  def sparkFiles(): Seq[String] = sparkConf
-    .getOption("spark.files")
-    .map(str => str.split(",").toSeq)
-    .getOrElse(Seq.empty[String])
+  def pyFiles(): Option[String] = sparkConf
+    .get(KUBERNETES_PYSPARK_PY_FILES)
+
+  def pySparkMainResource(): Option[String] = sparkConf
--- End diff --

I need to parse out the MainAppResource (which I thought we should be doing only once); as such, I thought it would be cleaner to do this...
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186321753

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousAggregationSuite.scala ---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming.continuous
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStream
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.streaming.OutputMode
+
+class ContinuousAggregationSuite extends ContinuousSuiteBase {
+  import testImplicits._
+
+  test("not enabled") {
+    val ex = intercept[AnalysisException] {
+      val input = ContinuousMemoryStream.singlePartition[Int]
+
+      testStream(input.toDF().agg(max('value)), OutputMode.Complete)(
+        AddData(input, 0, 1, 2),
+        CheckAnswer(2),
+        StopStream,
+        AddData(input, 3, 4, 5),
+        StartStream(),
+        CheckAnswer(5),
+        AddData(input, -1, -2, -3),
--- End diff --

You don't need this many actions just to check whether it throws AnalysisException.
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186321671

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousAggregationSuite.scala ---
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming.continuous
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStream
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.streaming.OutputMode
+
+class ContinuousAggregationSuite extends ContinuousSuiteBase {
+  import testImplicits._
+
+  test("not enabled") {
+    val ex = intercept[AnalysisException] {
+      val input = ContinuousMemoryStream.singlePartition[Int]
+
+      testStream(input.toDF().agg(max('value)), OutputMode.Complete)(
+        AddData(input, 0, 1, 2),
+        CheckAnswer(2),
+        StopStream,
+        AddData(input, 3, 4, 5),
+        StartStream(),
+        CheckAnswer(5),
+        AddData(input, -1, -2, -3),
+        CheckAnswer(5))
+    }
+
+    assert(ex.getMessage.contains("Continuous processing does not support Aggregate operations"))
+  }
+
+  test("basic") {
+    withSQLConf(("spark.sql.streaming.continuous.allowAllOperators", "true")) {
--- End diff --

Why are we using this flag to enable this? I would expect that this will be enabled by default, with a more specific feature flag (say, continuousAggregationsEnabled) to turn this off. Also, in that case, the checks inside UnsupportedOperationChecker will be more specific (single aggregation, with shuffle partitions = 1, no watermark, etc.).
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21254 Merged build finished. Test FAILed.
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21254 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2972/ Test FAILed.
[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21122
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186321416 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala --- @@ -71,8 +72,15 @@ class StateStoreRDD[T: ClassTag, U: ClassTag]( StateStoreId(checkpointLocation, operatorId, partition.index), queryRunId) +// If we're in continuous processing mode, we should get the store version for the current +// epoch rather than the one at planning time. +val currentVersion = EpochTracker.getCurrentEpoch match { + case -1 => storeVersion --- End diff -- I don't like the fact that -1 is used as a value for non-existence, especially since this value is present deep inside the EpochTracker. Might as well use Option to return None. Then the use of -1 will be contained within the EpochTracker class.
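The Option-based API tdas suggests could look roughly like the following. This is a minimal sketch, not the actual EpochTracker (which uses thread-local state); `EpochTrackerSketch` is a made-up name, and the single shared `AtomicLong` is a simplifying assumption.

```scala
import java.util.concurrent.atomic.AtomicLong

// Sketch only: the -1 sentinel stays private to the tracker, and callers
// see Option[Long], so the encoding never escapes this object.
object EpochTrackerSketch {
  private val current = new AtomicLong(-1L) // -1 means "no epoch set yet"

  def setCurrentEpoch(epoch: Long): Unit = current.set(epoch)

  // None when no epoch has been set; Some(epoch) otherwise.
  def getCurrentEpoch: Option[Long] = current.get() match {
    case -1L => None
    case e   => Some(e)
  }
}
```

A caller like StateStoreRDD could then write `EpochTrackerSketch.getCurrentEpoch.getOrElse(storeVersion)` instead of pattern-matching on -1.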
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90290/ Test FAILed.
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21254 **[Test build #90292 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90292/testReport)** for PR 21254 at commit [`d4c290e`](https://github.com/apache/spark/commit/d4c290e85eab07706a8a612dbbf58d5c14588b43).
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186321310 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/ContinuousMemoryStream.scala --- @@ -47,18 +47,17 @@ import org.apache.spark.util.RpcUtils *ContinuousMemoryStreamDataReader instances to poll. It returns the record at the specified *offset within the list, or null if that offset doesn't yet have a record. */ -class ContinuousMemoryStream[A : Encoder](id: Int, sqlContext: SQLContext) +class ContinuousMemoryStream[A : Encoder](id: Int, sqlContext: SQLContext, numPartitions: Int = 2) --- End diff -- +1, I had an itch about this in the continuous memory stream PR.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Merged build finished. Test FAILed.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16478 **[Test build #90290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90290/testReport)** for PR 16478 at commit [`9fd6c2f`](https://github.com/apache/spark/commit/9fd6c2f21353b41e11748a3f9edaf17647c8bd7f). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186321237 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochTracker.scala --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.continuous + +import java.util.concurrent.atomic.AtomicLong + +object EpochTracker { + // The current epoch. Note that this is a shared reference; ContinuousWriteRDD.compute() will --- End diff -- This goes for other methods below.
[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186321180 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val indexName = tabName + "_index" withTable(tabName) { // Spark SQL does not support creating index. Thus, we have to use Hive client. - val client = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client + val client = + spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog].client --- End diff -- The issue is out of the scope of this PR. We can continue addressing the related issues in future PRs.
[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21122 Thanks! Merged to master
[GitHub] spark issue #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21254 cc @MaxGekk @HyukjinKwon Do we have any behavior change after the previous PR: https://github.com/apache/spark/pull/20937?
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186321087 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochTracker.scala --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.continuous + +import java.util.concurrent.atomic.AtomicLong + +object EpochTracker { + // The current epoch. Note that this is a shared reference; ContinuousWriteRDD.compute() will --- End diff -- Better not to refer to specific usages of methods, but rather to the general purpose of the method. In the future, classes other than ContinuousWriteRDD can use it.
[GitHub] spark pull request #21254: [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLL...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/21254 [SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] Support custom encoding for json files ## What changes were proposed in this pull request? This is to add a test case to check the behaviors when users write json in the specified UTF-16/UTF-32 encoding with multiline off. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark followupSPARK-23094 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21254.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21254 commit 07de099d9fe3b2a39380ce08d095a564ce6c00f4 Author: gatorsmile Date: 2018-05-07T03:35:52Z test case commit d4c290e85eab07706a8a612dbbf58d5c14588b43 Author: gatorsmile Date: 2018-05-07T03:37:02Z name
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186320975 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochTracker.scala --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.continuous + +import java.util.concurrent.atomic.AtomicLong + +object EpochTracker { --- End diff -- Add docs to explain the purpose and usage of this class.
[GitHub] spark issue #21253: [SPARK-24158][SS] Enabled no-data batches for streaming ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21253 **[Test build #90291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90291/testReport)** for PR 21253 at commit [`cb5f55b`](https://github.com/apache/spark/commit/cb5f55b4622fc8637950013a5f6a7005cecf9a07).
[GitHub] spark pull request #21252: [SPARK-24193] Sort by disk when number of limit i...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/21252#discussion_r186320836 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala --- @@ -127,13 +127,36 @@ case class TakeOrderedAndProjectExec( projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryExecNode { + val sortInMemThreshold = conf.sortInMemForLimitThreshold + + override def requiredChildDistribution: Seq[Distribution] = { +if (limit < sortInMemThreshold) { + super.requiredChildDistribution +} else { + Seq(AllTuples) +} + } + + override def requiredChildOrdering: Seq[Seq[SortOrder]] = { +if (limit < sortInMemThreshold) { + super.requiredChildOrdering +} else { + Seq(sortOrder) +} + } --- End diff -- Thanks a lot for the comments :) Our users have queries with a big limit, which suffer OOM with the current code. So I added a config: below the threshold we go by the current code; otherwise we do a global sort. Yes, there will be performance degradation, but it's better than OOM. I think a limited heap sort by disk would be a better idea, but I didn't find an existing implementation and don't want to make it too complicated here. Thanks again for reviewing this @viirya
[GitHub] spark pull request #21253: [SPARK-24158][SS] Enabled no-data batches for str...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/21253 [SPARK-24158][SS] Enabled no-data batches for streaming joins ## What changes were proposed in this pull request? This is a continuation of the larger task of enabling zero-data batches for more eager state cleanup. This PR enables it for stream-stream joins. ## How was this patch tested? - Updated join tests. Additionally, updated them to not use `CheckLastBatch` anywhere to set good precedence for future. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tdas/spark SPARK-24158 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21253.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21253 commit cb5f55b4622fc8637950013a5f6a7005cecf9a07 Author: Tathagata Das Date: 2018-05-03T02:55:35Z Enabled no-data batches for joins
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2971/ Test PASSed.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Merged build finished. Test PASSed.
[GitHub] spark pull request #21239: [SPARK-24040][SS] Support single partition aggreg...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/21239#discussion_r186320417 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala --- @@ -123,15 +123,7 @@ class ContinuousExecution( } committedOffsets = nextOffsets.toStreamProgress(sources) -// Get to an epoch ID that has definitely never been sent to a sink before. Since sink -// commit happens between offset log write and commit log write, this means an epoch ID -// which is not in the offset log. -val (latestOffsetEpoch, _) = offsetLog.getLatest().getOrElse { - throw new IllegalStateException( -s"Offset log had no latest element. This shouldn't be possible because nextOffsets is" + - s"an element.") -} -currentBatchId = latestOffsetEpoch + 1 +currentBatchId = latestEpochId + 1 --- End diff -- nit: remove the line above.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16478 **[Test build #90290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90290/testReport)** for PR 16478 at commit [`9fd6c2f`](https://github.com/apache/spark/commit/9fd6c2f21353b41e11748a3f9edaf17647c8bd7f).
[GitHub] spark pull request #21252: [SPARK-24193] Sort by disk when number of limit i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21252#discussion_r186318646 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala --- @@ -127,13 +127,36 @@ case class TakeOrderedAndProjectExec( projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryExecNode { + val sortInMemThreshold = conf.sortInMemForLimitThreshold + + override def requiredChildDistribution: Seq[Distribution] = { +if (limit < sortInMemThreshold) { + super.requiredChildDistribution +} else { + Seq(AllTuples) +} + } + + override def requiredChildOrdering: Seq[Seq[SortOrder]] = { +if (limit < sortInMemThreshold) { + super.requiredChildOrdering +} else { + Seq(sortOrder) +} + } --- End diff -- It was only shuffling the limited rows from each partition. The final sort was also only applied on the limited data from all partitions. Now it requires sorting and shuffling the whole data. I suspect that it'd degrade performance in the end.
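The per-partition pruning viirya describes can be sketched independently of Spark's physical operators. `takeOrderedAcrossPartitions` is a made-up helper, and an in-memory `Seq[Seq[T]]` stands in for RDD partitions; the point is that at most `numPartitions * limit` rows ever reach the final merge, whereas a global sort touches every row.

```scala
// Sketch of the existing TakeOrdered strategy: each "partition" keeps only
// its top `limit` rows (map-side pruning), and the final sort runs only
// over those pruned rows.
def takeOrderedAcrossPartitions[T](
    partitions: Seq[Seq[T]],
    limit: Int)(implicit ord: Ordering[T]): Seq[T] = {
  val perPartitionTopK = partitions.map(_.sorted.take(limit)) // bounded per-partition output
  perPartitionTopK.flatten.sorted.take(limit)                 // final merge on pruned data only
}
```

With a very large `limit`, the per-partition heaps themselves no longer fit in memory, which is the OOM jinxing64 reports and why the PR falls back to a global sort above the threshold.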
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21198 Merged build finished. Test PASSed.
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21198 **[Test build #90289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90289/testReport)** for PR 21198 at commit [`474baf9`](https://github.com/apache/spark/commit/474baf9bc61363ae2d06dbb638abb895005f55fc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21198 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90289/ Test PASSed.
[GitHub] spark pull request #21235: [SPARK-24181][SQL] Better error message for writi...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21235#discussion_r186316448 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -339,9 +339,16 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } } - private def assertNotBucketed(operation: String): Unit = { -if (numBuckets.isDefined || sortColumnNames.isDefined) { - throw new AnalysisException(s"'$operation' does not support bucketing right now") --- End diff -- How about keeping the function name unchanged and just changing this message, listing the sort columns if any? Something like: > '$operation' does not support bucketing. Number of buckets: ...; sortBy: ...;
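The message gatorsmile proposes could be assembled as in the sketch below. The surrounding `BucketedWriterSketch` class and the exception type are simplified stand-ins (Spark's `DataFrameWriter` throws `AnalysisException`); only the `numBuckets`/`sortColumnNames` option fields mirror the writer's actual state.

```scala
// Sketch only: builds the "'$operation' does not support bucketing.
// Number of buckets: ...; sortBy: ...;" message from whichever fields are set.
class BucketedWriterSketch(
    numBuckets: Option[Int],
    sortColumnNames: Option[Seq[String]]) {

  def assertNotBucketed(operation: String): Unit = {
    if (numBuckets.isDefined || sortColumnNames.isDefined) {
      val buckets = numBuckets.map(n => s" Number of buckets: $n;").getOrElse("")
      val sortBy = sortColumnNames.map(c => s" sortBy: ${c.mkString(", ")};").getOrElse("")
      throw new IllegalArgumentException( // AnalysisException in Spark itself
        s"'$operation' does not support bucketing.$buckets$sortBy")
    }
  }
}
```

This keeps the method name (and call sites) unchanged while making the failure self-describing.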
[GitHub] spark pull request #21212: [SPARK-24143] filter empty blocks when convert ma...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/21212#discussion_r186315362 --- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala --- @@ -267,28 +269,30 @@ final class ShuffleBlockFetcherIterator( // at most maxBytesInFlight in order to limit the amount of data in flight. val remoteRequests = new ArrayBuffer[FetchRequest] -// Tracks total number of blocks (including zero sized blocks) -var totalBlocks = 0 for ((address, blockInfos) <- blocksByAddress) { - totalBlocks += blockInfos.size if (address.executorId == blockManager.blockManagerId.executorId) { -// Filter out zero-sized blocks -localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1) +blockInfos.find(_._2 <= 0) match { + case Some((blockId, size)) if size < 0 => +throw new BlockException(blockId, "Negative block size " + size) + case Some((blockId, size)) if size == 0 => +throw new BlockException(blockId, "Zero-sized blocks should be excluded.") --- End diff -- I think that failing with an exception here is a great idea, so thanks for adding these checks. In general, I'm in favor of adding explicit fail-fast checks for invariants like this because it can help to defend against silent corruption bugs.
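The fail-fast shape endorsed above reduces to a small standalone check: reject invalid sizes up front instead of silently filtering them, so a corrupted size surfaces immediately. `validateBlockSizes` and the plain `(id, size)` pairs are simplified stand-ins for the iterator's `BlockId` bookkeeping and `BlockException`.

```scala
// Sketch of a fail-fast invariant check: negative or zero sizes indicate an
// upstream bug (they should have been excluded before reaching the fetcher),
// so we throw rather than drop them quietly.
def validateBlockSizes(blocks: Seq[(String, Long)]): Seq[String] = {
  blocks.find(_._2 <= 0) match {
    case Some((id, size)) if size < 0 =>
      throw new IllegalStateException(s"Negative block size $size for block $id")
    case Some((id, _)) =>
      throw new IllegalStateException(s"Zero-sized block $id should have been excluded upstream")
    case None =>
      blocks.map(_._1) // all sizes valid; pass the block ids through
  }
}
```

The earlier code's `filter(_._2 != 0)` would have masked exactly the corruption this check now reports.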
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21070 For documentation, I think it's fine. It's a trackable history in JIRA, although it's a bit hard to search. Also, there might be some differences we should check, since I haven't checked the details of the JIRAs and whether they still hold.
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21198 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2970/ Test PASSed.
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21198 Merged build finished. Test PASSed.
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21198 **[Test build #90289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90289/testReport)** for PR 21198 at commit [`474baf9`](https://github.com/apache/spark/commit/474baf9bc61363ae2d06dbb638abb895005f55fc).
[GitHub] spark issue #21198: [SPARK-24126][pyspark] Use build-specific temp directory...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21198 retest this please
[GitHub] spark issue #21231: [SPARK-24119][SQL]Add interpreted execution to SortPrefi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21231 **[Test build #90288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90288/testReport)** for PR 21231 at commit [`dc9a070`](https://github.com/apache/spark/commit/dc9a0702584c9fd16a1d25c675aaf2b424cd7623).
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21240 Merged build finished. Test PASSed.
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21240 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2969/ Test PASSed.
[GitHub] spark issue #21240: [SPARK-21274][SQL] Add a new generator function replicat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21240 **[Test build #90287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90287/testReport)** for PR 21240 at commit [`4ab3af0`](https://github.com/apache/spark/commit/4ab3af0c1abfd0ac078c968dbe589bf96091).
[GitHub] spark pull request #21231: [SPARK-24119][SQL]Add interpreted execution to So...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/21231#discussion_r186307490 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SortOrderExpressionsSuite.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.sql.{Date, Timestamp} +import java.util.TimeZone + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.util.collection.unsafe.sort.PrefixComparators._ + +class SortOrderExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { + + test("SortPrefix") { +// Explicitly choose a time zone, since Date objects can create different values depending on +// local time zone of the machine on which the test is running +TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) --- End diff -- @kiszk Maybe not a nit. I should fix that.
[GitHub] spark pull request #21231: [SPARK-24119][SQL]Add interpreted execution to So...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21231#discussion_r186307162 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SortOrderExpressionsSuite.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import java.sql.{Date, Timestamp} +import java.util.TimeZone + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.util.collection.unsafe.sort.PrefixComparators._ + +class SortOrderExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { + + test("SortPrefix") { +// Explicitly choose a time zone, since Date objects can create different values depending on +// local time zone of the machine on which the test is running +TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) --- End diff -- nit: do we need to restore the original time zone?
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21070 @HyukjinKwon Thanks for the explanation! I feel it's ok to keep this as it is for now. btw, shouldn't we document the story somewhere?
[GitHub] spark issue #21251: [SPARK-10878][core] Fix race condition when multiple cli...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21251 cc @jiangxb1987 @vanzin
[GitHub] spark pull request #21106: [SPARK-23711][SQL][WIP] Add fallback logic for Un...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21106#discussion_r186305770

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodegenObjectFactory.scala ---
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.codehaus.commons.compiler.CompileException
+import org.codehaus.janino.InternalCompilerException
+
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.Utils
+
+/**
+ * Catches compile errors during code generation.
+ */
+object CodegenError {
+  def unapply(throwable: Throwable): Option[Exception] = throwable match {
+    case e: InternalCompilerException => Some(e)
+    case e: CompileException => Some(e)
+    case _ => None
+  }
+}
+
+/**
+ * A factory which can be used to create objects that have both codegen and interpreted
+ * implementations. It tries to create the codegen object first; if any compile error happens,
+ * it falls back to the interpreted version.
+ */
+abstract class CodegenObjectFactory[IN, OUT] {
+
+  // Creates the wanted object, trying the codegen implementation first. If any compile error
+  // happens, falls back to the interpreted version.
+  def createObject(in: IN): OUT = {
+    val fallbackMode = SQLConf.get.getConf(SQLConf.CODEGEN_OBJECT_FALLBACK)

--- End diff --

We can only access SQLConf on the driver, so we can't read this config value inside `createObject`, which can be called on executors. To solve this, `UnsafeProjection` can't be an object as it is now, but a case class with a fallback-mode parameter, so that we can create this factory on the driver. @cloud-fan @hvanhovell WDYT?
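viirya's suggestion can be sketched as follows: the factory evaluates the config once when it is constructed (on the driver) and carries the result as a serialized field, so `createObject` never touches `SQLConf` on executors. This is an illustrative sketch only, not the PR's final code — the boolean flag and the fallback exception handling below are simplified stand-ins for `SQLConf.CODEGEN_OBJECT_FALLBACK` and the `CodegenError` extractor:

```scala
// Sketch: capture the fallback decision at construction time (driver side)
// instead of reading the config lazily inside createObject.
abstract class CodegenObjectFactory[IN, OUT](useCodegen: Boolean) extends Serializable {

  protected def createCodeGeneratedObject(in: IN): OUT
  protected def createInterpretedObject(in: IN): OUT

  // `useCodegen` was computed on the driver, so this method is safe to call
  // on executors after the factory is deserialized.
  def createObject(in: IN): OUT =
    if (useCodegen) {
      try createCodeGeneratedObject(in)
      catch { case _: Exception => createInterpretedObject(in) }
    } else {
      createInterpretedObject(in)
    }
}
```

A concrete subclass supplies both implementations; a compile failure in the codegen path then falls back to the interpreted one without consulting any configuration at call time.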
[GitHub] spark issue #21231: [SPARK-24119][SQL]Add interpreted execution to SortPrefi...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/21231 @maropu @kiszk Hopefully I've addressed all comments. Please take a look.
[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21180 Hi @HyukjinKwon, could you kindly have a look at the new version? It is backwards compatible and has the same robustness guarantees as `collections.namedtuple` from CPython.
[GitHub] spark issue #21182: [SPARK-24068] Propagating DataFrameReader's options to T...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21182 Merged build finished. Test PASSed.
[GitHub] spark issue #21182: [SPARK-24068] Propagating DataFrameReader's options to T...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21182 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90286/ Test PASSed.
[GitHub] spark issue #21182: [SPARK-24068] Propagating DataFrameReader's options to T...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21182 **[Test build #90286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90286/testReport)** for PR 21182 at commit [`6610826`](https://github.com/apache/spark/commit/6610826fbb229ca509c0ae9e023b36253852af33). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21212: [SPARK-24143] filter empty blocks when convert mapstatus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21212 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90283/ Test PASSed.
[GitHub] spark issue #21212: [SPARK-24143] filter empty blocks when convert mapstatus...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21212 Merged build finished. Test PASSed.
[GitHub] spark issue #21212: [SPARK-24143] filter empty blocks when convert mapstatus...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21212 **[Test build #90283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90283/testReport)** for PR 21212 at commit [`094d829`](https://github.com/apache/spark/commit/094d829ff5c36cdc358f4d6e635297ca4f4e400f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18447 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90285/ Test PASSed.
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18447 Merged build finished. Test PASSed.
[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18447 **[Test build #90285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90285/testReport)** for PR 18447 at commit [`b964573`](https://github.com/apache/spark/commit/b96457399433279964575ff570f9364d628dd240). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.