[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22104#discussion_r210391237

--- Diff: python/pyspark/sql/tests.py ---
@@ -3367,6 +3367,35 @@ def test_ignore_column_of_all_nulls(self):
         finally:
             shutil.rmtree(path)
 
+    # SPARK-24721
+    def test_datasource_with_udf_filter_lit_input(self):
+        import pandas as pd
+        import numpy as np
+        from pyspark.sql.functions import udf, pandas_udf, lit, col
+
+        path = tempfile.mkdtemp()
+        shutil.rmtree(path)
+        try:
+            self.spark.range(1).write.mode("overwrite").format('csv').save(path)
+            filesource_df = self.spark.read.csv(path)
--- End diff --

@gatorsmile Added tests for file source, data source and data source v2. I might need to move the pandas_udf tests into another test class because of the Arrow requirement :(
[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21698 I tried a prototype to fix the handling of fetch failures; it seems not that hard: https://github.com/apache/spark/pull/22112
[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22104#discussion_r210390770

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala ---
@@ -133,6 +134,9 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
   }
 
   def apply(plan: SparkPlan): SparkPlan = plan transformUp {
+    // SPARK-24721: Ignore Python UDFs in DataSourceScan and DataSourceV2Scan
+    case plan: DataSourceScanExec => plan
--- End diff --

I got rid of the logic previously in `FileSourceStrategy` that excluded PythonUDF from the filters, in favor of this fix - I think this fix is cleaner.
[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94821/testReport)** for PR 22112 at commit [`1f9f6e5`](https://github.com/apache/spark/commit/1f9f6e5b020038be1e7c11b9923010465da385aa).
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Merged build finished. Test PASSed.
[GitHub] spark pull request #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() da...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22112

[WIP][SPARK-23243][Core] Fix RDD.repartition() data correctness issue

## What changes were proposed in this pull request?

An alternative fix for https://github.com/apache/spark/pull/21698

An RDD can take an arbitrary user function, but we have an assumption: the function should produce the same data set for the same input, though the order can change. The Spark scheduler must take care of this assumption when a fetch failure happens, otherwise we may hit the correctness issue the JIRA ticket describes.

Generally speaking, when a map stage gets retried because of a fetch failure, and this map stage is not idempotent (it produces the same data set but in a different order each time), and the shuffle partitioner is sensitive to the input data order (like a round-robin partitioner), we should retry all the reduce tasks.

TODO: document and test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark repartition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22112.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22112

commit 1f9f6e5b020038be1e7c11b9923010465da385aa
Author: Wenchen Fan
Date: 2018-08-15T18:38:24Z

    fix repartition+shuffle bug
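A minimal sketch of the order-sensitivity problem described above - illustrative only, assuming a local `SparkContext` named `sc`; the variable names are made up and not part of the PR:

```scala
import scala.util.Random

// A map stage whose output *set* is stable but whose order is not:
// rerunning a partition yields the same records in a different order.
val nonDeterministicOrder = sc.parallelize(1 to 100, 4)
  .mapPartitions(it => Random.shuffle(it.toList).iterator)

// repartition() distributes records round-robin by position, so which
// output partition a record lands in depends on the input order. If a
// fetch failure forces only *some* map tasks to be re-run, the re-run
// tasks can route records differently than the original run did,
// duplicating some records and dropping others.
val repartitioned = nonDeterministicOrder.repartition(7)
println(repartitioned.count()) // 100 on a clean run; not guaranteed under partial retry
```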
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2223/ Test PASSed.
[GitHub] spark pull request #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/22104#discussion_r210390399

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvalPythonExec.scala ---
@@ -117,15 +117,16 @@ abstract class EvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], chil
       }.toArray
     }.toArray
-    val projection = newMutableProjection(allInputs, child.output)
+    val projection = UnsafeProjection.create(allInputs, child.output)
     val schema = StructType(dataTypes.zipWithIndex.map { case (dt, i) =>
       StructField(s"_$i", dt)
     })
 
     // Add rows to queue to join later with the result.
     val projectedRowIter = iter.map { inputRow =>
-      queue.add(inputRow.asInstanceOf[UnsafeRow])
-      projection(inputRow)
+      val unsafeRow = projection(inputRow)
+      queue.add(unsafeRow.asInstanceOf[UnsafeRow])
--- End diff --

This is probably another bug I found while testing this - if the input node to EvalPythonExec doesn't produce UnsafeRow, the cast here will fail. I found this in testing when I passed in a test data source scan node that produces GenericInternalRow, which threw an exception here. I am happy to submit this as a separate patch if people think it's necessary.
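A self-contained sketch of the failure mode described in this comment - illustrative, not the PR's code; it assumes Spark's catalyst classes on the classpath:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(StructField("i", IntegerType) :: Nil)

// A row as a non-code-generated scan might produce it.
val row: InternalRow = new GenericInternalRow(Array[Any](42))

// What the old code effectively did -- fails at runtime:
// row.asInstanceOf[UnsafeRow]  // ClassCastException

// The fix: project explicitly, which works for any InternalRow subclass.
val projection = UnsafeProjection.create(schema)
val unsafeRow: UnsafeRow = projection(row)
println(unsafeRow.getInt(0)) // 42
```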
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Merged build finished. Test FAILed.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21889 Merged build finished. Test PASSed.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21977 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94804/ Test FAILed.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22104 **[Test build #94820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94820/testReport)** for PR 22104 at commit [`38f3dbb`](https://github.com/apache/spark/commit/38f3dbbbd7d77b59b8441daf14f3a94ead1401b9).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21889 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94805/ Test PASSed.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94820/ Test FAILed.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22104 Merged build finished. Test FAILed.
[GitHub] spark issue #21977: SPARK-25004: Add spark.executor.pyspark.memory limit.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21977 **[Test build #94804 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94804/testReport)** for PR 21977 at commit [`6810fc7`](https://github.com/apache/spark/commit/6810fc7f30131750a1a472467b7984a2c9150a9b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21889 **[Test build #94805 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94805/testReport)** for PR 21889 at commit [`8d822ee`](https://github.com/apache/spark/commit/8d822eea805e1b2dc40b866ca8ac4893e53ad51b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22104: [SPARK-24721][SQL] Exclude Python UDFs filters in FileSo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22104 **[Test build #94820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94820/testReport)** for PR 22104 at commit [`38f3dbb`](https://github.com/apache/spark/commit/38f3dbbbd7d77b59b8441daf14f3a94ead1401b9).
[GitHub] spark issue #22013: [SPARK-23939][SQL] Add transform_keys function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22013 **[Test build #94819 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94819/testReport)** for PR 22013 at commit [`2f4943f`](https://github.com/apache/spark/commit/2f4943f3cec0705c296b2988c415ac3372b7ea86).
[GitHub] spark pull request #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak an...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22106#discussion_r210387003

--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala ---
@@ -40,12 +40,7 @@ class KafkaContinuousSinkSuite extends KafkaContinuousTest {
 
   override val streamingTimeout = 30.seconds
 
-  override def beforeAll(): Unit = {
-    super.beforeAll()
-    testUtils = new KafkaTestUtils(
-      withBrokerProps = Map("auto.create.topics.enable" -> "false"))
-    testUtils.setup()
-  }
+  override val brokerProps = Map("auto.create.topics.enable" -> "false")
--- End diff --

This is the fix for the Kafka cluster leak.
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22106 **[Test build #4262 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4262/testReport)** for PR 22106 at commit [`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22106 **[Test build #4261 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4261/testReport)** for PR 22106 at commit [`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22106 **[Test build #4260 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4260/testReport)** for PR 22106 at commit [`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22106 **[Test build #4259 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4259/testReport)** for PR 22106 at commit [`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21308 **[Test build #94818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94818/testReport)** for PR 21308 at commit [`e32e6c4`](https://github.com/apache/spark/commit/e32e6c4a4c3df527a5fddb8b694b0ed303e16fc4).
[GitHub] spark issue #22013: [SPARK-23939][SQL] Add transform_keys function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22013 **[Test build #94817 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94817/testReport)** for PR 22013 at commit [`58b60b2`](https://github.com/apache/spark/commit/58b60b2f851fb1464743257fe1cca075a1e77ba9).
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21308 Merged build finished. Test PASSed.
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified// Test PASSed.
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22106 **[Test build #94816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94816/testReport)** for PR 22106 at commit [`63cc11d`](https://github.com/apache/spark/commit/63cc11dfa575ac25ee3751a93a2cb5a6b9886218).
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22106 Merged build finished. Test PASSed.
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22106 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2221/ Test PASSed.
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22106 **[Test build #94815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94815/testReport)** for PR 22106 at commit [`621099e`](https://github.com/apache/spark/commit/621099e39bd1faa8235f2a2a6e7db68939b68f81).
[GitHub] spark pull request #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak an...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/22106#discussion_r210383608

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala ---
@@ -216,7 +216,7 @@ class KafkaContinuousInputPartitionReader(
     } catch {
       // We didn't read within the timeout. We're supposed to block indefinitely for new data, so
       // swallow and ignore this.
-      case _: TimeoutException =>
+      case _: TimeoutException | _: org.apache.kafka.common.errors.TimeoutException =>
--- End diff --

This is to fix https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4254/ : `org.apache.kafka.common.errors.TimeoutException: Timeout of 3000ms expired before the position for partition failOnDataLoss-2-0 could be determined` triggered a task retry, but since continuous processing doesn't support task retries, it failed with `org.apache.spark.sql.execution.streaming.continuous.ContinuousTaskRetryException: Continuous execution does not support task retry`.
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22106 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2219/ Test PASSed.
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2220/ Test PASSed.
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21308 Merged build finished. Test PASSed.
[GitHub] spark issue #22106: [SPARK-25116][TESTS]Fix the Kafka cluster leak and clean...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22106 Merged build finished. Test PASSed.
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21308 @rxin, I've updated this API to use `Filter` instead of `Expression`. I'd ideally like to get it in soon if you guys have a chance to review it. It's pretty small. cc @cloud-fan
[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21308#discussion_r210382412

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2;
+
+import org.apache.spark.sql.catalyst.expressions.Expression;
+
+/**
+ * A mix-in interface for {@link DataSourceV2} delete support. Data sources can implement this
+ * interface to provide the ability to delete data from tables that matches filter expressions.
+ *
+ * Data sources must implement this interface to support logical operations that combine writing
+ * data with deleting data, like overwriting partitions.
+ */
+public interface DeleteSupport extends DataSourceV2 {
+  /**
+   * Delete data from a data source table that matches filter expressions.
+   *
+   * Rows are deleted from the data source iff all of the filter expressions match. That is, the
+   * expressions must be interpreted as a set of filters that are ANDed together.
+   *
+   * Implementations may reject a delete operation if the delete isn't possible without significant
+   * effort. For example, partitioned data sources may reject deletes that do not filter by
+   * partition columns because the filter may require rewriting files without deleted records.
+   * To reject a delete implementations should throw {@link IllegalArgumentException} with a clear
+   * error message that identifies which expression was rejected.
+   *
+   * Implementations may throw {@link UnsupportedOperationException} if the delete operation is not
--- End diff --

After updating this to use Filter, the UnsupportedOperationException is no longer needed, so I removed it. That should also cut down on the confusion here.
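For context, a sketch of what a source implementing this mix-in might look like after the switch to `Filter` - purely illustrative: the `deleteWhere` method name, the `date` column, and the in-memory storage are assumptions, since the diff above is truncated before the method declaration:

```scala
import scala.collection.mutable
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.sources.v2.{DataSourceV2, DeleteSupport}

// A toy partitioned source that can only drop whole partitions, keyed by
// an exact match on its "date" partition column.
class PartitionedDemoSource extends DataSourceV2 with DeleteSupport {
  private val partitions = mutable.Map.empty[String, Seq[org.apache.spark.sql.Row]]

  override def deleteWhere(filters: Array[Filter]): Unit = filters.foreach {
    case EqualTo("date", value) =>
      partitions.remove(value.toString) // cheap: no file rewrite needed
    case other =>
      // Reject deletes that would require rewriting files, per the javadoc.
      throw new IllegalArgumentException(s"Cannot delete by filter: $other")
  }
}
```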
[GitHub] spark issue #21308: SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21308 **[Test build #94814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94814/testReport)** for PR 21308 at commit [`db77b9a`](https://github.com/apache/spark/commit/db77b9a3f6e2364c2107c5dfd60ff57b1d7caa0f).
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Merged build finished. Test PASSed.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2218/ Test PASSed.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94813/testReport)** for PR 21909 at commit [`6b98f3e`](https://github.com/apache/spark/commit/6b98f3edf19b6ca0887224c598d6f3fa88a762d1).
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21912 **[Test build #94812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94812/testReport)** for PR 21912 at commit [`240936e`](https://github.com/apache/spark/commit/240936e98c6b32855236c7c635bacc0fb04acb76).
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210373934

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private (
     new BisectingKMeansModel(root, this.distanceMeasure)
   }
 
+  /**
+   * Runs the bisecting k-means algorithm.
+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")
--- End diff --

Nit: this should be since 2.4?
[GitHub] spark pull request #22013: [SPARK-23939][SQL] Add transform_keys function
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/22013#discussion_r210373675

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -2302,6 +2302,97 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
     assert(ex5.getMessage.contains("function map_zip_with does not support ordering on type map"))
   }
 
+  test("transform keys function - test various primitive data types combinations") {
+    val dfExample1 = Seq(
+      Map[Int, Int](1 -> 1, 9 -> 9, 8 -> 8, 7 -> 7)
+    ).toDF("i")
+
+    val dfExample2 = Seq(
+      Map[Int, Double](1 -> 1.0E0, 2 -> 1.4E0, 3 -> 1.7E0)
+    ).toDF("j")
+
+    val dfExample3 = Seq(
+      Map[Int, Boolean](25 -> true, 26 -> false)
+    ).toDF("x")
+
+    val dfExample4 = Seq(
+      Map[Array[Int], Boolean](Array(1, 2) -> false)
+    ).toDF("y")
+
+    def testMapOfPrimitiveTypesCombination(): Unit = {
+      checkAnswer(dfExample1.selectExpr("transform_keys(i, (k, v) -> k + v)"),
+        Seq(Row(Map(2 -> 1, 18 -> 9, 16 -> 8, 14 -> 7))))
+
+      checkAnswer(dfExample2.selectExpr("transform_keys(j, " +
+        "(k, v) -> map_from_arrays(ARRAY(1, 2, 3), ARRAY('one', 'two', 'three'))[k])"),
+        Seq(Row(Map("one" -> 1.0, "two" -> 1.4, "three" -> 1.7))))
+
+      checkAnswer(dfExample2.selectExpr("transform_keys(j, (k, v) -> CAST(v * 2 AS BIGINT) + k)"),
+        Seq(Row(Map(3 -> 1.0, 4 -> 1.4, 6 -> 1.7))))
+
+      checkAnswer(dfExample2.selectExpr("transform_keys(j, (k, v) -> k + v)"),
+        Seq(Row(Map(2.0 -> 1.0, 3.4 -> 1.4, 4.7 -> 1.7))))
+
+      checkAnswer(dfExample3.selectExpr("transform_keys(x, (k, v) -> k % 2 = 0 OR v)"),
+        Seq(Row(Map(true -> true, true -> false))))
+
+      checkAnswer(dfExample3.selectExpr("transform_keys(x, (k, v) -> if(v, 2 * k, 3 * k))"),
+        Seq(Row(Map(50 -> true, 78 -> false))))
+
+      checkAnswer(dfExample3.selectExpr("transform_keys(x, (k, v) -> if(v, 2 * k, 3 * k))"),
+        Seq(Row(Map(50 -> true, 78 -> false))))
+
+      checkAnswer(dfExample4.selectExpr("transform_keys(y, (k, v) -> array_contains(k, 3) AND v)"),
+        Seq(Row(Map(false -> false))))
+    }
+    // Test with local relation, the Project will be evaluated without codegen
+    testMapOfPrimitiveTypesCombination()
+    dfExample1.cache()
+    dfExample2.cache()
+    dfExample3.cache()
+    dfExample4.cache()
+    // Test with cached relation, the Project will be evaluated with codegen
+    testMapOfPrimitiveTypesCombination()
+  }
+
+  test("transform keys function - Invalid lambda functions and exceptions") {
+    val dfExample1 = Seq(
+      Map[Int, Int](1 -> 1, 9 -> 9, 8 -> 8, 7 -> 7)
+    ).toDF("i")
+
+    val dfExample2 = Seq(
+      Map[String, String]("a" -> "b")
+    ).toDF("j")
+
+    val dfExample3 = Seq(
+      Map[String, String]("a" -> null)
+    ).toDF("x")
+
+    def testInvalidLambdaFunctions(): Unit = {
+      val ex1 = intercept[AnalysisException] {
+        dfExample1.selectExpr("transform_keys(i, k -> k )")
+      }
+      assert(ex1.getMessage.contains("The number of lambda function arguments '1' does not match"))
+
+      val ex2 = intercept[AnalysisException] {
+        dfExample2.selectExpr("transform_keys(j, (k, v, x) -> k + 1)")
+      }
+      assert(ex2.getMessage.contains(
+        "The number of lambda function arguments '3' does not match"))
+
+      val ex3 = intercept[RuntimeException] {
+        dfExample3.selectExpr("transform_keys(x, (k, v) -> v)").show()
+      }
+      assert(ex3.getMessage.contains("Cannot use null as map key!"))
+    }
+
+    testInvalidLambdaFunctions()
+    dfExample1.cache()
+    dfExample2.cache()
+    testInvalidLambdaFunctions()
--- End diff --

@ueshin I would like to ask you a generic question regarding higher-order functions. Is it necessary to perform checks with codegen paths if all the newly added functions extend ```CodegenFallback```? Eventually, is there a plan to add codegen for these functions in the future?
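For readers following the question: `CodegenFallback` lets an expression participate in a code-generated plan without codegen of its own - its generated code simply calls back into the interpreted `eval`. A simplified model of the idea (not Spark's actual trait):

```scala
// Simplified model: the "generated" code for a fallback expression is just
// a call back into its interpreted eval(), so the enclosing operator can
// still be compiled while this one expression stays interpreted. That is
// why the cached (codegen) path above can still behave differently from
// the local-relation path: the surrounding Project is compiled either way,
// even though the expression itself never is.
trait FallbackExpr {
  def eval(row: Seq[Any]): Any
  // Source the surrounding generated operator would embed for this node.
  def genCode(refName: String): String =
    s"Object result = $refName.eval(row); // interpreted fallback"
}
```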
[GitHub] spark issue #22084: [SPARK-25026][BUILD] Binary releases should contain some...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22084 I'm going to say let's leave this alone for now then if there isn't consensus. We can easily reopen later if someone wanted to just add the JARs.
[GitHub] spark pull request #22084: [SPARK-25026][BUILD] Binary releases should conta...
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/22084
[GitHub] spark pull request #22013: [SPARK-23939][SQL] Add transform_keys function
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/22013#discussion_r210366383

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala ---
@@ -497,6 +497,62 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Transform Keys for every entry of the map by applying the transform_keys function.
+ * Returns map with transformed key entries
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(expr, func) - Transforms elements in a map using the function.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 1);
+       map(array(2, 3, 4), array(1, 2, 3))
+      > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
+       map(array(2, 4, 6), array(1, 2, 3))
+  """,
+  since = "2.4.0")
+case class TransformKeys(
+    argument: Expression,
+    function: Expression)
+  extends MapBasedSimpleHigherOrderFunction with CodegenFallback {
+
+  override def nullable: Boolean = argument.nullable
+
+  @transient lazy val MapType(keyType, valueType, valueContainsNull) = argument.dataType
+
+  override def dataType: DataType = {
+    MapType(function.dataType, valueType, valueContainsNull)
--- End diff --

nit: just in one line?
[GitHub] spark pull request #22013: [SPARK-23939][SQL] Add transform_keys function
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/22013#discussion_r210368929

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala ---
@@ -497,6 +497,62 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Transform Keys for every entry of the map by applying the transform_keys function.
+ * Returns map with transformed key entries
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(expr, func) - Transforms elements in a map using the function.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 1);
+       map(array(2, 3, 4), array(1, 2, 3))
+      > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v);
+       map(array(2, 4, 6), array(1, 2, 3))
+  """,
+  since = "2.4.0")
+case class TransformKeys(
+    argument: Expression,
+    function: Expression)
+  extends MapBasedSimpleHigherOrderFunction with CodegenFallback {
+
+  override def nullable: Boolean = argument.nullable
+
+  @transient lazy val MapType(keyType, valueType, valueContainsNull) = argument.dataType
+
+  override def dataType: DataType = {
+    MapType(function.dataType, valueType, valueContainsNull)
+  }
+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): TransformKeys = {
+    copy(function = f(function, (keyType, false) :: (valueType, valueContainsNull) :: Nil))
+  }
+
+  @transient lazy val LambdaFunction(
+    _, (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: Nil, _) = function
+
+  override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
+    val map = argumentValue.asInstanceOf[MapData]
+    val f = functionForEval
--- End diff --

Can't we use ```functionForEval``` directly?
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537

AnalysisBarrier does not introduce a behavior change. However, it requires that our analyzer rules be idempotent. The most recent correctness bug also shows another big potential hole: https://issues.apache.org/jira/browse/SPARK-25051. It could break analyzer rules without notice, since AnalysisBarrier is a leaf node. Basically, this has been a disaster for the Spark 2.3 release. You can count how many regressions we found after the release. This is bad. Unfortunately, I made the merge without enough consideration. If possible, I would revert that PR, but it is too late.

Now, we are facing a similar issue again. We should stop continuing in this direction without a proper design and review. Designing/enhancing the codegen framework requires more input from the experts in this area. I do not think the current design was discussed enough, even if I saw some discussion in the initial PR. I am not an expert in this area. That is why I pinged @kiszk to drive this.

BTW, internal APIs we can change easily, but the design needs a review, especially for the core components of Spark SQL.
[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22101 Merged build finished. Test PASSed.
[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22101 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94803/ Test PASSed.
[GitHub] spark issue #22101: [SPARK-25114][Core] Fix RecordBinaryComparator when subt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22101 **[Test build #94803 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94803/testReport)** for PR 22101 at commit [`9c1f486`](https://github.com/apache/spark/commit/9c1f4860ea794e60a127ae3ec2d1b518a182edf5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94809/ Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Merged build finished. Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22042 **[Test build #94809 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94809/testReport)** for PR 22042 at commit [`a9b00b4`](https://github.com/apache/spark/commit/a9b00b4a22f0b6b364cd1b35e2d99923d8b233dc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94808/ Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Merged build finished. Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22042 **[Test build #94808 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94808/testReport)** for PR 22042 at commit [`baef29f`](https://github.com/apache/spark/commit/baef29f2983560c8010681c9bb7e74f711c8f2e7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20637 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94806/ Test FAILed.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20637 Merged build finished. Test FAILed.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21537

I am sorry if I misunderstood. This was discussed sufficiently in the linked PRs. These are internal APIs, and AnalysisBarrier did not introduce behaviour changes; it was a good enough idea to try because it brought many benefits by reducing re-analysis of plans - it resolved many JIRAs. On the other hand, I don't think AnalysisBarrier was particularly a disaster. Sometimes we should take a step forward and see whether it causes an actual problem or not. That was a good enough try.

When something should be reverted, there should usually be specific concerns; otherwise, I wouldn't revert this for vague concerns for now. This is an internal API that we can freely change later as well. Also, @viirya made a lot of effort here and he's pretty active. I don't think dragging someone else into this is a good idea for now.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20637 **[Test build #94806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94806/testReport)** for PR 20637 at commit [`99731ca`](https://github.com/apache/spark/commit/99731ca058f0d8946397530aff76d3c55fa93162).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Schema(dataType: DataType, nullable: Boolean)`
[GitHub] spark issue #22013: [SPARK-23939][SQL] Add transform_keys function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22013 **[Test build #94811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94811/testReport)** for PR 22013 at commit [`fb885f4`](https://github.com/apache/spark/commit/fb885f4797e72d0c2cbfa23980199c71e0c5aaee).
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22001 Merged build finished. Test PASSed.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Merged build finished. Test PASSed.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2217/ Test PASSed.
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22001 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94801/ Test PASSed.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Merged build finished. Test FAILed.
[GitHub] spark issue #22001: [SPARK-24819][CORE] Fail fast when no enough slots to la...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22001 **[Test build #94801 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94801/testReport)** for PR 22001 at commit [`c9036aa`](https://github.com/apache/spark/commit/c9036aab22cfd6b7a4939f4b23741612706ba2a6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21912 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94810/ Test FAILed.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21912 **[Test build #94810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94810/testReport)** for PR 21912 at commit [`b899259`](https://github.com/apache/spark/commit/b899259125ed2c7b94c0b121d6cbdeb4ee90285a).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21912 **[Test build #94810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94810/testReport)** for PR 21912 at commit [`b899259`](https://github.com/apache/spark/commit/b899259125ed2c7b94c0b121d6cbdeb4ee90285a).
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/22110#discussion_r210357474

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala ---
@@ -73,4 +73,14 @@ object TypeUtils {
     }
     x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a hash set or as keys
--- End diff --

What about changing the name of the method to ```typeCanBeHashed```?
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 This is not about the credit. I think we need to introduce an IR like what compilers do, instead of continuously improving the existing one, which is already very hacky and hard to debug and maintain. We are wasting effort and resources if we keep following the current direction.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22042 **[Test build #94809 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94809/testReport)** for PR 22042 at commit [`a9b00b4`](https://github.com/apache/spark/commit/a9b00b4a22f0b6b364cd1b35e2d99923d8b233dc).
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2216/ Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Merged build finished. Test PASSed.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21537 Sure, I am fine with @kiszk leading the effort, of course; I just wanted to give credit to @viirya for the (current) design, as it was mostly done by him.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22042 **[Test build #94808 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94808/testReport)** for PR 22042 at commit [`baef29f`](https://github.com/apache/spark/commit/baef29f2983560c8010681c9bb7e74f711c8f2e7).
[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22105#discussion_r210354443

--- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java ---
@@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOExcept
     // SPARK-24578: cap the sub-region's size of returned nio buffer to improve the performance
     // for the case that the passed-in buffer has too many components.
     int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
--- End diff --

Maybe @normanmaurer has some insight?
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Merged build finished. Test PASSed.
[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22042 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2215/ Test PASSed.
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22110#discussion_r210353947

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala ---
@@ -73,4 +73,14 @@ object TypeUtils {
     }
     x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a hash set or as keys
--- End diff --

I think it is a common pattern for every well-formed class to have equals and hashCode defined in a coherent way. What we are doing here is saying either: this class has a meaningful equals method, so we can rely on it; or this class has a meaningless equals method, like the default one comparing pointers, so we cannot rely on it. I am open to any suggestion too; I'd only like to have a description that is coherent with the method name, otherwise I feel that either one or the other has to be changed in order to properly reflect what the method does.
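A sketch of what such a predicate could look like - the name and exact type whitelist are illustrative, since the diff above is truncated before the method body; `BinaryType` is the classic case of a "meaningless" equals, because `Array[Byte]` compares by reference:

```scala
import org.apache.spark.sql.types._

// Hypothetical body: types whose runtime values have content-based
// equals/hashCode can back a hash set or serve as hash-map keys.
def typeWithProperEquals(dataType: DataType): Boolean = dataType match {
  case BinaryType => false  // Array[Byte] uses reference equality
  case _: AtomicType => true
  case _ => false
}
```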
[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21698

So I think the assumption is that task results are idempotent but not ordered. Sorry if that sounds contradictory: the data itself has to be the same on rerun, but the order of the things in it doesn't. That was my general assumption. I think zip doesn't follow that, though, when the inputs aren't ordered. Not sure if there are others Spark supports; need to go through the list I guess, unless someone already has?

I think we just need to document these operations and say the results can be inconsistent if not sorted, or perhaps give users an option to also sort. Either that, or we have to say we don't support unordered output at all in Spark. Thoughts on just documenting zip, or others with unordered input?

I don't think MapReduce and Pig have this issue because they don't internally support an operation like zip; everything is based on key/values and joins or group-bys on the keys. User code there could generate it as well, but I would claim it's the user's fault there.
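A sketch of the `zip` hazard mentioned above - illustrative only, assuming a local `SparkContext` named `sc`:

```scala
import scala.util.Random

val left = sc.parallelize(1 to 6, 2)

// Same element set on every run, but non-deterministic order within
// each partition -- exactly the kind of input zip cannot tolerate.
val right = sc.parallelize(1 to 6, 2)
  .mapPartitions(it => Random.shuffle(it.toList).iterator)

// zip pairs elements positionally within each partition, so the pairs
// depend on the unstable ordering of `right`. If one of its partitions
// is recomputed after a fetch failure, pairs emitted before and after
// the recomputation can disagree.
right.zip(left).collect().foreach(println)
```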
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 This is bad. The design needs to be carefully reviewed before implementing it. Basically, we break the basic principles of software engineering. It is very strange to write the design doc after introducing such a fundamental API change. I think we need more involvement from existing experts in compiler internals. That is why I proposed letting @kiszk lead this. We are not inventing anything new in our codegen framework.
[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/22105#discussion_r210353176 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java --- @@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOExcept // SPARK-24578: cap the sub-region's size of returned nio buffer to improve the performance // for the case that the passed-in buffer has too many components. int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT); --- End diff -- That was only briefly discussed in the PR that Ryan linked above... the original code actually used 512k. I think Hadoop's limit is a little low, but maybe 256k is a bit high. IIRC socket buffers are 32k by default on Linux, so it seems unlikely you'd be able to write 256k in one call (ignoring what IOUtil does internally). But maybe in practice it works OK. If anyone has the time to test this out, that would be great.
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21537 @gatorsmile I think this is just the partial adoption of what was proposed [here](https://github.com/apache/spark/pull/19813#issuecomment-354045400) and implemented in SPARK-24121 (https://github.com/apache/spark/pull/21193). Although I agree with you that we should have created a design doc for this, I think it is a bit late for one now. Indeed, the base API is already there, since the design was mostly done in SPARK-24121 (and not in the scope of this PR). What is missing now, though, is thorough adoption. There are already many PRs built on that API: SPARK-22856, SPARK-23951. And the current codebase already partially adopts it. So I am not sure reverting this PR would be a solution: we would have to revert many PRs if we want to change the current design. I think a design doc can be created anyway, at least to formalize what was discussed and agreed in the various PRs and to have a single source of information. We can also make some modifications to the current design based on comments that come out of discussing the design doc, but I see those more as further changes than as grounds for reverting this, since that would mean reverting many things. The last thing I'd suggest is that @viirya is probably the best person to create the design doc, since the original proposal and implementation were his. Thanks.
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/22110#discussion_r210350632 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala --- @@ -73,4 +73,14 @@ object TypeUtils { } x.length - y.length } + + /** + * Returns true if elements of the data type could be used as items of a hash set or as keys --- End diff -- I'm open to any changes :) But if you want to explicitly mention the ```equals``` method, I would also mention ```hashCode```, which is generally needed for usage in "hash" collections. But then this is not 100% true for Spark's specialized ```OpenHashSets``` and ```OpenHashMaps```, since they compute hashes themselves. WDYT?
[GitHub] spark pull request #22105: [SPARK-25115] [Core] Eliminate extra memory copy ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22105#discussion_r210350203 --- Diff: common/network-common/src/main/java/org/apache/spark/network/protocol/MessageWithHeader.java --- @@ -140,8 +140,24 @@ private int copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOExcept // SPARK-24578: cap the sub-region's size of returned nio buffer to improve the performance // for the case that the passed-in buffer has too many components. int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT); --- End diff -- Out of curiosity, how did we come up with 256KB for NIO_BUFFER_LIMIT? In Hadoop, they are using [8KB](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L3234). For most OSes, the `write(ByteBuffer[])` API in `sun.nio.ch.IOUtil` goes one buffer at a time and gets a temporary direct buffer from the `BufferCache`, up to a limit of `IOUtil#IOV_MAX`, which is 1KB.
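The pattern under discussion, transliterated into a short Scala sketch (hand-written; `copyCapped` is a hypothetical name, the real method is `MessageWithHeader.copyByteBuf` in Java and may differ): cap how much of the ByteBuf is exposed to the channel per write() call, so the temporary direct buffer that `sun.nio.ch.IOUtil` allocates stays bounded no matter how large or fragmented the buffer is.

```scala
import java.nio.channels.WritableByteChannel
import io.netty.buffer.ByteBuf

val NIO_BUFFER_LIMIT = 256 * 1024 // the 256KB cap debated in this thread

def copyCapped(buf: ByteBuf, target: WritableByteChannel): Int = {
  // Expose only a bounded sub-region of the (possibly composite) buffer,
  // so a huge ByteBuf is flushed in NIO_BUFFER_LIMIT-sized slices.
  val length = math.min(buf.readableBytes(), NIO_BUFFER_LIMIT)
  val written = target.write(buf.nioBuffer(buf.readerIndex(), length))
  buf.skipBytes(written) // advance past whatever the channel accepted
  written
}
```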
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r210350101 --- Diff: docs/sql-programming-guide.md --- @@ -1892,6 +1892,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files. - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior. - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`. + - Since Spark 2.4, text-based datasources like CSV and JSON don't parse input lines if the required schema pushed down to the datasources is empty. The schema can be empty in the case of count(), for example. To set `true` to `spark.sql.legacy.bypassParserForEmptySchema` restores the previous behavior when the underlying parser is always invoked even for the empty schema. This option will be removed in Spark 3.0. --- End diff -- > Does it also mean the result of count() can change if the json/csv files contain malformed records? @cloud-fan The only difference I have found so far is when a whole JSON file is invalid, i.e. it has the wrong encoding, for example. In other cases `count()` returns the same result.
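A short usage sketch of the behavior change described above (the file path is hypothetical, and the config name is the one quoted in the migration-guide diff, which may not match what is finally merged):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// count() requires no columns, so the schema pushed down to the JSON
// datasource is empty and, with this change, input lines are not parsed.
val fast = spark.read.json("/tmp/events.json").count()

// Per the quoted migration note, setting the legacy flag restores the
// 2.3 behavior of always invoking the parser, even for an empty schema.
spark.conf.set("spark.sql.legacy.bypassParserForEmptySchema", "true")
val legacy = spark.read.json("/tmp/events.json").count()
```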
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/21889 > Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR. That would be my preference, yes, especially if it means less work for you.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/21320 > @mallman if you're planning on making more code changes, would you be willing to work on a shared branch or something? I've been working to incorporate the CR comments. No; however, if you want to open a PR against the VideoAmp spark-4502-parquet_column_pruning-foundation branch, I will review your changes.
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22009 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94802/ Test PASSed.
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22009 Merged build finished. Test PASSed.
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22009 **[Test build #94802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94802/testReport)** for PR 22009 at commit [`e6e599a`](https://github.com/apache/spark/commit/e6e599a9630801078c046d3aec398cf4f046945c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21123: [SPARK-24045][SQL]Create base class for file data source...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21123 > We should do things incrementally and always prepare for the worst case. This is why I'm pushing for Append support and adding interfaces to finish the logical plans. Releasing logical plans as we get them tested and working *is* the incremental solution. Appending to existing tables and having full read support is fine for now.
[GitHub] spark issue #21123: [SPARK-24045][SQL]Create base class for file data source...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21123 > this mapping is not mentioned in the logical plan standardization design doc and I doubt if it's doable I agree! This is why I propose we add an entirely new API for v2 with clear behavior. This is also why I'm working on the logical plans and interfaces instead of the public API right now. > I don't agree according to the fact that these new logical plans are not ready. I don't think that wanting to use v2 before it is ready is a justification for releasing a path with unpredictable behavior that we don't intend to continue supporting. We should let the v1 write API continue to use v1 behavior and fix behavior for v2.
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22110#discussion_r210343741 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala --- @@ -73,4 +73,14 @@ object TypeUtils { } x.length - y.length } + + /** + * Returns true if elements of the data type could be used as items of a hash set or as keys --- End diff -- I think this comment is not very coherent with the method name. I think we can rephrase it to something like: ``` Returns true if the equals method of the elements of the data type is implemented properly. This also means that they can be safely used in collections relying on the equals method, such as sets or maps. ``` What do you think?
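A hypothetical sketch of the helper under review, reconstructed only from the javadoc quoted in this thread (the actual name, visibility, and cases in the patch may differ; this would live alongside TypeUtils in org.apache.spark.sql.catalyst.util):

```scala
import org.apache.spark.sql.types._

def typeWithProperEquals(dataType: DataType): Boolean = dataType match {
  // BinaryType is backed by Array[Byte], whose default equals/hashCode
  // compare references, so it cannot safely key a hash set or map.
  case BinaryType => false
  // Other atomic types (numeric, string, boolean, date, ...) define
  // value-based equals/hashCode and are safe.
  case _: AtomicType => true
  // Complex and user-defined types are conservatively rejected.
  case _ => false
}
```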