[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22957 **[Test build #99375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99375/testReport)** for PR 22957 at commit [`0491249`](https://github.com/apache/spark/commit/049124971eb9d0f84eb98cc33c4cccd65d7a42b7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23031 **[Test build #99374 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99374/testReport)** for PR 23031 at commit [`4e7b6bb`](https://github.com/apache/spark/commit/4e7b6bb704f2aff1c091b73c08a167c4a6478534). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22957 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22957 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5451/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23086#discussion_r237078844 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java --- @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import org.apache.spark.annotation.Evolving; +import org.apache.spark.sql.sources.DataSourceRegister; +import org.apache.spark.sql.types.StructType; + +/** + * The base interface for v2 data sources which don't have a real catalog. Implementations must + * have a public, 0-arg constructor. + * + * The major responsibility of this interface is to return a {@link Table} for read/write. + */ +@Evolving +// TODO: do not extend `DataSourceV2`, after we finish the API refactor completely. +public interface TableProvider extends DataSourceV2 { + + /** + * Return a {@link Table} instance to do read/write with user-specified options. + * + * @param options the user-specified options that can identify a table, e.g. file path, Kafka + *topic name, etc. It's an immutable case-insensitive string-to-string map. + */ + Table getTable(DataSourceOptions options); + + /** + * Return a {@link Table} instance to do read/write with user-specified schema and options. + * + * By default this method throws {@link UnsupportedOperationException}, implementations should --- End diff -- what I learned is that, we should only declare checked exceptions. See http://www.javapractices.com/topic/TopicAction.do?Id=171 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23162 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99365/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23162 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23162 **[Test build #99365 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99365/testReport)** for PR 23162 at commit [`e9aba19`](https://github.com/apache/spark/commit/e9aba19b526610f3f31fa6a5b56140f6be8dc1c1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237076213 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { +case a: Alias => sameResult(a.child) +case _ => this.semanticEquals(other) --- End diff -- we can do ``` CleanupAliases.trimAliases(this) semanticEquals CleanupAliases.trimAliases(other) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23031#discussion_r237075228 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2715,4 +2715,11 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } + + test("set command rejects SparkConf entries") { +val ex = intercept[AnalysisException] { + sql("SET spark.task.cpus = 4") --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23031#discussion_r237075170 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/RuntimeConfigSuite.scala --- @@ -68,4 +68,13 @@ class RuntimeConfigSuite extends SparkFunSuite { assert(!conf.isModifiable("")) assert(!conf.isModifiable("invalid config parameter")) } + + test("reject SparkConf entries") { +val conf = newConf() + +val ex = intercept[AnalysisException] { + conf.set("spark.task.cpus", 4) --- End diff -- can we use `config.CPUS_PER_TASK` instead of hardcoding it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237075431 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { +case a: Alias => sameResult(a.child) +case _ => this.semanticEquals(other) --- End diff -- well, it needs to be overridden by `HashPartitioning` too, so since I am not able to make it final anyway, I don't think it is a good idea. Well, I can add a match on `HashPartitioning`too, but it doesn't seem a clean solution to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23031#discussion_r237074727 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala --- @@ -154,5 +154,9 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new SQLConf) { if (SQLConf.staticConfKeys.contains(key)) { throw new AnalysisException(s"Cannot modify the value of a static config: $key") } +if (sqlConf.setCommandRejectsSparkConfs && +ConfigEntry.findEntry(key) != null && !SQLConf.sqlConfEntries.containsKey(key)) { --- End diff -- we should only reject configs that are registered as SparkConf. Thinking about configs that are either a SparkConf or SQLConf, we shouldn't reject it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23031 **[Test build #99373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99373/testReport)** for PR 23031 at commit [`1f703aa`](https://github.com/apache/spark/commit/1f703aaa6597d43654d4fd7d22feb7d2e4b73f9d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23031 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5449/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23031 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237073129 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { +case a: Alias => sameResult(a.child) +case _ => this.semanticEquals(other) --- End diff -- I think it is doable, but I didn't want to put too many `match` where it was not needed. But if you prefer that way I can try and do that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23031#discussion_r237071928 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala --- @@ -154,5 +154,9 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new SQLConf) { if (SQLConf.staticConfKeys.contains(key)) { throw new AnalysisException(s"Cannot modify the value of a static config: $key") } +if (sqlConf.setCommandRejectsSparkConfs && +ConfigEntry.findEntry(key) != null && !SQLConf.sqlConfEntries.containsKey(key)) { --- End diff -- I'm sorry for the delay. As per the comment https://github.com/apache/spark/pull/22887#issuecomment-442425557, I'd leave it as is for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237070739 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { +case a: Alias => sameResult(a.child) +case _ => this.semanticEquals(other) --- End diff -- can we also strip the alias of this here? so that we can mark `sameResult` as final. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237070496 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { --- End diff -- Sure, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237069486 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { --- End diff -- can you put it in the method doc(both `semanticEquals` and `sameResult`)? This makes sense to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23153 a late LGTM as well, thanks @cloud-fan for the patch and thanks @xuanyuanking for the review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237068718 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { --- End diff -- remove `Alias` is not possible for the reason explained in https://github.com/apache/spark/pull/22957#issuecomment-436992955. In general, `semanticEquals` should be used when we want to replace an expression with another, while `sameResult` should be used in order to check that 2 expressions return the same output. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23153: [SPARK-26147][SQL] only pull out unevaluable pyth...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23153 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23052 **[Test build #99372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99372/testReport)** for PR 23052 at commit [`586ab31`](https://github.com/apache/spark/commit/586ab316ed2b9bce07a879dc89766dc854807c21). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23153 thanks, merging to master/2.4! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23164 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5448/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23164 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237065584 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala --- @@ -33,26 +33,21 @@ class FailureSafeParser[IN]( private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord) private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord)) private val resultRow = new GenericInternalRow(schema.length) - private val nullResult = new GenericInternalRow(schema.length) // This function takes 2 parameters: an optional partial result, and the bad record. If the given // schema doesn't contain a field for corrupted record, we just return the partial result or a // row with all fields null. If the given schema contains a field for corrupted record, we will // set the bad record to this field, and set other fields according to the partial result or null. private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = { -if (corruptFieldIndex.isDefined) { - (row, badRecord) => { -var i = 0 -while (i < actualSchema.length) { - val from = actualSchema(i) - resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull - i += 1 -} -resultRow(corruptFieldIndex.get) = badRecord() -resultRow +(row, badRecord) => { --- End diff -- For now JSON does not support this. Need additional changes in JacksonParser to return partial results. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237065506 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { --- End diff -- "erase the name" can also mean remove `Alias`. If we can't clearly tell the difference between `semanticEquals` and `sameResult`, and give a guideline about using which one in which case, I think we should just update `semanticEquals`(i.e. `Canonicalize`). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 Actually it needs similar changes like in https://github.com/apache/spark/pull/23130 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/23128 Thanks @cloud-fan @gatorsmile @rxin ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23164 **[Test build #99371 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99371/testReport)** for PR 23164 at commit [`0310186`](https://github.com/apache/spark/commit/03101868a72b5ae68bf6324e627f1874af32f040). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99362/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23124 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237062653 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- do you mean `wholetext` mode will force to create one partition per file? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23164: [SPARK-26198][SQL] Fix Metadata serialize null va...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/23164 [SPARK-26198][SQL] Fix Metadata serialize null values throw NPE ## What changes were proposed in this pull request? How to reproduce this issue: ```scala scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build() java.lang.NullPointerException at org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) ``` This pr fix `NullPointerException` when `Metadata` serialize `null` values. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-26198 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23164.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23164 commit 03101868a72b5ae68bf6324e627f1874af32f040 Author: Yuming Wang Date: 2018-11-28T12:22:09Z Fix Metadata serialize null values throw NPE --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23083: [SPARK-26114][CORE] ExternalSorter's readingItera...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23083 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237062190 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { --- End diff -- that is reasonable but it doesn't solve the problem stated in the JIRA. So the goal here is to avoid that something like `a as b` is considered different from `a` in terms of ordering/distribution. If we just erase the name of alias, the 2 expression would still be different because of the presence of `Alias` itself would make the 2 expressions different. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23124 **[Test build #99362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99362/testReport)** for PR 23124 at commit [`6dff654`](https://github.com/apache/spark/commit/6dff6545f272e0d5117ac17fdc27b686573c5626). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23083 thanks, merging to master/2.4! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23128 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23128 thanks , merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237059126 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -105,5 +105,16 @@ abstract class SparkFunSuite logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n") } } - + /** + * Creates a temporary directory, which is then passed to `f` and will be deleted after `f` + * returns. + * + * @todo Probably this method should be moved to a more general place + */ + protected def withCreateTempDir(f: File => Unit): Unit = { --- End diff -- if we have this function in `SparkFunSuite`, why do we need to define it again in `SQLTestUtils`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23083 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99360/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23083 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23151#discussion_r237058872 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -105,5 +105,16 @@ abstract class SparkFunSuite logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n") } } - + /** + * Creates a temporary directory, which is then passed to `f` and will be deleted after `f` + * returns. + * + * @todo Probably this method should be moved to a more general place --- End diff -- I think this is the most general place we can find... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23132#discussion_r237058331 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide ## Upgrading From Spark SQL 2.4 to 3.0 + - In Spark version 2.4 and earlier, accepted format of decimals parsed from JSON is an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent. Any commas were removed from the input before parsing. Since Spark 3.0, format varies and depends on locale which can be set via JSON option `locale`. The default locale is `en-US`. To switch back to previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`. --- End diff -- I have the same question. Do we need the `DecimalFormat` when locale is `en-US`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23083 **[Test build #99360 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99360/testReport)** for PR 23083 at commit [`1723819`](https://github.com/apache/spark/commit/17238196719de1e68cbcb1eb930cb3176308e437). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22887: [SPARK-25880][CORE] user set's hadoop conf should not ov...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22887 Spark SQL SET command can't update any static config or Spark core configs, but I think hadoop configs are different. It's not static as users can update it via `SparkContext.hadoopConfiguration`. `SparkSession.SessionState.newHadoopConf()` is a mechanism to allow users to set hadoop config per-session in Spark SQL. So it's reasonble for users to expect that, if they set hadoop config via the SQL SET command, it should override the one in `spark-defaults.conf`. Looking back at `appendS3AndSparkHadoopConfigurations`, it has 2 parameters: spark conf and hadoop conf. The spark conf comes from `spark-defaults.conf` and any user provided configs when building the `SparkContext`. The user provided configs override `spark-defaults.conf`. The hadoop conf is either an empty config(if `appendS3AndSparkHadoopConfigurations` is called from `SparkHadoopUtil.newHadoopConfiguration`), or from `SparkSession.SessionState.newHadoopConf()`(if `appendS3AndSparkHadoopConfigurations` is called from `HadoopTableReader`). For the first case, nothing we need to worry about. For the second case, I think the hadoop config should take priority, as it contains the configs specified by users at rutime. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22514 **[Test build #99370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99370/testReport)** for PR 22514 at commit [`6e2a31b`](https://github.com/apache/spark/commit/6e2a31b0db71ab3dcc8a81cb5ca585f784793736). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5447/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22514#discussion_r237052582 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand( override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] = { --- End diff -- I created a Hive CTAS with data source command. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22514 **[Test build #99369 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99369/testReport)** for PR 22514 at commit [`796e964`](https://github.com/apache/spark/commit/796e9640726c6f5ed4be7dd356245ff121f4dea6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5446/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22514 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99363/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23128 **[Test build #99363 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99363/testReport)** for PR 23128 at commit [`8e84c5b`](https://github.com/apache/spark/commit/8e84c5bbfc4b9151310bce84c1506c6aad449011). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23153 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23153 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99358/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237050065 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- > does this test fail without your change? Yes, it does due to the `wholetext`. > Is JSON the only data source that may return a row for empty file? We depend on underlying parser here. I will check CSV and Text. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23153 **[Test build #99358 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99358/testReport)** for PR 23153 at commit [`7b985d8`](https://github.com/apache/spark/commit/7b985d84cb0fd853d40610b2380313389791298e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22957#discussion_r237049166 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -195,14 +195,26 @@ abstract class Expression extends TreeNode[Expression] { } /** - * Returns true when two expressions will always compute the same result, even if they differ + * Returns true when two expressions will always compute the same output, even if they differ * cosmetically (i.e. capitalization of names in attributes may be different). * * See [[Canonicalize]] for more details. */ def semanticEquals(other: Expression): Boolean = deterministic && other.deterministic && canonicalized == other.canonicalized + /** + * Returns true when two expressions will always compute the same result, even if the output may + * be different, because of different names or similar differences. + * Usually this means they their canonicalized form equals, but it may also not be the case, as + * different output expressions can evaluate to the same result as well (eg. when an expression + * is aliased). + */ + def sameResult(other: Expression): Boolean = other match { --- End diff -- I know it's always safer to introduce a new API, does is it really necessary? In `Canonicalize`, we erase the name for attributes, I think it's reasonable to erase the name of `Alias`, as it doesn't affect the output. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99359/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23128 **[Test build #99359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99359/testReport)** for PR 23128 at commit [`d12ea31`](https://github.com/apache/spark/commit/d12ea311e58e7925f21d343e5de13bfec6737549). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237046616 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala --- @@ -33,26 +33,21 @@ class FailureSafeParser[IN]( private val corruptFieldIndex = schema.getFieldIndex(columnNameOfCorruptRecord) private val actualSchema = StructType(schema.filterNot(_.name == columnNameOfCorruptRecord)) private val resultRow = new GenericInternalRow(schema.length) - private val nullResult = new GenericInternalRow(schema.length) // This function takes 2 parameters: an optional partial result, and the bad record. If the given // schema doesn't contain a field for corrupted record, we just return the partial result or a // row with all fields null. If the given schema contains a field for corrupted record, we will // set the bad record to this field, and set other fields according to the partial result or null. private val toResultRow: (Option[InternalRow], () => UTF8String) => InternalRow = { -if (corruptFieldIndex.isDefined) { - (row, badRecord) => { -var i = 0 -while (i < actualSchema.length) { - val from = actualSchema(i) - resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, from.dataType)).orNull - i += 1 -} -resultRow(corruptFieldIndex.get) = badRecord() -resultRow +(row, badRecord) => { --- End diff -- without this change in `FailureSafeParser`, does JSON support returning partial result? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23120#discussion_r237046251 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala --- @@ -243,21 +243,27 @@ class UnivocityParser( () => getPartialResult(), new RuntimeException("Malformed CSV record")) } else { - try { -// When the length of the returned tokens is identical to the length of the parsed schema, -// we just need to convert the tokens that correspond to the required columns. -var i = 0 -while (i < requiredSchema.length) { + // When the length of the returned tokens is identical to the length of the parsed schema, + // we just need to convert the tokens that correspond to the required columns. + var badRecordException: Option[Throwable] = None + var i = 0 + while (i < requiredSchema.length) { --- End diff -- shall we stop parsing when we hit the first exception? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23130#discussion_r237045706 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala --- @@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with SharedSQLContext with BeforeAndA assert(e.contains(s"Partition column `$unknown` not found in schema $schemaCatalog")) } } + + test("skip empty files in non bucketed read") { +withTempDir { dir => + val path = dir.getCanonicalPath + Files.write(Paths.get(path, "empty"), Array.empty[Byte]) + Files.write(Paths.get(path, "notEmpty"), "a".getBytes(StandardCharsets.UTF_8)) + val readback = spark.read.option("wholetext", true).text(path) + + assert(readback.rdd.getNumPartitions === 1) --- End diff -- does this test fail without your change? IIUC one partition can read multiple files. Is JSON the only data source that may return a row for empty file? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23124 **[Test build #99368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99368/testReport)** for PR 23124 at commit [`72c771e`](https://github.com/apache/spark/commit/72c771e691ae1071742bde5a612593d38f147ff5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23124 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5445/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21004#discussion_r237040743 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala --- @@ -126,35 +126,32 @@ abstract class PartitioningAwareFileIndex( val caseInsensitiveOptions = CaseInsensitiveMap(parameters) val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION) .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone) - -userPartitionSchema match { +val inferredPartitionSpec = PartitioningUtils.parsePartitions( + leafDirs, + typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled, --- End diff -- Before this patch, there was a subtle difference between with and without a user-provided partition schema: 1. with user-provided partition schema, we should not infer data types. We should infer as string and cast to user-provided type 2. without user-provided partition schema, we should infer the data type(with a config) So it was wrong to unify these 2 code paths. @gengliangwang can you change it back? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23052 > seems like a real failure I am looking at it. It seems the test is not deterministic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23052 seems like a real failure --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23052 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99361/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23052 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23052 **[Test build #99361 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99361/testReport)** for PR 23052 at commit [`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23055#discussion_r237021410 --- Diff: python/pyspark/worker.py --- @@ -22,7 +22,12 @@ import os import sys import time -import resource +# 'resource' is a Unix specific module. +has_resource_module = True +try: +import resource +except ImportError: +has_resource_module = False --- End diff -- @vanzin, I'm going to remove this flag. Does it then look okay to you? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23163 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user c21 commented on the issue: https://github.com/apache/spark/pull/23163 cc people who have most context for review - @cloud-fan, @tejasapatil and @sameeragarwal. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23163 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23100 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23100 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5444/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23163 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23163: [SPARK-26164][SQL] Allow FileFormatWriter to writ...
GitHub user c21 opened a pull request: https://github.com/apache/spark/pull/23163 [SPARK-26164][SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort ## What changes were proposed in this pull request? Currently spark always requires a local sort before writing to output table on partition/bucket columns (see `write.requiredOrdering` in `FileFormatWriter.scala`), which is unnecessary, and can be avoided by keeping multiple output writers concurrently in `FileFormatDataWriter.scala`. This pr is first doing hash-based write, then falling back to sort-based write (current implementation) when number of opened writer exceeding a threshold (controlled by a config). Specifically: 1. (hash-based write) Maintain mapping between file path and output writer, and re-use writer for writing input row. In case of the number of opened output writers exceeding a threshold (can be changed by a config), we go to 2. 2. (sort-based write) Sort the rest of input rows (use the same sorter in SortExec). Then writing the rest of sorted rows, and we can close the writer on the fly, in case no more rows for current file path. ## How was this patch tested? Added unit test in `DataFrameReaderWriterSuite.scala`. Existing test like `SQLMetricsSuite.scala` would already exercise the code path of executor write metrics. You can merge this pull request into a Git repository by running: $ git pull https://github.com/c21/spark more-writers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23163.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23163 commit c2e81eb2f9cdc1b290c098d228d477f325a24101 Author: Cheng Su Date: 2018-11-28T09:46:35Z Allow FileFormatWriter to write multiple partitions/buckets without sort --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23100 Except for changing `Since` tag, is there any other comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23120 @cloud-fan May I ask you to take a look at the PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23100 **[Test build #99367 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99367/testReport)** for PR 23100 at commit [`a451756`](https://github.com/apache/spark/commit/a451756f393c107963b893ba4a74c1d6ade33dd0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r237010801 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala --- @@ -17,126 +17,512 @@ package org.apache.spark.ml.feature +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException import org.apache.spark.annotation.Since -import org.apache.spark.ml.Transformer +import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.attribute._ import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols} import org.apache.spark.ml.util._ import org.apache.spark.sql.{DataFrame, Dataset} -import org.apache.spark.sql.functions.{col, udf} -import org.apache.spark.sql.types.{DoubleType, NumericType, StructType} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, lit, udf} +import org.apache.spark.sql.types.{DoubleType, StructField, StructType} + +/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */ +private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data during transform(). + * Options are 'keep' (invalid data presented as an extra categorical feature) or + * 'error' (throw an error). + * Note that this Param is only used during transform; during fitting, invalid data + * will result in an error. + * Default: "error" + * @group param + */ + @Since("2.3.0") --- End diff -- I changed since tag of renamed `OneHotEncoder` to `3.0.0`. Because this `OneHotEncoderBase` is not renamed, I didn't change its since tag. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23100 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23100 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5443/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23100 **[Test build #99366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99366/testReport)** for PR 23100 at commit [`9e3dab7`](https://github.com/apache/spark/commit/9e3dab7388a6de18b7cf8ddcc8cc2c73a6efea67). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22947: [SPARK-24913][SQL] Make AssertNotNull and AssertTrue non...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22947 any more comments on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22957 cc @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23057 cc @gatorsmile too --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23155: [MINOR][K8S] add missing docs for podTemplateContainerNa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23155 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23155: [MINOR][K8S] add missing docs for podTemplateContainerNa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23155 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5441/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org