[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202531999

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

I was wondering how common it is for a user to add the same jar expecting it will overwrite, since we mostly consider those cases as immutable resources.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21771 We could have updated the doc for `spark.files.overwrite` too, since the confusion probably comes from this configuration.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202531920

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1849,6 +1858,9 @@ class SparkContext(config: SparkConf) extends Logging {
     if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
       logInfo(s"Added JAR $path at $key with timestamp $timestamp")
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The jar $path has been added already. Overwriting of added jars " +
--- End diff --

This could confuse what it means with `spark.files.overwrite`.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202531879

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

I meant a comment.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test FAILed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93013 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93013/testReport)** for PR 21770 at commit [`d817f9d`](https://github.com/apache/spark/commit/d817f9d046ad1d6cf63e4f7d929b449cc6d83552).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public final class BufferHolder `
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93013/ Test FAILed.
[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21740#discussion_r202531693

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -75,10 +75,22 @@ class MatrixFactorizationModel @Since("0.8.0") (
     }
   }

+  /** Check for the invalid user. */
+  private def validateUser(user: Int): Unit = {
+    require(userFeatures.lookup(user).nonEmpty, s"userId: $user not found in the model")
--- End diff --

The problem is you now look up the user and product twice. Instead, the callers should probably check whether `userFeatures.lookup(user)` is empty before calling `.head` and otherwise throw the exception.
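The single-lookup pattern the reviewer suggests can be sketched as below. This is a minimal illustration, not the actual MLlib code: a plain `Map` stands in for the RDD-backed `userFeatures.lookup(user)`, and `featuresFor` is a hypothetical helper name.

```scala
// Look the user up once, then either use the result or fail with a
// descriptive error, instead of a separate validate-then-lookup pair.
// (A Map stands in for the RDD lookup; `featuresFor` is illustrative.)
val userFeatures: Map[Int, Array[Double]] = Map(1 -> Array(0.1, 0.2))

def featuresFor(user: Int): Array[Double] = {
  val features = userFeatures.get(user).toSeq  // single lookup
  // `require` throws IllegalArgumentException when the user is absent
  require(features.nonEmpty, s"userId: $user not found in the model")
  features.head
}
```

This avoids paying the lookup cost twice, which matters when the lookup is an RDD operation rather than a hash-map probe.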
[GitHub] spark pull request #21740: [SPARK-18230][MLLib]Throw a better exception, if ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21740#discussion_r202531686

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModelSuite.scala ---
@@ -72,6 +72,22 @@ class MatrixFactorizationModelSuite extends SparkFunSuite with MLlibTestSparkCon
     }
   }

+  test("invalid user and product") {
+    val model = new MatrixFactorizationModel(rank, userFeatures, prodFeatures)
+    assert(intercept[IllegalArgumentException] {
--- End diff --

I don't think it's necessary to assert about the exact message string. Maybe assert it contains the user ID. But just checking for the exception also seems close enough.
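The looser assertion style the reviewer suggests can be sketched like this. A plain try/catch stands in for ScalaTest's `intercept[IllegalArgumentException]`, and `recommendProducts` here is a hypothetical stub that always fails, so the sketch runs standalone.

```scala
// Assert only that the exception is thrown and that its message mentions
// the user ID, rather than matching the full message string.
def recommendProducts(user: Int, num: Int): Array[Int] =
  throw new IllegalArgumentException(s"userId: $user not found in the model")

val thrown: Option[IllegalArgumentException] =
  try { recommendProducts(0, 10); None }
  catch { case e: IllegalArgumentException => Some(e) }

// Loose check: exception occurred and the message contains the user ID "0"
assert(thrown.exists(_.getMessage.contains("0")))
```

Checking only for the exception (or a key substring) keeps the test from breaking whenever the wording of the error message changes.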
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21556 **[Test build #93014 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93014/testReport)** for PR 21556 at commit [`e31c201`](https://github.com/apache/spark/commit/e31c2010fa7cd8ade77691b59940108465df4b54).
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/960/ Test PASSed.
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21556 Merged build finished. Test PASSed.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21771
[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21764 ok, thx for the kind explanation.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202531555

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

The notes are also reasonable to me. This is a common user error.
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21771 Thanks! Merged to master.
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21771 LGTM
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202531553

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

We normally do not post the JIRA number in the message.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93013/testReport)** for PR 21770 at commit [`d817f9d`](https://github.com/apache/spark/commit/d817f9d046ad1d6cf63e4f7d929b449cc6d83552).
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test PASSed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/959/ Test PASSed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21770 retest this please
[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21764 @maropu This is for advanced end users or Spark developers. An external conf looks fine, but I have to admit this might be rarely used. BTW, after having this conf, we can deprecate a few internal configurations that are used for disabling specific optimizer rules in Spark 3.0.
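The exclusion mechanism discussed here can be sketched as below. This is a simplified stand-in for what SPARK-24802 proposes, not the actual implementation: rule names arrive as a comma-separated conf value (in the spirit of an `excludedRules`-style SQLConf entry) and matching rules are filtered out of an optimizer batch. The rule class names are real Catalyst rule names used only as illustrative data.

```scala
// Comma-separated exclusion list, as it would come from a conf string.
val excludedRulesConf = "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate"

// Parse the conf value into a set of fully qualified rule names.
val excluded: Set[String] =
  excludedRulesConf.split(",").map(_.trim).filter(_.nonEmpty).toSet

// Rules of a hypothetical optimizer batch, identified by class name.
val batchRules = Seq(
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate",
  "org.apache.spark.sql.catalyst.optimizer.CollapseProject")

// The effective batch drops every excluded rule.
val effectiveRules = batchRules.filterNot(excluded.contains)
```

Driving exclusion from a conf string (rather than code, as `ExperimentalMethods` does for rule addition) is what makes it usable by end users without rebuilding Spark, which is the asymmetry maropu questions below.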
[GitHub] spark pull request #21768: [SPARK-24776][SQL]Avro unit test: deduplicate cod...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21768
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21711#discussion_r202531134

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -138,17 +138,35 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   }

   /**
-   * Checks the validity of data column names. Hive metastore disallows the table to use comma in
-   * data column names. Partition columns do not have such a restriction. Views do not have such
-   * a restriction.
+   * Checks the validity of data column names. Hive metastore disallows the table to use some
+   * special characters (',', ':', and ';') in data column names. Partition columns do not have
+   * such a restriction. Views do not have such a restriction.
    */
   private def verifyDataSchema(
       tableName: TableIdentifier,
       tableType: CatalogTableType,
       dataSchema: StructType): Unit = {
     if (tableType != VIEW) {
-      dataSchema.map(_.name).foreach { colName =>
-        if (colName.contains(",")) {
-          throw new AnalysisException("Cannot create a table having a column whose name contains " +
-            s"commas in Hive metastore. Table: $tableName; Column: $colName")
+      val invalidChars = Seq(",", ":", ";")
+      def verifyNestedColumnNames(schema: StructType): Unit = schema.foreach { f =>
+        f.dataType match {
+          case st: StructType => verifyNestedColumnNames(st)
+          case _ if invalidChars.exists(f.name.contains) =>
+            throw new AnalysisException("Cannot create a table having a nested column whose name " +
+              s"contains invalid characters (${invalidChars.map(c => s"'$c'").mkString(", ")}) " +
--- End diff --

something wrong, right?
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21711#discussion_r202531122

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -138,17 +138,35 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   }

   /**
-   * Checks the validity of data column names. Hive metastore disallows the table to use comma in
-   * data column names. Partition columns do not have such a restriction. Views do not have such
-   * a restriction.
+   * Checks the validity of data column names. Hive metastore disallows the table to use some
+   * special characters (',', ':', and ';') in data column names. Partition columns do not have
--- End diff --

> in data column names.

->

> in data column names, including nested column names.
[GitHub] spark pull request #21711: [SPARK-24681][SQL] Verify nested column names in ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21711#discussion_r202531090

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2005,6 +2005,24 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }

+  test("SPARK-24681 checks if nested column names do not include ',', ':', and ';'") {
--- End diff --

Move it to HiveDDLSuite?
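The recursive name check under review in this thread can be sketched standalone as follows. This is an illustration, not the PR's code: a tiny `ColType`/`Struct` ADT stands in for Spark's `StructType`/`StructField`, and `IllegalArgumentException` stands in for `AnalysisException`. Note that this sketch checks each field's own name before recursing, so a struct-typed field with an invalid name is also rejected.

```scala
// Simplified stand-in for Spark's schema types.
sealed trait ColType
case object Atomic extends ColType
final case class Struct(fields: Seq[(String, ColType)]) extends ColType

// Characters Hive metastore disallows in data column names.
val invalidChars = Seq(",", ":", ";")

def verifyNestedColumnNames(schema: Struct): Unit = schema.fields.foreach {
  case (name, colType) =>
    // Validate the field's own name first: a struct-typed field can itself
    // carry an invalid name, then recurse into nested structs.
    if (invalidChars.exists(name.contains)) {
      throw new IllegalArgumentException(
        s"Cannot create a table having a column whose name contains invalid " +
          s"characters (${invalidChars.map(c => s"'$c'").mkString(", ")}). Column: $name")
    }
    colType match {
      case st: Struct => verifyNestedColumnNames(st)
      case Atomic => ()
    }
}
```

With this ordering, both `bad,name` as a top-level column and as a name nested anywhere inside a struct are caught by the same code path.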
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93012/ Test FAILed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93012/testReport)** for PR 21770 at commit [`d817f9d`](https://github.com/apache/spark/commit/d817f9d046ad1d6cf63e4f7d929b449cc6d83552).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public final class BufferHolder `
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test FAILed.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202530515

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

Shall we leave a JIRA link as a comment, for example SPARK-16787 and/or SPARK-19417?
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21771 Seems fine to me.
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202530492

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

eh, @MaxGekk, how about we just give warnings without notes for now?
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202530444

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

How does it relate to `spark.files.overwrite`?
[GitHub] spark pull request #21771: [SPARK-24807][CORE] Adding files/jars twice: outp...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21771#discussion_r202530411

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1555,6 +1559,9 @@ class SparkContext(config: SparkConf) extends Logging {
       Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
         env.securityManager, hadoopConfiguration, timestamp, useCache = false)
       postEnvironmentUpdate()
+    } else {
+      logWarning(s"The path $path has been added already. Overwriting of added paths " +
--- End diff --

How does it relate to `spark.files.overwrite`, and how likely is it that we reach here?
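The add-once pattern this whole thread revolves around can be sketched as follows. This is a minimal standalone illustration, not `SparkContext` itself: a `TrieMap` plays the role of the concurrent `addedFiles` map, and the `log` buffer stands in for Spark's `logInfo`/`logWarning`. `putIfAbsent` registers the path atomically; a second add of the same path lands in the `else` branch, where the PR emits a warning instead of silently doing nothing.

```scala
import scala.collection.concurrent.TrieMap

// Concurrent registry of added paths, keyed by path with an add timestamp.
val addedFiles = TrieMap.empty[String, Long]
// Stand-in for the logger so the duplicate-add behavior is observable.
val log = scala.collection.mutable.ArrayBuffer.empty[String]

def addFile(path: String, timestamp: Long): Unit = {
  // putIfAbsent returns None only for the first add of this path.
  if (addedFiles.putIfAbsent(path, timestamp).isEmpty) {
    log += s"INFO Added file $path at timestamp $timestamp"
  } else {
    log += s"WARN The path $path has been added already. " +
      "Overwriting of added paths is not supported."
  }
}

addFile("data.csv", 1L)
addFile("data.csv", 2L)  // duplicate: warns, keeps the original timestamp
```

Note the second call does not replace the stored timestamp, which is exactly why the reviewers debate how this interacts with `spark.files.overwrite` (which governs overwriting fetched files on executors, not re-registration on the driver).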
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/958/ Test PASSed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93012/testReport)** for PR 21770 at commit [`d817f9d`](https://github.com/apache/spark/commit/d817f9d046ad1d6cf63e4f7d929b449cc6d83552).
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test PASSed.
[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21741
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21770 retest this please
[GitHub] spark issue #21741: [SPARK-24718][SQL] Timestamp support pushdown to parquet...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21741 Merged to master.
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93010/ Test PASSed.
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Merged build finished. Test PASSed.
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21657 **[Test build #93010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93010/testReport)** for PR 21657 at commit [`dd5bb59`](https://github.com/apache/spark/commit/dd5bb595fe282035e0cf8996cfea15724fe0f674).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93011/ Test FAILed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test FAILed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93011/testReport)** for PR 21770 at commit [`d817f9d`](https://github.com/apache/spark/commit/d817f9d046ad1d6cf63e4f7d929b449cc6d83552).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public final class BufferHolder `
[GitHub] spark issue #21764: [SPARK-24802] Optimization Rule Exclusion
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21764 @gatorsmile aha, ok. Do we need to make this option `external` rather than `internal`? BTW, the interfaces for adding and deleting optimizer rules differ (addition via `ExperimentalMethods`, deletion via `SQLConf`); is this design ok?
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21770 Also, `SparkException` needs to extend `RuntimeException` instead of `Exception` because some generated code does not catch the exception. But that change causes MiMa test failures, so it has been removed from this PR for now.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Merged build finished. Test PASSed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21770 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/957/ Test PASSed.
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21770 **[Test build #93011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93011/testReport)** for PR 21770 at commit [`d817f9d`](https://github.com/apache/spark/commit/d817f9d046ad1d6cf63e4f7d929b449cc6d83552).
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21766 aha, I didn't know that; PostgreSQL uses them interchangeably too.
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21657#discussion_r202528977

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -197,15 +202,21 @@ class UnivocityParser(
    */
   def parse(input: String): InternalRow = doParse(input)

+  private val getToken = if (options.columnPruning) {
+    (tokens: Array[String], index: Int) => tokens(index)
+  } else {
+    (tokens: Array[String], index: Int) => tokens(tokenIndexArr(index))
+  }
+
   private def convert(tokens: Array[String]): InternalRow = {
-    if (tokens.length != schema.length) {
+    if (tokens.length != parsedSchema.length) {
--- End diff --

will do
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21657#discussion_r202528933

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
@@ -38,24 +38,29 @@ class UnivocityParser(
     requiredSchema: StructType,
     val options: CSVOptions) extends Logging {
   require(requiredSchema.toSet.subsetOf(dataSchema.toSet),
-    "requiredSchema should be the subset of schema.")
+    s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " +
+      s"dataSchema (${dataSchema.catalogString}).")

   def this(schema: StructType, options: CSVOptions) = this(schema, schema, options)

   // A `ValueConverter` is responsible for converting the given value to a desired type.
   private type ValueConverter = String => Any

+  // This index is used to reorder parsed tokens
+  private val tokenIndexArr =
+    requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray
+
   val tokenizer = {
     val parserSetting = options.asParserSettings
     if (options.columnPruning && requiredSchema.length < dataSchema.length) {
-      val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f)))
       parserSetting.selectIndexes(tokenIndexArr: _*)
     }
     new CsvParser(parserSetting)
   }

-  private val schema = if (options.columnPruning) requiredSchema else dataSchema
-  private val row = new GenericInternalRow(schema.length)
+  private val parsedSchema = if (options.columnPruning) requiredSchema else dataSchema
--- End diff --

ok
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202528931

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -223,8 +234,8 @@ class UnivocityParser(
        } else {
          try {
            var i = 0
    -        while (i < schema.length) {
    -          row(i) = valueConverters(i).apply(tokens(i))
    +        while (i < requiredSchema.length) {
    --- End diff --

ok

---
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202528391

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -197,15 +202,21 @@ class UnivocityParser(
        */
      def parse(input: String): InternalRow = doParse(input)

    +  private val getToken = if (options.columnPruning) {
    +    (tokens: Array[String], index: Int) => tokens(index)
    +  } else {
    +    (tokens: Array[String], index: Int) => tokens(tokenIndexArr(index))
    +  }
    +
      private def convert(tokens: Array[String]): InternalRow = {
    -    if (tokens.length != schema.length) {
    +    if (tokens.length != parsedSchema.length) {
    --- End diff --

If possible, could you add a test case that satisfies `tokens.length != parsedSchema.length`?

---
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202528267

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -38,24 +38,29 @@ class UnivocityParser(
        requiredSchema: StructType,
        val options: CSVOptions) extends Logging {
      require(requiredSchema.toSet.subsetOf(dataSchema.toSet),
    -    "requiredSchema should be the subset of schema.")
    +    s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " +
    +      s"dataSchema (${dataSchema.catalogString}).")

      def this(schema: StructType, options: CSVOptions) = this(schema, schema, options)

      // A `ValueConverter` is responsible for converting the given value to a desired type.
      private type ValueConverter = String => Any

    +  // This index is used to reorder parsed tokens
    +  private val tokenIndexArr =
    +    requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray
    +
      val tokenizer = {
        val parserSetting = options.asParserSettings
        if (options.columnPruning && requiredSchema.length < dataSchema.length) {
    -      val tokenIndexArr = requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f)))
          parserSetting.selectIndexes(tokenIndexArr: _*)
        }
        new CsvParser(parserSetting)
      }

    -  private val schema = if (options.columnPruning) requiredSchema else dataSchema
    -  private val row = new GenericInternalRow(schema.length)
    +  private val parsedSchema = if (options.columnPruning) requiredSchema else dataSchema
    --- End diff --

Add a comment like:

> // When column pruning is enabled, the parser only parses the required columns based on their positions in the data schema.

---
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202528372

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -223,8 +234,8 @@ class UnivocityParser(
        } else {
          try {
            var i = 0
    -        while (i < schema.length) {
    -          row(i) = valueConverters(i).apply(tokens(i))
    +        while (i < requiredSchema.length) {
    --- End diff --

Add the comment like:

> // When the length of the returned tokens is identical to the length of the parsed schema, we just need to convert the tokens that correspond to the required columns.

---
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202527982

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
    @@ -1579,4 +1579,20 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
        }
      }
    }
    +
    +  test("SPARK-24676 project required data from parsed data when columnPruning disabled") {
    +    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") {
    +      withTempPath { path =>
    +        val dir = path.getAbsolutePath
    +        spark.range(10).selectExpr("id % 2 AS p", "id AS c0", "id AS c1").write.partitionBy("p")
    +          .option("header", "true").csv(dir)
    +        var df = spark.read.option("header", true).csv(dir).selectExpr("sum(p)", "count(c0)")
    --- End diff --

Normally, we do not use `var` for DataFrame even in test cases.

---
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202528247

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -38,24 +38,29 @@ class UnivocityParser(
        requiredSchema: StructType,
        val options: CSVOptions) extends Logging {
      require(requiredSchema.toSet.subsetOf(dataSchema.toSet),
    -    "requiredSchema should be the subset of schema.")
    +    s"requiredSchema (${requiredSchema.catalogString}) should be the subset of " +
    +      s"dataSchema (${dataSchema.catalogString}).")

      def this(schema: StructType, options: CSVOptions) = this(schema, schema, options)

      // A `ValueConverter` is responsible for converting the given value to a desired type.
      private type ValueConverter = String => Any

    +  // This index is used to reorder parsed tokens
    +  private val tokenIndexArr =
    +    requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray
    +
      val tokenizer = {
        val parserSetting = options.asParserSettings
        if (options.columnPruning && requiredSchema.length < dataSchema.length) {
    --- End diff --

Can be simplified to:

> // When to-be-parsed schema is shorter than the to-be-read data schema, we let Univocity CSV parser select a sequence of fields for reading by their positions.
> `if (parsedSchema.length < dataSchema.length)`

---
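The guard discussed above hands the Univocity parser a list of column positions so that only those fields are materialized, in the requested order. A toy sketch of position-based selection (Python, standing in for the Scala code; `select_indexes` is an illustrative stand-in, not the real Univocity API):

```python
# Toy illustration of position-based field selection for a CSV line:
# keep only the requested column positions, in the requested order.
def select_indexes(line, indexes):
    fields = line.split(",")
    return [fields[i] for i in indexes]

header = ["id", "name", "age"]
required = ["age", "id"]

# Like tokenIndexArr in the diff: position of each required field
# inside the full data schema.
positions = [header.index(f) for f in required]

pruned = select_indexes("7,alice,34", positions)
```

The real parser does the selection while tokenizing, so skipped columns are never copied out of the input buffer at all; the effect on the returned token array is the same.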
[GitHub] spark pull request #21657: [SPARK-24676][SQL] Project required data from CSV...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21657#discussion_r202527784

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -38,24 +38,29 @@ class UnivocityParser(
        requiredSchema: StructType,
    --- End diff --

Could you add the parameter descriptions of `dataSchema` and `requiredSchema` above class `UnivocityParser`?

---
[GitHub] spark pull request #20611: [SPARK-23425][SQL]Support wildcard in HDFS path f...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20611#discussion_r202528097

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
    @@ -303,94 +303,44 @@ case class LoadDataCommand(
              s"partitioned, but a partition spec was provided.")
          }
        }
    -
    -    val loadPath =
    +    val loadPath = {
          if (isLocal) {
    -        val uri = Utils.resolveURI(path)
    -        val file = new File(uri.getPath)
    -        val exists = if (file.getAbsolutePath.contains("*")) {
    -          val fileSystem = FileSystems.getDefault
    -          val dir = file.getParentFile.getAbsolutePath
    -          if (dir.contains("*")) {
    -            throw new AnalysisException(
    -              s"LOAD DATA input path allows only filename wildcard: $path")
    -          }
    -
    -          // Note that special characters such as "*" on Windows are not allowed as a path.
    -          // Calling `WindowsFileSystem.getPath` throws an exception if there are in the path.
    -          val dirPath = fileSystem.getPath(dir)
    -          val pathPattern = new File(dirPath.toAbsolutePath.toString, file.getName).toURI.getPath
    -          val safePathPattern = if (Utils.isWindows) {
    -            // On Windows, the pattern should not start with slashes for absolute file paths.
    -            pathPattern.stripPrefix("/")
    -          } else {
    -            pathPattern
    -          }
    -          val files = new File(dir).listFiles()
    -          if (files == null) {
    -            false
    -          } else {
    -            val matcher = fileSystem.getPathMatcher("glob:" + safePathPattern)
    -            files.exists(f => matcher.matches(fileSystem.getPath(f.getAbsolutePath)))
    -          }
    -        } else {
    -          new File(file.getAbsolutePath).exists()
    -        }
    -        if (!exists) {
    -          throw new AnalysisException(s"LOAD DATA input path does not exist: $path")
    -        }
    -        uri
    +        val localFS = FileContext.getLocalFSFileContext()
    +        localFS.makeQualified(new Path(path))
          } else {
    -        val uri = new URI(path)
    -        val hdfsUri = if (uri.getScheme() != null && uri.getAuthority() != null) {
    -          uri
    -        } else {
    -          // Follow Hive's behavior:
    -          // If no schema or authority is provided with non-local inpath,
    -          // we will use hadoop configuration "fs.defaultFS".
    -          val defaultFSConf = sparkSession.sessionState.newHadoopConf().get("fs.defaultFS")
    -          val defaultFS = if (defaultFSConf == null) {
    -            new URI("")
    -          } else {
    -            new URI(defaultFSConf)
    -          }
    -
    -          val scheme = if (uri.getScheme() != null) {
    --- End diff --

My point is really about using a quasi-private method here. The implementation of that method actually looks like the old code here, although not entirely the same. I'm wondering whether it was therefore based on the implementation of the method you're trying to call already?

---
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21657

**[Test build #93010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93010/testReport)** for PR 21657 at commit [`dd5bb59`](https://github.com/apache/spark/commit/dd5bb595fe282035e0cf8996cfea15724fe0f674).

---
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/956/ Test PASSed. ---
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21657 Merged build finished. Test PASSed. ---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21769 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93009/ Test PASSed. ---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21769

**[Test build #93009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93009/testReport)** for PR 21769 at commit [`3b75c27`](https://github.com/apache/spark/commit/3b75c2719abc1a6295f9fd64646853b4f1928575).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.

---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21769 Merged build finished. Test PASSed. ---
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21657 ok to test ---
[GitHub] spark issue #21657: [SPARK-24676][SQL] Project required data from CSV parsed...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21657 retest this please ---
[GitHub] spark issue #21766: [SPARK-24803][SQL] add support for numeric
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21766

They are not exactly the same in ANSI SQL, although MS SQL Server looks like it uses them interchangeably. https://docs.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-2017

> 26) NUMERIC specifies the data type exact numeric, with the decimal precision and scale specified by the <precision> and <scale>.
> 27) DECIMAL specifies the data type exact numeric, with the decimal scale specified by the <scale> and the implementation-defined decimal precision equal to or greater than the value of the specified <precision>.

---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21769

**[Test build #93009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93009/testReport)** for PR 21769 at commit [`3b75c27`](https://github.com/apache/spark/commit/3b75c2719abc1a6295f9fd64646853b4f1928575).

---
[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21770 cc @rednaxelafx ---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21769 Merged build finished. Test FAILed. ---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21769 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93008/ Test FAILed. ---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21769

**[Test build #93008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93008/testReport)** for PR 21769 at commit [`a7d078e`](https://github.com/apache/spark/commit/a7d078e439a8ea79260377a966dc31c2d3ae38f7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.

---
[GitHub] spark issue #21769: [SPARK-24805][SQL] Do not ignore avro files without exte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21769

**[Test build #93008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93008/testReport)** for PR 21769 at commit [`a7d078e`](https://github.com/apache/spark/commit/a7d078e439a8ea79260377a966dc31c2d3ae38f7).

---
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21769#discussion_r202526423

    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
    @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister {
        // Schema evolution is not supported yet. Here we only pick a single random sample file to
        // figure out the schema of the whole dataset.
        val sampleFile =
    -      if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
    +      if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
    --- End diff --

The Hadoop config can be changed like:

```scala
spark
  .sqlContext
  .sparkContext
  .hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "true")
```

---
[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/21698

Taking a step back and analyzing the solution for the problem at hand, there are three main issues with the proposal:

* It does not solve the problem in a general manner.
  * I gave the example of zip, sample - it applies to any order-sensitive closure.
  * It does not fix the issue when a child stage has one or more completed tasks.
* Even if we assume it is a specific fix for repartition/coalesce - even there it does not solve the problem and can cause data loss.
* It causes a performance regression for existing workarounds.
  * The common workaround for this issue is to checkpoint + action, or to do a local/global sort (I believe sql does the latter now?).
  * The proposal causes a performance regression for these existing workarounds.

The corner case where the proposal works is if: a) the order-sensitive stage has finished, and b) no task in the child stage has finished fetching its shuffle input. This is a fairly narrow subset, which is why I don't believe the current approach helps.

Having said that, if it is possible to enhance the approach, that would be great! This is a fairly nasty issue which hurts users, and typically people who are aware of the problem tend to always pay a performance cost to avoid the corner case.

---
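The order sensitivity behind this discussion can be seen with a toy round-robin partitioner (Python, illustrative only; real Spark shuffles are far more involved): a row's partition depends purely on its position, so the same multiset of rows lands differently when the upstream order changes, which is why a lost-and-recomputed map output can corrupt downstream results.

```python
# Toy "repartition": round-robin assigns each row to a partition by its
# position in the input, not by its value.
def round_robin(rows, num_partitions):
    parts = {p: [] for p in range(num_partitions)}
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

run_a = round_robin(["a", "b", "c", "d"], 2)
run_b = round_robin(["b", "a", "c", "d"], 2)  # same rows, different order

# Partition contents differ even though the input multiset is identical.
# If one run's map output is lost and recomputed under the other order,
# downstream tasks that already consumed the old output can double-count
# or drop rows.
```

This is also why the workarounds mentioned above (checkpoint + action, or a local/global sort) help: they pin down a deterministic row order before the position-sensitive partitioning runs.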
[GitHub] spark pull request #21750: [SPARK-24754][ML] Minhash integer overflow
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21750 ---
[GitHub] spark issue #21750: [SPARK-24754][ML] Minhash integer overflow
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21750 Merged to master ---
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21769#discussion_r202525034

    --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
    @@ -623,7 +624,7 @@ class AvroSuite extends SparkFunSuite {
          spark.read.avro("*/*/*/*/*/*/*/something.avro")
        }

    -    intercept[FileNotFoundException] {
    +    intercept[java.io.IOException] {
          TestUtils.withTempDir { dir =>
            FileUtils.touch(new File(dir, "test"))
            spark.read.avro(dir.toString)
    --- End diff --

I would actually remove this piece of code from the test, and write a separate test (reading a folder with files without `.avro` extensions) in which Hadoop's parameter `avro.mapred.ignore.inputs.without.extension` is set to `true` explicitly. Unfortunately, there is no method like `withSQLConf` for Hadoop's configs.

---
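The missing `withSQLConf`-style helper for Hadoop configs could follow the usual set-then-restore pattern. A minimal sketch (this `with_conf` helper is hypothetical, not an existing Spark or Hadoop API; a plain dict stands in for Hadoop's `Configuration` so the pattern is runnable standalone):

```python
from contextlib import contextmanager

hadoop_conf = {}           # stand-in for a mutable Hadoop Configuration
_UNSET = object()          # sentinel: distinguishes "key absent" from any value

@contextmanager
def with_conf(conf, pairs):
    # Remember the previous value of every key we are about to touch.
    saved = {k: conf.get(k, _UNSET) for k in pairs}
    conf.update(pairs)
    try:
        yield conf
    finally:
        for k, old in saved.items():
            if old is _UNSET:
                conf.pop(k, None)   # key was absent before: remove it again
            else:
                conf[k] = old       # key existed: restore the previous value

with with_conf(hadoop_conf,
               {"avro.mapred.ignore.inputs.without.extension": "true"}) as c:
    seen = c["avro.mapred.ignore.inputs.without.extension"]
# After the block, hadoop_conf is back to its original state.
```

The restore step in `finally` is what makes such a helper safe to use in tests: even if the body throws, the shared configuration is left untouched for subsequent tests.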
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21769#discussion_r202524884

    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
    @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister {
        // Schema evolution is not supported yet. Here we only pick a single random sample file to
        // figure out the schema of the whole dataset.
        val sampleFile =
    -      if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
    +      if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
    --- End diff --

Here is how people use the option so far: https://github.com/databricks/spark-avro/issues/71#issuecomment-280844562 . Probably we should discuss separately from this PR how we are going to fix the "bug" and break backward compatibility.

---
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21769#discussion_r202524696

    --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
    @@ -809,4 +810,16 @@ class AvroSuite extends SparkFunSuite {
        assert(readDf.collect().sameElements(writeDf.collect()))
      }
    }
    +
    +  test("SPARK-24805: reading files without .avro extension") {
    --- End diff --

Doesn't it just introduce an unnecessary dependency here and overcomplicate the test? I can create a small avro file (with just one row) without the `.avro` extension especially for this test.

---
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21556 I misunderstood how it was safe as well. It was Yuming's clarification that helped. ---
[GitHub] spark pull request #21769: [SPARK-24805][SQL] Do not ignore avro files witho...
Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21769#discussion_r202524552

    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala ---
    @@ -64,7 +64,7 @@ private[avro] class AvroFileFormat extends FileFormat with DataSourceRegister {
        // Schema evolution is not supported yet. Here we only pick a single random sample file to
        // figure out the schema of the whole dataset.
        val sampleFile =
    -      if (conf.getBoolean(AvroFileFormat.IgnoreFilesWithoutExtensionProperty, true)) {
    +      if (AvroFileFormat.ignoreFilesWithoutExtensions(conf)) {
    --- End diff --

`avro.mapred.ignore.inputs.without.extension` is a Hadoop parameter. This PR aims to change the default behavior only. I don't want to convert the Hadoop parameter into an Avro datasource option here.

---
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21771 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93007/ Test PASSed. ---
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21771 Merged build finished. Test PASSed. ---
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21771

**[Test build #93007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93007/testReport)** for PR 21771 at commit [`4a583c6`](https://github.com/apache/spark/commit/4a583c60e6fdf0f98b0d7b5afa4e81af64cb5ebb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.

---
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21771 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93005/ Test PASSed. ---
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21771 Merged build finished. Test PASSed. ---
[GitHub] spark issue #21771: [SPARK-24807][CORE] Adding files/jars twice: output a wa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21771

**[Test build #93005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93005/testReport)** for PR 21771 at commit [`0942fd2`](https://github.com/apache/spark/commit/0942fd26e47d685280a85bff465d5ee01708f608).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.

---
[GitHub] spark issue #21741: [SPARK-24718][SQL] Timestamp support pushdown to parquet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21741 Merged build finished. Test PASSed. ---
[GitHub] spark issue #21741: [SPARK-24718][SQL] Timestamp support pushdown to parquet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21741 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93006/ Test PASSed. ---
[GitHub] spark issue #21741: [SPARK-24718][SQL] Timestamp support pushdown to parquet...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21741

**[Test build #93006 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93006/testReport)** for PR 21741 at commit [`f206457`](https://github.com/apache/spark/commit/f20645750017e7d78bd3ccbadce8ee579bfaeee8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * ` class SerializableConfiguration(@transient var value: Configuration)`
   * ` class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)`
   * ` case class SchemaType(dataType: DataType, nullable: Boolean)`
   * ` implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) `
   * ` implicit class AvroDataFrameReader(reader: DataFrameReader) `
   * `class KMeansModel (@Since(\"1.0.0\") val clusterCenters: Array[Vector],`
   * `abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast `
   * `case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike `

---
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21102 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93002/ Test PASSed. ---
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21102 Merged build finished. Test PASSed. ---
[GitHub] spark issue #21102: [SPARK-23913][SQL] Add array_intersect function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21102

**[Test build #93002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93002/testReport)** for PR 21102 at commit [`5492572`](https://github.com/apache/spark/commit/5492572bb83b314cd31d116d3b344ae3c4596dbd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.

---
[GitHub] spark issue #21103: [SPARK-23915][SQL] Add array_except function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21103 Merged build finished. Test PASSed. ---