[GitHub] [spark-website] viirya commented on a change in pull request #289: Release 3.0.1
viirya commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r485342723

## File path: site/developer-tools.html ##

@@ -505,8 +505,8 @@ Solving a binary incompatibility
 For the problem described above, we might add the following:
-// [SPARK-zz][CORE] Fix an issue
-ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
+// [SPARK-zz][CORE] Fix an issue
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

(The - and + lines appear identical here because the HTML span markup that actually changed was stripped in this plain-text excerpt.)

Review comment:
I tried it locally. Just running `jekyll build` reproduces these changes.
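For readers following the thread: an exclusion like the one quoted above is the kind of entry added to project/MimaExcludes.scala in the Spark repo. Below is a minimal sketch of the shape of such an entry; the enclosing object and list names are simplified placeholders (the real file groups excludes into per-release lists), and it assumes the mima-core library is on the classpath.

```
// Sketch of a MiMa exclusion entry. Object and list names are placeholders.
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  // Each entry silences one reported binary incompatibility; the JIRA ticket
  // is recorded in a comment so the exclusion can be audited later.
  lazy val excludes = Seq(
    // [SPARK-zz][CORE] Fix an issue
    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
  )
}
```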
[spark] branch master updated (de0dc52 -> 794b48c)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from de0dc52  [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession
     add 794b48c  [SPARK-32204][SPARK-32182][DOCS][FOLLOW-UP] Use IPython instead of ipython to check if installed in dev/lint-python

No new revisions were added by this update.

Summary of changes:
 dev/lint-python | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.0 updated: [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 4c0f9d8  [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession

4c0f9d8 is described below

commit 4c0f9d8b44f63a3d1faaeece8b1d6b47c3bfe75f
Author: Liang-Chi Hsieh
AuthorDate: Wed Sep 9 12:23:05 2020 +0900

    [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession

    ### What changes were proposed in this pull request?

    If no active SparkSession is available, let `FileSourceScanExec.needsUnsafeRowConversion` look at the default SQL config for the ParquetSource vectorized reader instead of failing the query execution.

    ### Why are the changes needed?

    This fixes a bug: if no active SparkSession is available, a file-based data source scan for ParquetSource throws an exception.

    ### Does this PR introduce _any_ user-facing change?

    Yes, this change fixes the bug.

    ### How was this patch tested?

    Unit test.

    Closes #29667 from viirya/SPARK-32813.

    Authored-by: Liang-Chi Hsieh
    Signed-off-by: HyukjinKwon
    (cherry picked from commit de0dc52a842bf4374c1ae4f9546dd95b3f35c4f1)
    Signed-off-by: HyukjinKwon
---
 .../spark/sql/execution/DataSourceScanExec.scala   |  2 +-
 .../spark/sql/execution/SQLExecutionSuite.scala    | 40 +++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
index 447e0a6..0fcb0dd 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
@@ -175,7 +175,7 @@ case class FileSourceScanExec(

   private lazy val needsUnsafeRowConversion: Boolean = {
     if (relation.fileFormat.isInstanceOf[ParquetSource]) {
-      SparkSession.getActiveSession.get.sessionState.conf.parquetVectorizedReaderEnabled
+      sqlContext.conf.parquetVectorizedReaderEnabled
     } else {
       false
     }

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala
index 8bf7fe6..81e6920 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala
@@ -17,11 +17,17 @@

 package org.apache.spark.sql.execution

+import java.util.concurrent.Executors
+
 import scala.collection.parallel.immutable.ParRange
+import scala.concurrent.{ExecutionContext, Future}
+import scala.concurrent.duration._

 import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}
 import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
-import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.types._
+import org.apache.spark.util.ThreadUtils

 class SQLExecutionSuite extends SparkFunSuite {

@@ -119,6 +125,38 @@ class SQLExecutionSuite extends SparkFunSuite {

     spark.stop()
   }
+
+  test("SPARK-32813: Table scan should work in different thread") {
+    val executor1 = Executors.newSingleThreadExecutor()
+    val executor2 = Executors.newSingleThreadExecutor()
+    var session: SparkSession = null
+    SparkSession.cleanupAnyExistingSession()
+
+    withTempDir { tempDir =>
+      try {
+        val tablePath = tempDir.toString + "/table"
+        val df = ThreadUtils.awaitResult(Future {
+          session = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
+
+          session.createDataFrame(
+            session.sparkContext.parallelize(Row(Array(1, 2, 3)) :: Nil),
+            StructType(Seq(
+              StructField("a", ArrayType(IntegerType, containsNull = false), nullable = false))))
+            .write.parquet(tablePath)
+
+          session.read.parquet(tablePath)
+        }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
+
+        ThreadUtils.awaitResult(Future {
+          assert(df.rdd.collect()(0) === Row(Seq(1, 2, 3)))
+        }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
+      } finally {
+        executor1.shutdown()
+        executor2.shutdown()
+        session.stop()
+      }
+    }
+  }
 }

 object SQLExecutionSuite {
[spark] branch master updated (514bf56 -> de0dc52)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 514bf56  [SPARK-32823][WEB UI] Fix the master ui resources reporting
     add de0dc52  [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/DataSourceScanExec.scala   |  2 +-
 .../spark/sql/execution/SQLExecutionSuite.scala    | 40 +++++++++++++++++++++-
 2 files changed, 40 insertions(+), 2 deletions(-)
[GitHub] [spark-website] viirya commented on a change in pull request #289: Release 3.0.1
viirya commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r485312315

## File path: site/developer-tools.html ##

@@ -505,8 +505,8 @@ Solving a binary incompatibility
 For the problem described above, we might add the following:
-// [SPARK-zz][CORE] Fix an issue
-ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
+// [SPARK-zz][CORE] Fix an issue
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

Review comment:
Oh, OK. Then it should be fine. I don't see a change in developer-tools related to this; maybe it is related to jekyll.
[GitHub] [spark-website] zhengruifeng commented on a change in pull request #289: Release 3.0.1
zhengruifeng commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r485309360

## File path: site/developer-tools.html ##

@@ -505,8 +505,8 @@ Solving a binary incompatibility
 For the problem described above, we might add the following:
-// [SPARK-zz][CORE] Fix an issue
-ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
+// [SPARK-zz][CORE] Fix an issue
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

Review comment:
@viirya those changes in the second commit (3b416ecb027c9262a28ad778d50634d6bd582f81) were made by running `jekyll build` in the `spark-website` directory.
[GitHub] [spark-website] viirya commented on a change in pull request #289: Release 3.0.1
viirya commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r485302593

## File path: site/developer-tools.html ##

@@ -505,8 +505,8 @@ Solving a binary incompatibility
 For the problem described above, we might add the following:
-// [SPARK-zz][CORE] Fix an issue
-ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
+// [SPARK-zz][CORE] Fix an issue
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

Review comment:
By the way, do we have `class="py"`? I don't see `py` in the CSS on the current site, or maybe I missed it.
[GitHub] [spark-website] viirya commented on a change in pull request #289: Release 3.0.1
viirya commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r485301988

## File path: site/developer-tools.html ##

@@ -505,8 +505,8 @@ Solving a binary incompatibility
 For the problem described above, we might add the following:
-// [SPARK-zz][CORE] Fix an issue
-ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")
+// [SPARK-zz][CORE] Fix an issue
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

Review comment:
Why the change from `nc` to `nv`?
[spark] branch branch-3.0 updated: [SPARK-32823][WEB UI] Fix the master ui resources reporting
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 86b9dd9  [SPARK-32823][WEB UI] Fix the master ui resources reporting

86b9dd9 is described below

commit 86b9dd9b414cab9489f00ef7acec5c3264fa312d
Author: Thomas Graves
AuthorDate: Wed Sep 9 10:33:21 2020 +0900

    [SPARK-32823][WEB UI] Fix the master ui resources reporting

    ### What changes were proposed in this pull request?

    Fixes the master UI so that it properly sums the resource totals across multiple workers (the "Resources in use: 0 / 8 gpu" field).

    The bug here is that it was creating a MutableResourceInfo per worker and then reducing with the + operator. The + operator in MutableResourceInfo simply adds the addresses from one instance to the addresses of the other. But it is using a HashSet, so if the addresses are the same you lose the correct amount; i.e. if worker1 has GPU addresses 0,1,2,3 and worker2 also has addresses 0,1,2,3, you only see 4 total GPUs when there are 8.

    In this case we don't really need to create the MutableResourceInfo at all, because we just want the sums for used and total, so just remove the use of it. The other uses of it are per worker, so those should be OK.

    ### Why are the changes needed?

    Fix the UI.

    ### Does this PR introduce _any_ user-facing change?

    UI only.

    ### How was this patch tested?

    Tested manually on a standalone cluster with multiple workers, multiple GPUs, and multiple FPGAs.

    Closes #29683 from tgravescs/SPARK-32823.

    Lead-authored-by: Thomas Graves
    Co-authored-by: Thomas Graves
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 514bf563a7fa1f470b2ab088c0838317500a9aab)
    Signed-off-by: HyukjinKwon
---
 .../org/apache/spark/deploy/StandaloneResourceUtils.scala    | 10 +++++-----
 .../scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 10 ++++------
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala b/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala
index e08709e..c7c31a8 100644
--- a/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala
@@ -149,11 +149,11 @@ private[spark] object StandaloneResourceUtils extends Logging {

   // used for UI
   def formatResourcesUsed(
-      resourcesTotal: Map[String, ResourceInformation],
-      resourcesUsed: Map[String, ResourceInformation]): String = {
-    resourcesTotal.map { case (rName, rInfo) =>
-      val used = resourcesUsed(rName).addresses.length
-      val total = rInfo.addresses.length
+      resourcesTotal: Map[String, Int],
+      resourcesUsed: Map[String, Int]): String = {
+    resourcesTotal.map { case (rName, totalSize) =>
+      val used = resourcesUsed(rName)
+      val total = totalSize
       s"$used / $total $rName"
     }.mkString(", ")
   }

diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index f64b449..fcbeba9 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -76,19 +76,17 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {

   private def formatMasterResourcesInUse(aliveWorkers: Array[WorkerInfo]): String = {
     val totalInfo = aliveWorkers.map(_.resourcesInfo)
-      .map(resources => toMutable(resources))
       .flatMap(_.toIterator)
       .groupBy(_._1) // group by resource name
       .map { case (rName, rInfoArr) =>
-        rName -> rInfoArr.map(_._2).reduce(_ + _)
-      }.map { case (k, v) => (k, v.toResourceInformation) }
+        rName -> rInfoArr.map(_._2.addresses.size).sum
+      }
     val usedInfo = aliveWorkers.map(_.resourcesInfoUsed)
-      .map (resources => toMutable(resources))
       .flatMap(_.toIterator)
       .groupBy(_._1) // group by resource name
       .map { case (rName, rInfoArr) =>
-        rName -> rInfoArr.map(_._2).reduce(_ + _)
-      }.map { case (k, v) => (k, v.toResourceInformation) }
+        rName -> rInfoArr.map(_._2.addresses.size).sum
+      }
     formatResourcesUsed(totalInfo, usedInfo)
   }
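The undercounting described in the commit message is easy to demonstrate in isolation. Below is a stand-alone sketch with hypothetical data (not Spark's WorkerInfo or MutableResourceInfo types) contrasting the buggy set-union aggregation with the fixed size-summing aggregation:

```
// Two workers whose GPU address sets happen to be identical.
object ResourceCountSketch {
  def main(args: Array[String]): Unit = {
    val workerGpuAddresses = Seq(Set("0", "1", "2", "3"), Set("0", "1", "2", "3"))

    // Buggy aggregation (what MutableResourceInfo's + effectively did):
    // unioning the sets collapses identical addresses across workers.
    val merged = workerGpuAddresses.reduce(_ ++ _)
    assert(merged.size == 4) // reports 4 GPUs, losing the second worker's 4

    // Fixed aggregation: sum the per-worker counts instead.
    val total = workerGpuAddresses.map(_.size).sum
    assert(total == 8) // correct cluster-wide total
  }
}
```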
[spark] branch master updated (adc8d68 -> 514bf56)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from adc8d68  [SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV datasources v1 and v2
     add 514bf56  [SPARK-32823][WEB UI] Fix the master ui resources reporting

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/StandaloneResourceUtils.scala    | 10 +++++-----
 .../scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 10 ++++------
 2 files changed, 9 insertions(+), 11 deletions(-)
[spark] branch master updated (e8634d8 -> adc8d68)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from e8634d8  [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config
     add adc8d68  [SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV datasources v1 and v2

No new revisions were added by this update.

Summary of changes:
 .../sql/execution/datasources/csv/CSVSuite.scala   | 13 +++++++++++++
 .../sql/execution/datasources/json/JsonSuite.scala | 13 +++++++++++++
 .../sql/test/DataFrameReaderWriterSuite.scala      | 23 -----------------------
 3 files changed, 26 insertions(+), 23 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new e86d90b  [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config

e86d90b is described below

commit e86d90b21d4b3d6658b3cb6dd30daafb32b0c1bd
Author: Thomas Graves
AuthorDate: Wed Sep 9 10:28:40 2020 +0900

    [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config

    ### What changes were proposed in this pull request?

    If the user forgets to specify `.amount` on a resource config like `spark.executor.resource.gpu`, the error message thrown is very confusing:

    ```
    ERROR SparkContext: Error initializing SparkContext.
    java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1967)
        at
    ```

    This change makes it so a readable error is thrown instead.

    ### Why are the changes needed?

    The current error is confusing for users.

    ### Does this PR introduce _any_ user-facing change?

    Just the error message.

    ### How was this patch tested?

    Tested manually on a standalone cluster.

    Closes #29685 from tgravescs/SPARK-32824.

    Authored-by: Thomas Graves
    Signed-off-by: HyukjinKwon
    (cherry picked from commit e8634d8f6f8548852a284a32c1b7da24bedd8ff7)
    Signed-off-by: HyukjinKwon
---
 core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala b/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala
index 994b363..16fe897 100644
--- a/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala
+++ b/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala
@@ -148,7 +148,12 @@ private[spark] object ResourceUtils extends Logging {

   def listResourceIds(sparkConf: SparkConf, componentName: String): Seq[ResourceID] = {
     sparkConf.getAllWithPrefix(s"$componentName.$RESOURCE_PREFIX.").map { case (key, _) =>
-      key.substring(0, key.indexOf('.'))
+      val index = key.indexOf('.')
+      if (index < 0) {
+        throw new SparkException(s"You must specify an amount config for resource: $key " +
+          s"config: $componentName.$RESOURCE_PREFIX.$key")
+      }
+      key.substring(0, index)
     }.distinct.map(name => new ResourceID(componentName, name))
   }
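The `substring(0, indexOf('.'))` failure is straightforward to reproduce outside Spark. Below is a stand-alone sketch of the parsing step, operating on a key with the `spark.<component>.resource.` prefix already stripped; the function name is invented for illustration and the exception type is simplified (Spark throws SparkException):

```
// With the prefix stripped, a correct key looks like "gpu.amount"; if the
// user forgets ".amount", only "gpu" remains, indexOf('.') returns -1, and
// the old substring(0, -1) call threw StringIndexOutOfBoundsException.
object ResourceKeySketch {
  def resourceName(strippedKey: String): String = {
    val index = strippedKey.indexOf('.')
    if (index < 0) {
      // Fixed behavior: fail with a readable message instead.
      throw new IllegalArgumentException(
        s"You must specify an amount config for resource: $strippedKey")
    }
    strippedKey.substring(0, index)
  }

  def main(args: Array[String]): Unit = {
    assert(resourceName("gpu.amount") == "gpu")
    try { resourceName("gpu"); assert(false) }
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```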
[spark] branch master updated (96ff87dc -> e8634d8)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 96ff87dc  [SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test
     add e8634d8   [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
[spark] branch branch-3.0 updated: [SPARK-32823][WEB UI] Fix the master ui resources reporting
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 86b9dd9 [SPARK-32823][WEB UI] Fix the master ui resources reporting 86b9dd9 is described below commit 86b9dd9b414cab9489f00ef7acec5c3264fa312d Author: Thomas Graves AuthorDate: Wed Sep 9 10:33:21 2020 +0900 [SPARK-32823][WEB UI] Fix the master ui resources reporting ### What changes were proposed in this pull request? Fixes the master UI for properly summing the resources total across multiple workers. field: Resources in use: 0 / 8 gpu The bug here is that it was creating MutableResourceInfo and then reducing using the + operator. the + operator in MutableResourceInfo simple adds the address from one to the addresses of the other. But its using a HashSet so if the addresses are the same then you lose the correct amount. ie worker1 has gpu addresses 0,1,2,3 and worker2 has addresses 0,1,2,3 then you only see 4 total GPUs when there are 8. In this case we don't really need to create the MutableResourceInfo at all because we just want the sums for used and total so just remove the use of it. The other uses of it are per Worker so those should be ok. ### Why are the changes needed? fix UI ### Does this PR introduce _any_ user-facing change? UI ### How was this patch tested? tested manually on standalone cluster with multiple workers and multiple GPUs and multiple fpgas Closes #29683 from tgravescs/SPARK-32823. Lead-authored-by: Thomas Graves Co-authored-by: Thomas Graves Signed-off-by: HyukjinKwon (cherry picked from commit 514bf563a7fa1f470b2ab088c0838317500a9aab) Signed-off-by: HyukjinKwon --- .../org/apache/spark/deploy/StandaloneResourceUtils.scala | 10 +- .../scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 10 -- 2 files changed, 9 insertions(+), 11 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala b/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala index e08709e..c7c31a8 100644 --- a/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala +++ b/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala @@ -149,11 +149,11 @@ private[spark] object StandaloneResourceUtils extends Logging { // used for UI def formatResourcesUsed( - resourcesTotal: Map[String, ResourceInformation], - resourcesUsed: Map[String, ResourceInformation]): String = { -resourcesTotal.map { case (rName, rInfo) => - val used = resourcesUsed(rName).addresses.length - val total = rInfo.addresses.length + resourcesTotal: Map[String, Int], + resourcesUsed: Map[String, Int]): String = { +resourcesTotal.map { case (rName, totalSize) => + val used = resourcesUsed(rName) + val total = totalSize s"$used / $total $rName" }.mkString(", ") } diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index f64b449..fcbeba9 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -76,19 +76,17 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { private def formatMasterResourcesInUse(aliveWorkers: Array[WorkerInfo]): String = { val totalInfo = aliveWorkers.map(_.resourcesInfo) - .map(resources => toMutable(resources)) .flatMap(_.toIterator) 
.groupBy(_._1) // group by resource name .map { case (rName, rInfoArr) => -rName -> rInfoArr.map(_._2).reduce(_ + _) - }.map { case (k, v) => (k, v.toResourceInformation) } + rName -> rInfoArr.map(_._2.addresses.size).sum +} val usedInfo = aliveWorkers.map(_.resourcesInfoUsed) - .map (resources => toMutable(resources)) .flatMap(_.toIterator) .groupBy(_._1) // group by resource name .map { case (rName, rInfoArr) => - rName -> rInfoArr.map(_._2).reduce(_ + _) -}.map { case (k, v) => (k, v.toResourceInformation) } + rName -> rInfoArr.map(_._2.addresses.size).sum +} formatResourcesUsed(totalInfo, usedInfo) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (adc8d68 -> 514bf56)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from adc8d68 [SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV datasources v1 and v2 add 514bf56 [SPARK-32823][WEB UI] Fix the master ui resources reporting No new revisions were added by this update. Summary of changes: .../org/apache/spark/deploy/StandaloneResourceUtils.scala | 10 +- .../scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 10 -- 2 files changed, 9 insertions(+), 11 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (e8634d8 -> adc8d68)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from e8634d8 [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config add adc8d68 [SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV datasources v1 and v2 No new revisions were added by this update. Summary of changes: .../sql/execution/datasources/csv/CSVSuite.scala | 13 .../sql/execution/datasources/json/JsonSuite.scala | 13 .../sql/test/DataFrameReaderWriterSuite.scala | 23 -- 3 files changed, 26 insertions(+), 23 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new e86d90b [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config e86d90b is described below commit e86d90b21d4b3d6658b3cb6dd30daafb32b0c1bd Author: Thomas Graves AuthorDate: Wed Sep 9 10:28:40 2020 +0900 [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config ### What changes were proposed in this pull request? If the user forgets to specify .amount on a resource config like spark.executor.resource.gpu, the error message thrown is very confusing: ``` ERROR SparkContext: Error initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1967) at ``` This makes it so we have a readable error thrown ### Why are the changes needed? confusing error for users ### Does this PR introduce _any_ user-facing change? just error message ### How was this patch tested? Tested manually on standalone cluster Closes #29685 from tgravescs/SPARK-32824. Authored-by: Thomas Graves Signed-off-by: HyukjinKwon (cherry picked from commit e8634d8f6f8548852a284a32c1b7da24bedd8ff7) Signed-off-by: HyukjinKwon --- core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala b/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala index 994b363..16fe897 100644 --- a/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala +++ b/core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala @@ -148,7 +148,12 @@ private[spark] object ResourceUtils extends Logging { def listResourceIds(sparkConf: SparkConf, componentName: String): Seq[ResourceID] = { sparkConf.getAllWithPrefix(s"$componentName.$RESOURCE_PREFIX.").map { case (key, _) => - key.substring(0, key.indexOf('.')) + val index = key.indexOf('.') + if (index < 0) { +throw new SparkException(s"You must specify an amount config for resource: $key " + + s"config: $componentName.$RESOURCE_PREFIX.$key") + } + key.substring(0, index) }.distinct.map(name => new ResourceID(componentName, name)) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (96ff87dc -> e8634d8)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 96ff87dc  [SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test
     add e8634d8   [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-32823][WEB UI] Fix the master ui resources reporting
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 86b9dd9  [SPARK-32823][WEB UI] Fix the master ui resources reporting

86b9dd9 is described below

commit 86b9dd9b414cab9489f00ef7acec5c3264fa312d
Author: Thomas Graves
AuthorDate: Wed Sep 9 10:33:21 2020 +0900

    [SPARK-32823][WEB UI] Fix the master ui resources reporting

    ### What changes were proposed in this pull request?

    Fixes the master UI so that it properly sums the resource totals across multiple workers for the field: "Resources in use: 0 / 8 gpu".

    The bug is that the page created MutableResourceInfo instances and reduced them with the + operator. The + operator in MutableResourceInfo simply adds the addresses of one info to the addresses of the other, but because it uses a HashSet, identical addresses collapse and the correct amount is lost. For example, if worker1 has GPU addresses 0,1,2,3 and worker2 also has addresses 0,1,2,3, the UI reports only 4 GPUs in total when there are 8.

    In this case we do not need MutableResourceInfo at all, since we only want the sums of used and total, so this change removes its use here. The other uses are per worker, so those should be fine.

    ### Why are the changes needed?

    To fix the UI.

    ### Does this PR introduce _any_ user-facing change?

    UI only.

    ### How was this patch tested?

    Tested manually on a standalone cluster with multiple workers, multiple GPUs, and multiple FPGAs.

    Closes #29683 from tgravescs/SPARK-32823.

    Lead-authored-by: Thomas Graves
    Co-authored-by: Thomas Graves
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 514bf563a7fa1f470b2ab088c0838317500a9aab)
    Signed-off-by: HyukjinKwon
---
 .../org/apache/spark/deploy/StandaloneResourceUtils.scala    | 10 +-
 .../scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 10 --
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala b/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala
index e08709e..c7c31a8 100644
--- a/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala
@@ -149,11 +149,11 @@ private[spark] object StandaloneResourceUtils extends Logging {
 
   // used for UI
   def formatResourcesUsed(
-      resourcesTotal: Map[String, ResourceInformation],
-      resourcesUsed: Map[String, ResourceInformation]): String = {
-    resourcesTotal.map { case (rName, rInfo) =>
-      val used = resourcesUsed(rName).addresses.length
-      val total = rInfo.addresses.length
+      resourcesTotal: Map[String, Int],
+      resourcesUsed: Map[String, Int]): String = {
+    resourcesTotal.map { case (rName, totalSize) =>
+      val used = resourcesUsed(rName)
+      val total = totalSize
       s"$used / $total $rName"
     }.mkString(", ")
   }

diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index f64b449..fcbeba9 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -76,19 +76,17 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") {
 
   private def formatMasterResourcesInUse(aliveWorkers: Array[WorkerInfo]): String = {
     val totalInfo = aliveWorkers.map(_.resourcesInfo)
-      .map(resources => toMutable(resources))
       .flatMap(_.toIterator)
       .groupBy(_._1) // group by resource name
       .map { case (rName, rInfoArr) =>
-        rName -> rInfoArr.map(_._2).reduce(_ + _)
-      }.map { case (k, v) => (k, v.toResourceInformation) }
+        rName -> rInfoArr.map(_._2.addresses.size).sum
+      }
     val usedInfo = aliveWorkers.map(_.resourcesInfoUsed)
-      .map(resources => toMutable(resources))
       .flatMap(_.toIterator)
       .groupBy(_._1) // group by resource name
       .map { case (rName, rInfoArr) =>
-        rName -> rInfoArr.map(_._2).reduce(_ + _)
-      }.map { case (k, v) => (k, v.toResourceInformation) }
+        rName -> rInfoArr.map(_._2.addresses.size).sum
+      }
     formatResourcesUsed(totalInfo, usedInfo)
   }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
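[Editor's note: to see the undercounting concretely, here is a small, self-contained Scala sketch of the two aggregation strategies the commit message contrasts. The worker data and names are invented for illustration; this is not Spark's WorkerInfo/MutableResourceInfo API, only the set-union-versus-sum arithmetic behind the bug.]

```scala
object ResourceCountSketch {
  // Two workers that each expose four GPUs with the same local addresses.
  val workers: Seq[Map[String, Set[String]]] = Seq(
    Map("gpu" -> Set("0", "1", "2", "3")),
    Map("gpu" -> Set("0", "1", "2", "3")))

  def main(args: Array[String]): Unit = {
    // Buggy aggregation: union the address sets, then count. Identical
    // addresses collapse in the set, so two 4-GPU workers report 4, not 8.
    val unioned = workers.flatten.groupBy(_._1)
      .map { case (name, infos) => name -> infos.map(_._2).reduce(_ ++ _).size }

    // Fixed aggregation: sum the per-worker set sizes instead.
    val summed = workers.flatten.groupBy(_._1)
      .map { case (name, infos) => name -> infos.map(_._2.size).sum }

    println(s"union-based total: $unioned") // Map(gpu -> 4)  (wrong)
    println(s"sum-based total:   $summed")  // Map(gpu -> 8)  (correct)
  }
}
```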
[GitHub] [spark-website] zhengruifeng commented on a change in pull request #289: Release 3.0.1
zhengruifeng commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r485289490

## File path: releases/_posts/2020-09-08-spark-release-3-0-1.md

@@ -0,0 +1,52 @@
+---
+layout: post
+title: Spark Release 3.0.1
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Spark 3.0.1 is a maintenance release containing stability fixes. This release is based on the branch-3.0 maintenance branch of Spark. We strongly recommend all 3.0 users to upgrade to this stable release.
+
+### Notable changes
+  - [[SPARK-26905]](https://issues.apache.org/jira/browse/SPARK-26905): Revisit reserved/non-reserved keywords based on the ANSI SQL standard
+  - [[SPARK-31220]](https://issues.apache.org/jira/browse/SPARK-31220): repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled
+  - [[SPARK-31703]](https://issues.apache.org/jira/browse/SPARK-31703): Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
+  - [[SPARK-31915]](https://issues.apache.org/jira/browse/SPARK-31915): Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
+  - [[SPARK-31923]](https://issues.apache.org/jira/browse/SPARK-31923): Event log cannot be generated when some internal accumulators use unexpected types
+  - [[SPARK-31935]](https://issues.apache.org/jira/browse/SPARK-31935): Hadoop file system config should be effective in data source options
+  - [[SPARK-31968]](https://issues.apache.org/jira/browse/SPARK-31968): write.partitionBy() creates duplicate subdirectories when user provides duplicate columns
+  - [[SPARK-31983]](https://issues.apache.org/jira/browse/SPARK-31983): Tables of structured streaming tab show wrong result for duration column
+  - [[SPARK-31990]](https://issues.apache.org/jira/browse/SPARK-31990): Streaming's state store compatibility is broken
+  - [[SPARK-32003]](https://issues.apache.org/jira/browse/SPARK-32003): Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
+  - [[SPARK-32038]](https://issues.apache.org/jira/browse/SPARK-32038): Regression in handling NaN values in COUNT(DISTINCT)
+  - [[SPARK-32073]](https://issues.apache.org/jira/browse/SPARK-32073): Drop R < 3.5 support
+  - [[SPARK-32092]](https://issues.apache.org/jira/browse/SPARK-32092): CrossvalidatorModel does not save all submodels (it saves only 3)
+  - [[SPARK-32136]](https://issues.apache.org/jira/browse/SPARK-32136): Spark producing incorrect groupBy results when key is a struct with nullable properties
+  - [[SPARK-32148]](https://issues.apache.org/jira/browse/SPARK-32148): LEFT JOIN generating non-deterministic and unexpected result (regression in Spark 3.0)
+  - [[SPARK-32220]](https://issues.apache.org/jira/browse/SPARK-32220): Cartesian Product Hint cause data error
+  - [[SPARK-32310]](https://issues.apache.org/jira/browse/SPARK-32310): ML params default value parity
+  - [[SPARK-32339]](https://issues.apache.org/jira/browse/SPARK-32339): Improve MLlib BLAS native acceleration docs
+  - [[SPARK-32424]](https://issues.apache.org/jira/browse/SPARK-32424): Fix silent data change for timestamp parsing if overflow happens
+  - [[SPARK-32451]](https://issues.apache.org/jira/browse/SPARK-32451): Support Apache Arrow 1.0.0 in SparkR
+  - [[SPARK-32456]](https://issues.apache.org/jira/browse/SPARK-32456): Check the Distinct by assuming it as Aggregate for Structured Streaming
+  - [[SPARK-32608]](https://issues.apache.org/jira/browse/SPARK-32608): Script Transform DELIMIT value should be formatted
+  - [[SPARK-32646]](https://issues.apache.org/jira/browse/SPARK-32646): ORC predicate pushdown should work with case-insensitive analysis
+  - [[SPARK-32658]](https://issues.apache.org/jira/browse/SPARK-32658): Partition length number overflow in PartitionWriterStream
+  - [[SPARK-32676]](https://issues.apache.org/jira/browse/SPARK-32676): Fix double caching in KMeans/BiKMeans
+
+
+### Known issues
+  - [[SPARK-31511]](https://issues.apache.org/jira/browse/SPARK-31511): Make BytesToBytesMap iterator() thread-safe

Review comment:
  The four known issues here were marked as fixed in their tickets, so I added this phrase for all of them.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (aa87b0a -> 96ff87dc)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from aa87b0a  [SPARK-32815][ML] Fix LibSVM data source loading error on file paths with glob metacharacters
     add 96ff87dc  [SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test

No new revisions were added by this update.

Summary of changes:
 .../execution/adaptive/AdaptiveQueryExecSuite.scala | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-32638][SQL][3.0] Corrects references when adding aliases in WidenSetOperationTypes
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 3f20f14  [SPARK-32638][SQL][3.0] Corrects references when adding aliases in WidenSetOperationTypes

3f20f14 is described below

commit 3f20f1413ccb1f685b96de96e5e705db6a70e058
Author: Wenchen Fan
AuthorDate: Wed Sep 9 08:21:59 2020 +0900

    [SPARK-32638][SQL][3.0] Corrects references when adding aliases in WidenSetOperationTypes

    ### What changes were proposed in this pull request?

    This PR intends to fix a bug where references can go missing when `WidenSetOperationTypes` adds aliases to widen data types. For example:

    ```
    CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
    SELECT t.v FROM (
      SELECT v FROM t3
      UNION ALL
      SELECT v + v AS v FROM t3
    ) t;

    org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1].
    Attribute(s) with the same name appear in the operation: v.
    Please check if the right attribute(s) are used.;;
    !Project [v#1]   <-- the reference went missing
    +- SubqueryAlias t
       +- Union
          :- Project [cast(v#1 as decimal(11,0)) AS v#3]
          :  +- Project [v#1]
          :     +- SubqueryAlias t3
          :        +- SubqueryAlias tbl
          :           +- LocalRelation [v#1]
          +- Project [v#2]
             +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
                +- SubqueryAlias t3
                   +- SubqueryAlias tbl
                      +- LocalRelation [v#1]
    ```

    In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and the reference in the top `Project` then went missing. This PR corrects the reference (the `exprId` and the widened `dataType`) after adding aliases in the rule.

    This backport for 3.0 comes from #29485 and #29643.

    ### Why are the changes needed?

    Bug fixes.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Added unit tests

    Closes #29680 from maropu/SPARK-32638-BRANCH3.0.

    Lead-authored-by: Wenchen Fan
    Co-authored-by: Takeshi Yamamuro
    Signed-off-by: Takeshi Yamamuro
---
 .../spark/sql/catalyst/analysis/Analyzer.scala     | 109 ++---
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  59 +++
 .../spark/sql/catalyst/plans/QueryPlan.scala       |  85 
 .../catalyst/plans/logical/AnalysisHelper.scala    |  15 ++-
 .../sql/catalyst/analysis/TypeCoercionSuite.scala  |  18 +++-
 .../src/test/resources/sql-tests/inputs/except.sql |  19 
 .../resources/sql-tests/inputs/intersect-all.sql   |  15 +++
 .../src/test/resources/sql-tests/inputs/union.sql  |  14 +++
 .../resources/sql-tests/results/except.sql.out     |  58 ++-
 .../sql-tests/results/intersect-all.sql.out        |  42 +++-
 .../test/resources/sql-tests/results/union.sql.out |  43 +++-
 11 files changed, 348 insertions(+), 129 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 89454c2..729d316 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1239,108 +1239,13 @@ class Analyzer(
       if (conflictPlans.isEmpty) {
         right
       } else {
-        rewritePlan(right, conflictPlans.toMap)._1
-      }
-    }
-
-    private def rewritePlan(plan: LogicalPlan, conflictPlanMap: Map[LogicalPlan, LogicalPlan])
-      : (LogicalPlan, Seq[(Attribute, Attribute)]) = {
-      if (conflictPlanMap.contains(plan)) {
-        // If the plan is the one that conflict the with left one, we'd
-        // just replace it with the new plan and collect the rewrite
-        // attributes for the parent node.
-        val newRelation = conflictPlanMap(plan)
-        newRelation -> plan.output.zip(newRelation.output)
-      } else {
-        val attrMapping = new mutable.ArrayBuffer[(Attribute, Attribute)]()
-        val newPlan = plan.mapChildren { child =>
-          // If not, we'd rewrite child plan recursively until we find the
-          // conflict node or reach the leaf node.
-          val (newChild, childAttrMapping) = rewritePlan(child, conflictPlanMap)
-          attrMapping ++= childAttrMapping.filter { case (oldAttr, _) =>
-            // `attrMapping` is not only used to replace the attributes of the current `plan`,
-            // but also to be propagated to the parent plans of the current `
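[Editor's note: the reproduction in the commit message can be replayed from a Scala spark-shell; the snippet below only restates that SQL (`spark` is the ambient SparkSession). On an unpatched 3.0 build this should fail with the AnalysisException shown above, and succeed once the fix is in place.]

```scala
// Replays the reproduction from the commit message.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v)")
spark.sql(
  """SELECT t.v FROM (
    |  SELECT v FROM t3
    |  UNION ALL
    |  SELECT v + v AS v FROM t3
    |) t""".stripMargin).show()
```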
[spark] branch branch-2.4 updated: [SPARK-32815][ML][2.4] Fix LibSVM data source loading error on file paths with glob metacharacters
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new ef24542  [SPARK-32815][ML][2.4] Fix LibSVM data source loading error on file paths with glob metacharacters

ef24542 is described below

commit ef24542721bd1107339c6efef8f14a970e429384
Author: Max Gekk
AuthorDate: Tue Sep 8 14:17:18 2020 +

[SPARK-32815][ML][2.4] Fix LibSVM data source loading error on file paths with glob metacharacters

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the LibSVM datasource when both of the following are true:
* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`, `{`, `}`, `*`, etc.

The fix is a backport of https://github.com/apache/spark/pull/29675, and it is based on another bug fix for the CSV/JSON datasources, https://github.com/apache/spark/pull/29663.

### Why are the changes needed?
To fix the issue where the following query tries to read from the path `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but ends up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-0-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-0-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Closes #29678 from MaxGekk/globbing-paths-when-inferring-schema-ml-2.4.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
---
 .../apache/spark/ml/source/libsvm/LibSVMRelation.scala |  2 +-
 .../scala/org/apache/spark/mllib/util/MLUtils.scala    |  3 ++-
 .../spark/ml/source/libsvm/LibSVMRelationSuite.scala   | 18 ++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
index 39dcd91..5795812 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
@@ -99,7 +99,7 @@ private[libsvm] class LibSVMFileFormat
         "though the input. If you know the number in advance, please specify it via " +
         "'numFeatures' option to avoid the extra scan.")

-      val paths = files.map(_.getPath.toUri.toString)
+      val paths = files.map(_.getPath.toString)
       val parsed = MLUtils.parseLibSVMFile(sparkSession, paths)
       MLUtils.computeNumFeatures(parsed)
     }

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
index 14af8b5..c8550cd 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
@@ -110,7 +110,8 @@ object MLUtils extends Logging {
       DataSource.apply(
         sparkSession,
         paths = paths,
-        className = classOf[TextFileFormat].getName
+        className = classOf[TextFileFormat].getName,
+        options = Map(DataSource.GLOB_PATHS_KEY -> "false")
       ).resolveRelation(checkFilesExist = false))
       .select("value")

diff --git a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala
index 3eabff4..28c770c 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala
@@ -184,4 +184,22 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       spark.sql("DROP TABLE IF EXISTS libsvmTable")
     }
   }
+
+  test("SPARK-32815: Test LibSVM data source on file paths with glob metacharacters") {
+    val basePath = Utils.createDirectory(tempDir.getCanonicalPath, "globbing")
+    // test libsvm writer / reader without specifying schema
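The new test exercises exactly the failure mode from the commit message. A rough reproduction outside the test suite might look like the sketch below, assuming a spark-shell session and a writable `/tmp`; the directory name and data are illustrative:

```scala
// Run in spark-shell; the directory name and data are illustrative.
import org.apache.spark.ml.linalg.Vectors

// Write a single LibSVM row under a directory whose name contains glob metacharacters.
val dir = "/tmp/[abc]"
Seq((1.0, Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))))
  .toDF("label", "features")
  .write.format("libsvm").save(dir)

// Read it back with the metacharacters escaped so they match literally.
// Before the fix, schema inference re-globbed the already-resolved paths and failed.
spark.read.format("libsvm").load("""/tmp/\[abc\]""").show()
```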
[spark] branch branch-3.0 updated (9b39e4b -> 8c0b9cb)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 9b39e4b  [SPARK-32753][SQL][3.0] Only copy tags to node with no tags
 add 8c0b9cb  [SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters

No new revisions were added by this update.

Summary of changes:
 .../spark/ml/source/libsvm/LibSVMRelation.scala      |  2 +-
 .../scala/org/apache/spark/mllib/util/MLUtils.scala  |  3 ++-
 .../spark/ml/source/libsvm/LibSVMRelationSuite.scala | 20
 3 files changed, 23 insertions(+), 2 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 8c0b9cb  [SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters

8c0b9cb is described below

commit 8c0b9cbf68693db22314637a75f28e5aa954aff8
Author: Max Gekk
AuthorDate: Tue Sep 8 14:16:13 2020 +

[SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the LibSVM datasource when both of the following are true:
* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`, `{`, `}`, `*`, etc.

The fix is a backport of https://github.com/apache/spark/pull/29670, and it is based on another bug fix for the CSV/JSON datasources, https://github.com/apache/spark/pull/29659.

### Why are the changes needed?
To fix the issue where the following query tries to read from the path `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but ends up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-0-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-0-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Closes #29675 from MaxGekk/globbing-paths-when-inferring-schema-ml-3.0.

Authored-by: Max Gekk
Signed-off-by: Wenchen Fan
---
 .../spark/ml/source/libsvm/LibSVMRelation.scala      |  2 +-
 .../scala/org/apache/spark/mllib/util/MLUtils.scala  |  3 ++-
 .../spark/ml/source/libsvm/LibSVMRelationSuite.scala | 20
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
index da8f3a24f..11be1d8 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
@@ -100,7 +100,7 @@ private[libsvm] class LibSVMFileFormat
         "though the input. If you know the number in advance, please specify it via " +
         "'numFeatures' option to avoid the extra scan.")

-      val paths = files.map(_.getPath.toUri.toString)
+      val paths = files.map(_.getPath.toString)
       val parsed = MLUtils.parseLibSVMFile(sparkSession, paths)
       MLUtils.computeNumFeatures(parsed)
     }

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
index 9198334..2411300 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
@@ -110,7 +110,8 @@ object MLUtils extends Logging {
       DataSource.apply(
         sparkSession,
         paths = paths,
-        className = classOf[TextFileFormat].getName
+        className = classOf[TextFileFormat].getName,
+        options = Map(DataSource.GLOB_PATHS_KEY -> "false")
       ).resolveRelation(checkFilesExist = false))
       .select("value")

diff --git a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala
index 263ad26..0999892 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala
@@ -191,4 +191,24 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       spark.sql("DROP TABLE IF EXISTS libsvmTable")
     }
   }
+
+  test("SPARK-32815: Test LibSVM data source on file paths with glob metacharacters") {
+    withTempDir { dir =>
+      val basePath = dir.getCanonicalPath
+      // test libsvm writer / reader without specifying schema
+      val svm
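Aside from the fix itself, the warning quoted in the `LibSVMRelation` hunk points at a useful knob: when the feature dimension is known up front, the `numFeatures` option lets the reader skip the extra scan that schema inference otherwise performs. A small usage sketch; the path and feature count are illustrative:

```scala
// Run in spark-shell; path and feature count are illustrative.
val df = spark.read
  .format("libsvm")
  .option("numFeatures", "780")  // known dimension: skips the inference scan
  .load("data/mllib/sample_libsvm_data.txt")
df.printSchema()  // label: double, features: vector
```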
[spark] branch master updated (e7d9a245 -> aa87b0a)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from e7d9a245  [SPARK-32817][SQL] DPP throws error when broadcast side is empty
 add aa87b0a   [SPARK-32815][ML] Fix LibSVM data source loading error on file paths with glob metacharacters

No new revisions were added by this update.

Summary of changes:
 .../spark/ml/source/libsvm/LibSVMRelation.scala      |  2 +-
 .../scala/org/apache/spark/mllib/util/MLUtils.scala  |  3 ++-
 .../spark/ml/source/libsvm/LibSVMRelationSuite.scala | 20
 3 files changed, 23 insertions(+), 2 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-32753][SQL][3.0] Only copy tags to node with no tags
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 9b39e4b  [SPARK-32753][SQL][3.0] Only copy tags to node with no tags

9b39e4b is described below

commit 9b39e4b7aefcedf09764b3528a32bdf0b77e331b
Author: manuzhang
AuthorDate: Tue Sep 8 13:36:05 2020 +

[SPARK-32753][SQL][3.0] Only copy tags to node with no tags

This PR backports https://github.com/apache/spark/pull/29593 to branch-3.0.

### What changes were proposed in this pull request?
Only copy tags to a node with no tags when transforming plans.

### Why are the changes needed?
cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That can cause bugs such as duplicate rows when deduplicating and repartitioning by the same column with AQE.

```
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id")
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)
```

It's too expensive to detect node removal, so we make a compromise and only copy tags to a node with no tags.

### Does this PR introduce any user-facing change?
Yes. It fixes a bug.

### How was this patch tested?
Added a test.

Closes #29665 from manuzhang/spark-32753-3.0.

Authored-by: manuzhang
Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/catalyst/trees/TreeNode.scala   |  7 ++-
 .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala  | 16 +++-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
index c4a1067..4c74742 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
@@ -91,7 +91,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
   private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty

   protected def copyTagsFrom(other: BaseType): Unit = {
-    tags ++= other.tags
+    // SPARK-32753: it only makes sense to copy tags to a new node,
+    // but it's too expensive to detect other cases like node removal,
+    // so we make a compromise here to copy tags only to a node with no tags.
+    if (tags.isEmpty) {
+      tags ++= other.tags
+    }
   }

   def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = {

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
index e18adbd..6d97a6b 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
 import org.apache.spark.sql.execution.{PartialReducerPartitionSpec, ReusedSubqueryExec, ShuffledRowRDD, SparkPlan}
 import org.apache.spark.sql.execution.adaptive.OptimizeLocalShuffleReader.LOCAL_SHUFFLE_READER_DESCRIPTION
 import org.apache.spark.sql.execution.command.DataWritingCommandExec
-import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, Exchange, ReusedExchangeExec}
+import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, Exchange, ReusedExchangeExec, ShuffleExchangeExec}
 import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BuildRight, SortMergeJoinExec}
 import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecu
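The guard in `copyTagsFrom` is small but easy to misread: only a freshly created, untagged node inherits tags, while a node that already carries tags keeps them untouched. A self-contained sketch of the same semantics on a toy node type (`TinyNode` and its string-keyed tag map are hypothetical simplifications of Spark's `TreeNode`, not its real API):

```scala
import scala.collection.mutable

// Hypothetical miniature of TreeNode's tagging, to illustrate the SPARK-32753 guard.
final class TinyNode(val name: String) {
  val tags: mutable.Map[String, Any] = mutable.Map.empty

  def copyTagsFrom(other: TinyNode): Unit = {
    // Copy only into an untagged node; an already-tagged node is assumed to be
    // an existing node whose tags must not be overwritten or merged.
    if (tags.isEmpty) {
      tags ++= other.tags
    }
  }
}

val original = new TinyNode("scan")
original.tags("reuse") = true

val fresh = new TinyNode("scan'")   // newly created during a transform
fresh.copyTagsFrom(original)        // inherits: Map(reuse -> true)

val existing = new TinyNode("agg")  // pre-existing node with its own tags
existing.tags("stage") = 0
existing.copyTagsFrom(original)     // unchanged: Map(stage -> 0)
```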
[spark] branch branch-3.0 updated: [SPARK-32753][SQL][3.0] Only copy tags to node with no tags
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 9b39e4b [SPARK-32753][SQL][3.0] Only copy tags to node with no tags 9b39e4b is described below commit 9b39e4b7aefcedf09764b3528a32bdf0b77e331b Author: manuzhang AuthorDate: Tue Sep 8 13:36:05 2020 + [SPARK-32753][SQL][3.0] Only copy tags to node with no tags This PR backports https://github.com/apache/spark/pull/29593 to branch-3.0 ### What changes were proposed in this pull request? Only copy tags to node with no tags when transforming plans. ### Why are the changes needed? cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE. ``` spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan) // With AQE [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) // Without AQE [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) ``` It's too expensive to detect node removal so we make a compromise only to copy tags to node with no tags. ### Does this PR introduce any user-facing change? Yes. Fix a bug. ### How was this patch tested? Add test. Closes #29665 from manuzhang/spark-32753-3.0. 
Authored-by: manuzhang Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/trees/TreeNode.scala | 7 ++- .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala | 16 +++- 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index c4a1067..4c74742 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -91,7 +91,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty protected def copyTagsFrom(other: BaseType): Unit = { -tags ++= other.tags +// SPARK-32753: it only makes sense to copy tags to a new node +// but it's too expensive to detect other cases likes node removal +// so we make a compromise here to copy tags to node with no tags +if (tags.isEmpty) { + tags ++= other.tags +} } def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala index e18adbd..6d97a6b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan} import org.apache.spark.sql.execution.{PartialReducerPartitionSpec, ReusedSubqueryExec, ShuffledRowRDD, SparkPlan} import org.apache.spark.sql.execution.adaptive.OptimizeLocalShuffleReader.LOCAL_SHUFFLE_READER_DESCRIPTION import org.apache.spark.sql.execution.command.DataWritingCommandExec -import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, Exchange, ReusedExchangeExec} +import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, Exchange, ReusedExchangeExec, ShuffleExchangeExec} import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BuildRight, SortMergeJoinExec} import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecu
[spark] branch branch-3.0 updated: [SPARK-32753][SQL][3.0] Only copy tags to node with no tags
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 9b39e4b [SPARK-32753][SQL][3.0] Only copy tags to node with no tags 9b39e4b is described below commit 9b39e4b7aefcedf09764b3528a32bdf0b77e331b Author: manuzhang AuthorDate: Tue Sep 8 13:36:05 2020 + [SPARK-32753][SQL][3.0] Only copy tags to node with no tags This PR backports https://github.com/apache/spark/pull/29593 to branch-3.0 ### What changes were proposed in this pull request? Only copy tags to node with no tags when transforming plans. ### Why are the changes needed? cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE. ``` spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan) // With AQE [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) // Without AQE [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) ``` It's too expensive to detect node removal so we make a compromise only to copy tags to node with no tags. ### Does this PR introduce any user-facing change? Yes. Fix a bug. ### How was this patch tested? Add test. Closes #29665 from manuzhang/spark-32753-3.0. 
Authored-by: manuzhang Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/trees/TreeNode.scala | 7 ++- .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala | 16 +++- 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala index c4a1067..4c74742 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala @@ -91,7 +91,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product { private val tags: mutable.Map[TreeNodeTag[_], Any] = mutable.Map.empty protected def copyTagsFrom(other: BaseType): Unit = { -tags ++= other.tags +// SPARK-32753: it only makes sense to copy tags to a new node +// but it's too expensive to detect other cases likes node removal +// so we make a compromise here to copy tags to node with no tags +if (tags.isEmpty) { + tags ++= other.tags +} } def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala index e18adbd..6d97a6b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan} import org.apache.spark.sql.execution.{PartialReducerPartitionSpec, ReusedSubqueryExec, ShuffledRowRDD, SparkPlan} import org.apache.spark.sql.execution.adaptive.OptimizeLocalShuffleReader.LOCAL_SHUFFLE_READER_DESCRIPTION import org.apache.spark.sql.execution.command.DataWritingCommandExec -import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, Exchange, ReusedExchangeExec} +import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, Exchange, ReusedExchangeExec, ShuffleExchangeExec} import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BuildRight, SortMergeJoinExec} import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecu
[spark] branch branch-3.0 updated (4656ee5 -> 9b39e4b)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 4656ee5 [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe add 9b39e4b [SPARK-32753][SQL][3.0] Only copy tags to node with no tags No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/trees/TreeNode.scala | 7 ++- .../sql/execution/adaptive/AdaptiveQueryExecSuite.scala | 16 +++- 2 files changed, 21 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (bd3dc2f5 -> e7d9a245)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from bd3dc2f5 [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe add e7d9a245 [SPARK-32817][SQL] DPP throws error when broadcast side is empty No new revisions were added by this update. Summary of changes: .../spark/sql/execution/joins/HashedRelation.scala | 2 +- .../apache/spark/sql/DynamicPartitionPruningSuite.scala | 17 + .../spark/sql/execution/joins/HashedRelationSuite.scala | 6 +- 3 files changed, 23 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-32817][SQL] DPP throws error when broadcast side is empty
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e7d9a245  [SPARK-32817][SQL] DPP throws error when broadcast side is empty
e7d9a245 is described below

commit e7d9a245656655e7bb1df3e04df30eb3cc9e23ad
Author: Zhenhua Wang
AuthorDate: Tue Sep 8 21:36:21 2020 +0900

    [SPARK-32817][SQL] DPP throws error when broadcast side is empty

    ### What changes were proposed in this pull request?

    In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw `UnsupportedOperationException`.

    ### Why are the changes needed?

    To fix a bug.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added a new test.

    Closes #29671 from wzhfy/dpp_empty_broadcast.

Authored-by: Zhenhua Wang
Signed-off-by: Takeshi Yamamuro
---
 .../spark/sql/execution/joins/HashedRelation.scala      |  2 +-
 .../apache/spark/sql/DynamicPartitionPruningSuite.scala | 17 +
 .../spark/sql/execution/joins/HashedRelationSuite.scala |  6 +-
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
index 89836f6..3c5ed40 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@@ -1091,7 +1091,7 @@ case object EmptyHashedRelation extends HashedRelation {
   override def keyIsUnique: Boolean = true

   override def keys(): Iterator[InternalRow] = {
-    throw new UnsupportedOperationException
+    Iterator.empty
   }

   override def close(): Unit = {}

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
index ba61be5..55437aa 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala
@@ -1344,6 +1344,23 @@ abstract class DynamicPartitionPruningSuiteBase
       }
     }
   }
+
+  test("SPARK-32817: DPP throws error when the broadcast side is empty") {
+    withSQLConf(
+      SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true",
+      SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST_ONLY.key -> "true") {
+      val df = sql(
+        """
+          |SELECT * FROM fact_sk f
+          |JOIN dim_store s
+          |ON f.store_id = s.store_id WHERE s.country = 'XYZ'
+        """.stripMargin)
+
+      checkPartitionPruningPredicate(df, false, true)
+
+      checkAnswer(df, Nil)
+    }
+  }
 }

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index caa7bdf..84f6299 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
@@ -621,7 +621,7 @@ class HashedRelationSuite extends SharedSparkSession {
     }
   }

-  test("EmptyHashedRelation return null in get / getValue") {
+  test("EmptyHashedRelation override methods behavior test") {
     val buildKey = Seq(BoundReference(0, LongType, false))
     val hashed = HashedRelation(Seq.empty[InternalRow].toIterator, buildKey, 1, mm)
     assert(hashed == EmptyHashedRelation)
@@ -631,6 +631,10 @@ class HashedRelationSuite extends SharedSparkSession {
     assert(hashed.get(key) == null)
     assert(hashed.getValue(0L) == null)
     assert(hashed.getValue(key) == null)
+
+    assert(hashed.keys().isEmpty)
+    assert(hashed.keyIsUnique)
+    assert(hashed.estimatedSize == 0)
   }

   test("SPARK-32399: test methods related to key index") {

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
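In effect, the fix gives `EmptyHashedRelation` null-object semantics: every probe answers "nothing" rather than throwing, so DPP collects zero pruning keys and the query returns an empty result instead of failing. A rough sketch of that contract, using simplified hypothetical types rather than Spark's real `HashedRelation` API:

```scala
// Toy interface; Spark's real HashedRelation operates on InternalRow over
// memory-managed storage.
trait Relation {
  def get(key: Long): Option[Seq[Long]]
  def keys(): Iterator[Long]
}

case object EmptyRelation extends Relation {
  override def get(key: Long): Option[Seq[Long]] = None

  // The essence of the fix: return an empty iterator instead of throwing
  // UnsupportedOperationException, so callers that enumerate keys degrade
  // gracefully.
  override def keys(): Iterator[Long] = Iterator.empty
}

object EmptyRelationDemo extends App {
  // A consumer in the spirit of SubqueryBroadcastExec.relationFuture: it
  // enumerates the broadcast side's keys to build partition-pruning filters.
  def pruningKeys(rel: Relation): Seq[Long] = rel.keys().toSeq

  // With an empty broadcast side there are simply no keys to prune on.
  assert(pruningKeys(EmptyRelation).isEmpty)
}
```

This matches the assertions added to `HashedRelationSuite`: an empty relation answers null/empty for every lookup, reports `keyIsUnique`, and estimates size zero.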
[spark] branch branch-3.0 updated (3b32ddf -> 4656ee5)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 3b32ddf [SPARK-32785][SQL][3.0] Interval with dangling parts should not result null add 4656ee5 [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe No new revisions were added by this update. Summary of changes: .../sql/execution/joins/HashedRelationSuite.scala | 39 ++ 1 file changed, 39 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (55d38a4 -> bd3dc2f5)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 55d38a4 [SPARK-32748][SQL] Revert "Support local property propagation in SubqueryBroadcastExec" add bd3dc2f5 [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe No new revisions were added by this update. Summary of changes: .../sql/execution/joins/HashedRelationSuite.scala | 39 ++ 1 file changed, 39 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new bd3dc2f5  [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe
bd3dc2f5 is described below

commit bd3dc2f54d871d152331612c53f586181f4e87fc
Author: sychen
AuthorDate: Tue Sep 8 11:54:04 2020 +

    [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe

    ### What changes were proposed in this pull request?

    Before SPARK-31511 was fixed, `BytesToBytesMap.iterator()` was not thread-safe and could produce inaccurate data. We need a unit test covering it.

    ### Why are the changes needed?

    Increase test coverage to ensure that iterator() is thread-safe.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added a unit test.

    Closes #29669 from cxzl25/SPARK-31511-test.

Authored-by: sychen
Signed-off-by: Wenchen Fan
---
 .../sql/execution/joins/HashedRelationSuite.scala | 39 ++
 1 file changed, 39 insertions(+)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
index 72e921d..caa7bdf 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
@@ -360,6 +360,45 @@ class HashedRelationSuite extends SharedSparkSession {
     assert(java.util.Arrays.equals(os.toByteArray, os2.toByteArray))
   }

+  test("SPARK-31511: Make BytesToBytesMap iterators thread-safe") {
+    val ser = sparkContext.env.serializer.newInstance()
+    val key = Seq(BoundReference(0, LongType, false))
+
+    val unsafeProj = UnsafeProjection.create(
+      Seq(BoundReference(0, LongType, false), BoundReference(1, IntegerType, true)))
+    val rows = (0 until 1).map(i => unsafeProj(InternalRow(Int.int2long(i), i + 1)).copy())
+    val unsafeHashed = UnsafeHashedRelation(rows.iterator, key, 1, mm)
+
+    val os = new ByteArrayOutputStream()
+    val thread1 = new Thread {
+      override def run(): Unit = {
+        val out = new ObjectOutputStream(os)
+        unsafeHashed.asInstanceOf[UnsafeHashedRelation].writeExternal(out)
+        out.flush()
+      }
+    }
+
+    val thread2 = new Thread {
+      override def run(): Unit = {
+        val threadOut = new ObjectOutputStream(new ByteArrayOutputStream())
+        unsafeHashed.asInstanceOf[UnsafeHashedRelation].writeExternal(threadOut)
+        threadOut.flush()
+      }
+    }
+
+    thread1.start()
+    thread2.start()
+    thread1.join()
+    thread2.join()
+
+    val unsafeHashed2 = ser.deserialize[UnsafeHashedRelation](ser.serialize(unsafeHashed))
+    val os2 = new ByteArrayOutputStream()
+    val out2 = new ObjectOutputStream(os2)
+    unsafeHashed2.writeExternal(out2)
+    out2.flush()
+    assert(java.util.Arrays.equals(os.toByteArray, os2.toByteArray))
+  }
+
   // This test require 4G heap to run, should run it manually
   ignore("build HashedRelation that is larger than 1G") {
     val unsafeProj = UnsafeProjection.create(

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
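The guard itself lives in `BytesToBytesMap` (the original SPARK-31511 change, not shown in this follow-up); the test above only checks the observable outcome. As an assumed simplification of that pattern, the sketch below has two threads serialize one shared relation while a lock keeps each writer's view consistent. `SharedRelation` and everything in it are hypothetical names, not Spark API:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a shared hashed relation. In Spark the contended
// state is BytesToBytesMap's stateful iterator; here a single lock around
// the whole write models that guard.
final class SharedRelation(rows: Vector[(Long, Int)]) {
  private val lock = new Object

  def writeExternal(out: ObjectOutputStream): Unit = lock.synchronized {
    // In the unguarded real case, a stateful cursor over shared memory could
    // be advanced by both threads at once, corrupting one or both outputs.
    rows.foreach { case (k, v) =>
      out.writeLong(k)
      out.writeInt(v)
    }
    out.flush()
  }
}

object ThreadSafeWriteDemo extends App {
  val rel = new SharedRelation(Vector((0L, 1), (1L, 2), (2L, 3)))

  val outs = Seq.fill(2)(new ByteArrayOutputStream())
  val threads = outs.map { os =>
    new Thread(() => rel.writeExternal(new ObjectOutputStream(os)))
  }
  threads.foreach(_.start())
  threads.foreach(_.join())

  // Mirrors the test's assertion: concurrent writers must produce
  // byte-identical serializations of the same relation.
  assert(java.util.Arrays.equals(outs(0).toByteArray, outs(1).toByteArray))
}
```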
[GitHub] [spark-website] maropu commented on a change in pull request #289: Release 3.0.1
maropu commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r484854079

## File path: releases/_posts/2020-09-08-spark-release-3-0-1.md

@@ -0,0 +1,52 @@
+---
+layout: post
+title: Spark Release 3.0.1
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Spark 3.0.1 is a maintenance release containing stability fixes. This release is based on the branch-3.0 maintenance branch of Spark. We strongly recommend all 3.0 users to upgrade to this stable release.
+
+### Notable changes
+ - [[SPARK-26905]](https://issues.apache.org/jira/browse/SPARK-26905): Revisit reserved/non-reserved keywords based on the ANSI SQL standard
+ - [[SPARK-31220]](https://issues.apache.org/jira/browse/SPARK-31220): repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled

Review comment:
    Ah, ok.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] cloud-fan commented on a change in pull request #289: Release 3.0.1
cloud-fan commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r484851836

## File path: releases/_posts/2020-09-08-spark-release-3-0-1.md

@@ -0,0 +1,52 @@
+---
+layout: post
+title: Spark Release 3.0.1
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Spark 3.0.1 is a maintenance release containing stability fixes. This release is based on the branch-3.0 maintenance branch of Spark. We strongly recommend all 3.0 users to upgrade to this stable release.
+
+### Notable changes
+ - [[SPARK-26905]](https://issues.apache.org/jira/browse/SPARK-26905): Revisit reserved/non-reserved keywords based on the ANSI SQL standard
+ - [[SPARK-31220]](https://issues.apache.org/jira/browse/SPARK-31220): repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled

Review comment:
    seems like we don't have sub-sections in other patch releases either.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (125cbe3 -> 55d38a4)
This is an automated email from the ASF dual-hosted git repository. yamamuro pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 125cbe3 [SPARK-32736][CORE] Avoid caching the removed decommissioned executors in TaskSchedulerImpl add 55d38a4 [SPARK-32748][SQL] Revert "Support local property propagation in SubqueryBroadcastExec" No new revisions were added by this update. Summary of changes: .../sql/execution/SubqueryBroadcastExec.scala | 16 ++ .../sql/internal/ExecutorSideSQLConfSuite.scala| 63 +- 2 files changed, 7 insertions(+), 72 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] maropu commented on a change in pull request #289: Release 3.0.1
maropu commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r484800121

## File path: releases/_posts/2020-09-08-spark-release-3-0-1.md

@@ -0,0 +1,52 @@
+---
+layout: post
+title: Spark Release 3.0.1
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Spark 3.0.1 is a maintenance release containing stability fixes. This release is based on the branch-3.0 maintenance branch of Spark. We strongly recommend all 3.0 users to upgrade to this stable release.
+
+### Notable changes
+ - [[SPARK-26905]](https://issues.apache.org/jira/browse/SPARK-26905): Revisit reserved/non-reserved keywords based on the ANSI SQL standard
+ - [[SPARK-31220]](https://issues.apache.org/jira/browse/SPARK-31220): repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled
+ - [[SPARK-31703]](https://issues.apache.org/jira/browse/SPARK-31703): Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
+ - [[SPARK-31915]](https://issues.apache.org/jira/browse/SPARK-31915): Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
+ - [[SPARK-31923]](https://issues.apache.org/jira/browse/SPARK-31923): Event log cannot be generated when some internal accumulators use unexpected types
+ - [[SPARK-31935]](https://issues.apache.org/jira/browse/SPARK-31935): Hadoop file system config should be effective in data source options
+ - [[SPARK-31968]](https://issues.apache.org/jira/browse/SPARK-31968): write.partitionBy() creates duplicate subdirectories when user provides duplicate columns
+ - [[SPARK-31983]](https://issues.apache.org/jira/browse/SPARK-31983): Tables of structured streaming tab show wrong result for duration column
+ - [[SPARK-31990]](https://issues.apache.org/jira/browse/SPARK-31990): Streaming's state store compatibility is broken
+ - [[SPARK-32003]](https://issues.apache.org/jira/browse/SPARK-32003): Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
+ - [[SPARK-32038]](https://issues.apache.org/jira/browse/SPARK-32038): Regression in handling NaN values in COUNT(DISTINCT)
+ - [[SPARK-32073]](https://issues.apache.org/jira/browse/SPARK-32073): Drop R < 3.5 support
+ - [[SPARK-32092]](https://issues.apache.org/jira/browse/SPARK-32092): CrossvalidatorModel does not save all submodels (it saves only 3)
+ - [[SPARK-32136]](https://issues.apache.org/jira/browse/SPARK-32136): Spark producing incorrect groupBy results when key is a struct with nullable properties
+ - [[SPARK-32148]](https://issues.apache.org/jira/browse/SPARK-32148): LEFT JOIN generating non-deterministic and unexpected result (regression in Spark 3.0)
+ - [[SPARK-32220]](https://issues.apache.org/jira/browse/SPARK-32220): Cartesian Product Hint cause data error
+ - [[SPARK-32310]](https://issues.apache.org/jira/browse/SPARK-32310): ML params default value parity
+ - [[SPARK-32339]](https://issues.apache.org/jira/browse/SPARK-32339): Improve MLlib BLAS native acceleration docs
+ - [[SPARK-32424]](https://issues.apache.org/jira/browse/SPARK-32424): Fix silent data change for timestamp parsing if overflow happens
+ - [[SPARK-32451]](https://issues.apache.org/jira/browse/SPARK-32451): Support Apache Arrow 1.0.0 in SparkR
+ - [[SPARK-32456]](https://issues.apache.org/jira/browse/SPARK-32456): Check the Distinct by assuming it as Aggregate for Structured Streaming
+ - [[SPARK-32608]](https://issues.apache.org/jira/browse/SPARK-32608): Script Transform DELIMIT value should be formatted
+ - [[SPARK-32646]](https://issues.apache.org/jira/browse/SPARK-32646): ORC predicate pushdown should work with case-insensitive analysis
+ - [[SPARK-32658]](https://issues.apache.org/jira/browse/SPARK-32658): Partition length number overflow in PartitionWriterStream
+ - [[SPARK-32676]](https://issues.apache.org/jira/browse/SPARK-32676): Fix double caching in KMeans/BiKMeans
+
+
+### Known issues
+ - [[SPARK-31511]](https://issues.apache.org/jira/browse/SPARK-31511): Make BytesToBytesMap iterator() thread-safe

Review comment:
    To follow [the 3.0.0 release note](https://spark.apache.org/releases/spark-release-3-0-0.html), how about adding a phrase `This will be fixed in Spark 3.0.2.` for already-fixed issues?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] maropu commented on a change in pull request #289: Release 3.0.1
maropu commented on a change in pull request #289:
URL: https://github.com/apache/spark-website/pull/289#discussion_r484795335

## File path: releases/_posts/2020-09-08-spark-release-3-0-1.md

@@ -0,0 +1,52 @@
+---
+layout: post
+title: Spark Release 3.0.1
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Spark 3.0.1 is a maintenance release containing stability fixes. This release is based on the branch-3.0 maintenance branch of Spark. We strongly recommend all 3.0 users to upgrade to this stable release.
+
+### Notable changes
+ - [[SPARK-26905]](https://issues.apache.org/jira/browse/SPARK-26905): Revisit reserved/non-reserved keywords based on the ANSI SQL standard
+ - [[SPARK-31220]](https://issues.apache.org/jira/browse/SPARK-31220): repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled

Review comment:
    Could we split this `Notable changes` section into sub-sections like `SQL`, `ML`, ...?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org