[spark] branch branch-2.3 updated: [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 8821a8c  [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
8821a8c is described below

commit 8821a8cb8bd56e9960a67961fc604a42625438f9
Author: Dongjoon Hyun
AuthorDate: Fri Jul 12 18:40:07 2019 +0900

    [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed

    ## What changes were proposed in this pull request?

    `SizeBasedRollingPolicy.shouldRollover` returns false when the size is equal to `rolloverSizeBytes`.

    ```scala
      /** Should rollover if the next set of bytes is going to exceed the size limit */
      def shouldRollover(bytesToBeWritten: Long): Boolean = {
        logDebug(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
        bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
      }
    ```

    - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/

    ```
    org.scalatest.exceptions.TestFailedException: 1000 was not less than 1000
    ```

    ## How was this patch tested?

    Pass the Jenkins with the updated test.

    Closes #25125 from dongjoon-hyun/SPARK-28357.

    Authored-by: Dongjoon Hyun
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 1c29212394adcbde2de4f4dfdc43a1cf32671ae1)
    Signed-off-by: HyukjinKwon
---
 core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala b/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala
index 52cd537..1a3e880 100644
--- a/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala
@@ -128,7 +128,7 @@ class FileAppenderSuite extends SparkFunSuite with BeforeAndAfter with Logging {
     val files = testRolling(appender, testOutputStream, textToAppend, 0, isCompressed = true)
     files.foreach { file =>
       logInfo(file.toString + ": " + file.length + " bytes")
-      assert(file.length < rolloverSize)
+      assert(file.length <= rolloverSize)
     }
   }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
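The strict `>` comparison in `shouldRollover` means a rolled file can legitimately end up exactly `rolloverSizeBytes` long, which is why the test assertion had to become `<=`. A minimal Python sketch of the same check (illustrative names, not Spark's API):

```python
def should_rollover(bytes_to_write, bytes_written_since_rollover, rollover_size):
    """Roll over only when the next write would EXCEED the size limit.

    Because the comparison is strict (>), a file may end up exactly
    rollover_size bytes long -- hence `assert(file.length <= rolloverSize)`.
    """
    return bytes_to_write + bytes_written_since_rollover > rollover_size

print(should_rollover(100, 900, 1000))  # False: 1000 is not > 1000
print(should_rollover(101, 900, 1000))  # True: 1001 > 1000
```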
[spark] branch master updated (fe22faa -> 1c29212)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from fe22faa  [SPARK-28034][SQL][TEST] Port with.sql
 add 1c29212  [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed

No new revisions were added by this update.

Summary of changes:
 core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (1c29212 -> 13ae9eb)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 1c29212  [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
 add 13ae9eb  [SPARK-28354][INFRA] Use JIRA user name instead of JIRA user key

No new revisions were added by this update.

Summary of changes:
 dev/merge_spark_pr.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28270][TEST-MAVEN][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 27e41d6  [SPARK-28270][TEST-MAVEN][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql
27e41d6 is described below

commit 27e41d65f158e8f0b04e73df195d3651c7eae485
Author: HyukjinKwon
AuthorDate: Fri Jul 12 14:33:16 2019 +0900

    [SPARK-28270][TEST-MAVEN][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql

    ## What changes were proposed in this pull request?

    It can still be flaky on certain environments due to the float limitation described at https://github.com/apache/spark/pull/25110 . See https://github.com/apache/spark/pull/25110#discussion_r302735905

    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6584/testReport/org.apache.spark.sql/SQLQueryTestSuite/udf_pgSQL_udf_aggregates_part1_sql___Regular_Python_UDF/

    ```
    Expected "7000[6] 1", but got "7000[5] 1"
    Result did not match for query #33
    SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
    FROM (VALUES (70005), (70007)) v(x)
    ```

    Here's what's going on: https://github.com/apache/spark/pull/25110#discussion_r302791930

    ```
    scala> Seq("70004.999", "70006.999").toDF().selectExpr("CAST(avg(value) AS long)").show()
    +--------------------------+
    |CAST(avg(value) AS BIGINT)|
    +--------------------------+
    |                     70005|
    +--------------------------+
    ```

    Therefore, this PR simply avoids the cast in that specific test. This is a temporary fix; we need a more robust way to avoid such cases.

    ## How was this patch tested?

    It passes with Maven in my local before/after this PR. I believe the problem is similarly related to the Python or OS installed on the machine. I should test this against the PR builder with `test-maven` to be sure.

    Closes #25128 from HyukjinKwon/SPARK-28270-2.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 .../resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql | 2 +-
 .../sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql
index 5b97d3d..3e87733 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql
@@ -78,7 +78,7 @@ FROM (VALUES ('-Infinity'), ('Infinity')) v(x);
 -- test accuracy with a large input offset
 SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS int), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
 FROM (VALUES (10003), (10004), (10006), (10007)) v(x);
-SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
+SELECT CAST(avg(udf(x)) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
 FROM (VALUES (70005), (70007)) v(x);

 -- SQL2003 binary aggregates [SPARK-23907]
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 98e04b4..5c08245 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
@@ -271,10 +271,10 @@ struct
+struct
 -- !query 33 output
 70006 1
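The precision effect behind the flake can be reproduced with plain Python: averaging values that were rounded through single-precision floats shifts the truncated result by one. A sketch of the failure mode (not Spark's exact code path):

```python
import struct

def round_trip_float32(x):
    # Round a Python double through a 4-byte IEEE-754 single-precision float
    return struct.unpack('f', struct.pack('f', x))[0]

vals = [70004.999, 70006.999]

avg_double = sum(vals) / len(vals)                                      # ~70005.999
avg_via_float32 = sum(round_trip_float32(v) for v in vals) / len(vals)  # 70006.0

print(int(avg_double))       # 70005
print(int(avg_via_float32))  # 70006 -- the "Expected 70006, got 70005" mismatch
```

Both 70004.999 and 70006.999 round up to whole numbers in single precision (the float32 spacing near 70000 is 2^-7 ≈ 0.0078), so the single-precision average lands exactly on 70006 while the double-precision average truncates to 70005.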
[spark] branch master updated: [SPARK-28378][PYTHON] Remove usage of cgi.escape
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 707411f  [SPARK-28378][PYTHON] Remove usage of cgi.escape
707411f is described below

commit 707411f4793b7d45e19bc5f368e1ec9fc9827737
Author: Liang-Chi Hsieh
AuthorDate: Sun Jul 14 15:26:00 2019 +0900

    [SPARK-28378][PYTHON] Remove usage of cgi.escape

    ## What changes were proposed in this pull request?

    `cgi.escape` is deprecated [1], and removed at 3.8 [2]. We had better replace it.

    [1] https://docs.python.org/3/library/cgi.html#cgi.escape
    [2] https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals

    ## How was this patch tested?

    Existing tests.

    Closes #25142 from viirya/remove-cgi-escape.

    Authored-by: Liang-Chi Hsieh
    Signed-off-by: HyukjinKwon
---
 python/pyspark/sql/dataframe.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index f8be8ee..3984712 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -22,8 +22,10 @@ if sys.version >= '3':
     basestring = unicode = str
     long = int
     from functools import reduce
+    from html import escape as html_escape
 else:
     from itertools import imap as map
+    from cgi import escape as html_escape

 import warnings

@@ -375,7 +377,6 @@ class DataFrame(object):
         by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are
         using support eager evaluation with HTML.
         """
-        import cgi
         if not self._support_repr_html:
             self._support_repr_html = True
         if self.sql_ctx._conf.isReplEagerEvalEnabled():
@@ -390,11 +391,11 @@ class DataFrame(object):
             html = "\n"
             # generate table head
-            html += "%s\n" % "".join(map(lambda x: cgi.escape(x), head))
+            html += "%s\n" % "".join(map(lambda x: html_escape(x), head))
             # generate table rows
             for row in row_data:
                 html += "%s\n" % "".join(
-                    map(lambda x: cgi.escape(x), row))
+                    map(lambda x: html_escape(x), row))
             html += "\n"
             if has_more_data:
                 html += "only showing top %d %s\n" % (
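For the HTML-table use case above, the replacement behaves the same on the characters that matter; the one known difference is that `html.escape` escapes quotes by default, while `cgi.escape` only did so when `quote=True` was passed. A quick check on Python 3:

```python
from html import escape as html_escape  # Python 3 replacement for cgi.escape

print(html_escape("<td>a & b</td>"))
# &lt;td&gt;a &amp; b&lt;/td&gt;

print(html_escape('say "hi"'))
# say &quot;hi&quot;  (cgi.escape left quotes alone unless quote=True was passed)
```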
[spark] branch branch-2.4 updated: [SPARK-28378][PYTHON] Remove usage of cgi.escape
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 35d5886 [SPARK-28378][PYTHON] Remove usage of cgi.escape 35d5886 is described below commit 35d5886dd4b7335f59fa92fa9993b578cfbf67d6 Author: Liang-Chi Hsieh AuthorDate: Sun Jul 14 15:26:00 2019 +0900 [SPARK-28378][PYTHON] Remove usage of cgi.escape ## What changes were proposed in this pull request? `cgi.escape` is deprecated [1], and removed at 3.8 [2]. We better to replace it. [1] https://docs.python.org/3/library/cgi.html#cgi.escape. [2] https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals ## How was this patch tested? Existing tests. Closes #25142 from viirya/remove-cgi-escape. Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- python/pyspark/sql/dataframe.py | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index 1affc9b..45b90b5 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -22,8 +22,10 @@ if sys.version >= '3': basestring = unicode = str long = int from functools import reduce +from html import escape as html_escape else: from itertools import imap as map +from cgi import escape as html_escape import warnings @@ -393,7 +395,6 @@ class DataFrame(object): by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are using support eager evaluation with HTML. 
""" -import cgi if not self._support_repr_html: self._support_repr_html = True if self.sql_ctx._conf.isReplEagerEvalEnabled(): @@ -408,11 +409,11 @@ class DataFrame(object): html = "\n" # generate table head -html += "%s\n" % "".join(map(lambda x: cgi.escape(x), head)) +html += "%s\n" % "".join(map(lambda x: html_escape(x), head)) # generate table rows for row in row_data: html += "%s\n" % "".join( -map(lambda x: cgi.escape(x), row)) +map(lambda x: html_escape(x), row)) html += "\n" if has_more_data: html += "only showing top %d %s\n" % ( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d41bd7c [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator d41bd7c is described below commit d41bd7c8910a960dec5b8605f1cb5d607a4ce958 Author: Wenchen Fan AuthorDate: Tue Jul 9 11:52:12 2019 +0900 [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator ## What changes were proposed in this pull request? This PR reverts the partial bug fix in `ShuffleBlockFetcherIterator` which was introduced by https://github.com/apache/spark/pull/23638 . The reasons: 1. It's a potential bug. After fixing `PipelinedRDD` in #23638 , the original problem was resolved. 2. The fix is incomplete according to [the discussion](https://github.com/apache/spark/pull/23638#discussion_r251869084) We should fix the potential bug completely later. ## How was this patch tested? existing tests Closes #25049 from cloud-fan/revert. Authored-by: Wenchen Fan Signed-off-by: HyukjinKwon --- .../storage/ShuffleBlockFetcherIterator.scala | 16 ++ .../storage/ShuffleBlockFetcherIteratorSuite.scala | 58 -- 2 files changed, 3 insertions(+), 71 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala b/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala index a4d91a7..a283757 100644 --- a/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala +++ b/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala @@ -142,14 +142,7 @@ final class ShuffleBlockFetcherIterator( /** * Whether the iterator is still active. If isZombie is true, the callback interface will no - * longer place fetched blocks into [[results]] and the iterator is marked as fully consumed. 
- * - * When the iterator is inactive, [[hasNext]] and [[next]] calls will honor that as there are - * cases the iterator is still being consumed. For example, ShuffledRDD + PipedRDD if the - * subprocess command is failed. The task will be marked as failed, then the iterator will be - * cleaned up at task completion, the [[next]] call (called in the stdin writer thread of - * PipedRDD if not exited yet) may hang at [[results.take]]. The defensive check in [[hasNext]] - * and [[next]] reduces the possibility of such race conditions. + * longer place fetched blocks into [[results]]. */ @GuardedBy("this") private[this] var isZombie = false @@ -388,7 +381,7 @@ final class ShuffleBlockFetcherIterator( logDebug(s"Got local blocks in ${Utils.getUsedTimeNs(startTimeNs)}") } - override def hasNext: Boolean = !isZombie && (numBlocksProcessed < numBlocksToFetch) + override def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch /** * Fetches the next (BlockId, InputStream). If a task fails, the ManagedBuffers @@ -412,7 +405,7 @@ final class ShuffleBlockFetcherIterator( // then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch // is also corrupt, so the previous stage could be retried. // For local shuffle block, throw FailureFetchResult for the first IOException. -while (!isZombie && result == null) { +while (result == null) { val startFetchWait = System.nanoTime() result = results.take() val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startFetchWait) @@ -505,9 +498,6 @@ final class ShuffleBlockFetcherIterator( fetchUpToMaxBytes() } -if (result == null) { // the iterator is already closed/cleaned up. 
- throw new NoSuchElementException() -} currentResult = result.asInstanceOf[SuccessFetchResult] (currentResult.blockId, new BufferReleasingInputStream( diff --git a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala index 3ab2f0b..ed40244 100644 --- a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala +++ b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala @@ -215,64 +215,6 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT verify(blocks(ShuffleBlockId(0, 2, 0)), times(0)).release() } - test("iterator is all consumed if task completes early") { -val blockManager = mock(classOf[BlockManager]) -val localBmId = BlockManagerId("test-client", "test-client", 1) -doReturn(localBmId).when(blockManager).blockManagerId - -// Make su
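For context, the reverted defensive check guarded a blocking-queue consumer with a "zombie" flag so a cleanup from another thread could not leave a consumer hanging on `results.take()`. A minimal Python sketch of that pattern (illustrative only; names do not match Spark's internals):

```python
import queue
import threading

class FetchIterator:
    """A consumer over a blocking queue that can be shut down from another thread."""

    def __init__(self, num_expected):
        self.results = queue.Queue()
        self.num_processed = 0
        self.num_expected = num_expected
        self.is_zombie = False
        self.lock = threading.Lock()

    def cleanup(self):
        # Called on task completion; wake any blocked consumer so it cannot
        # hang forever on results.get() -- the race the reverted fix targeted.
        with self.lock:
            self.is_zombie = True
        self.results.put(None)  # sentinel to unblock a waiting get()

    def has_next(self):
        with self.lock:
            return not self.is_zombie and self.num_processed < self.num_expected

    def next(self):
        item = self.results.get()  # blocks like results.take()
        if item is None:
            raise StopIteration("iterator was cleaned up")
        self.num_processed += 1
        return item
```

The revert removed this guarding from Spark because the root cause was fixed in `PipelinedRDD` (#23638) and the partial guard was itself considered a potential bug.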
[spark] branch master updated: [SPARK-28309][R][INFRA] Fix AppVeyor to run SparkR tests by avoiding to use devtools for testthat
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f15102b  [SPARK-28309][R][INFRA] Fix AppVeyor to run SparkR tests by avoiding to use devtools for testthat
f15102b is described below

commit f15102b1702b64a54233ae31357e32335722f4e5
Author: HyukjinKwon
AuthorDate: Tue Jul 9 12:06:46 2019 +0900

    [SPARK-28309][R][INFRA] Fix AppVeyor to run SparkR tests by avoiding to use devtools for testthat

    ## What changes were proposed in this pull request?

    It looks like `devtools` 2.1.0 was released, and our AppVeyor build now uses the latest one. The problem is that it added `testthat` 2.1.1+ as a dependency - https://github.com/r-lib/devtools/blob/master/DESCRIPTION#L35

    Usually the old version should be removed and reinstalled properly when we install other packages; however, this seems to be failing in AppVeyor due to the previous installation, for an unknown reason.

    ```
    [00:01:41] > devtools::install_version('testthat', version = '1.0.2', repos='https://cloud.r-project.org/')
    [00:01:44] Downloading package from url: https://cloud.r-project.org//src/contrib/Archive/testthat/testthat_1.0.2.tar.gz
    ...
    [00:02:25] WARNING: moving package to final location failed, copying instead
    [00:02:25] Warning in file.copy(instdir, dirname(final_instdir), recursive = TRUE, :
    [00:02:25]   problem copying c:\RLibrary\00LOCK-testthat\00new\testthat\libs\i386\testthat.dll to c:\RLibrary\testthat\libs\i386\testthat.dll: Permission denied
    [00:02:25] ** testing if installed package can be loaded from final location
    [00:02:25] *** arch - i386
    [00:02:26] Error: package or namespace load failed for 'testthat' in FUN(X[[i]], ...):
    [00:02:26]  no such symbol find_label_ in package c:/RLibrary/testthat/libs/i386/testthat.dll
    [00:02:26] Error: loading failed
    [00:02:26] Execution halted
    [00:02:26] *** arch - x64
    [00:02:26] ERROR: loading failed for 'i386'
    [00:02:26] * removing 'c:/RLibrary/testthat'
    [00:02:26] * restoring previous 'c:/RLibrary/testthat'
    [00:02:26] Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
    [00:02:26]   problem copying c:\RLibrary\00LOCK-testthat\testthat\libs\i386\testthat.dll to c:\RLibrary\testthat\libs\i386\testthat.dll: Permission denied
    [00:02:26] Warning message:
    [00:02:26] In i.p(...) :
    [00:02:26]   installation of package 'C:/Users/appveyor/AppData/Local/Temp/1/RtmpIx25hi/remotes5743d4a9b1/testthat' had non-zero exit status
    ```

    See https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/25818746

    Our SparkR testbed requires `testthat` 1.0.2 at most for its current status, and `devtools` was installed at SPARK-22817 to pin the `testthat` version to 1.0.2.

    Therefore, this PR works around the current issue by installing directly from the archive instead, without using `devtools`:

    ```R
    R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')"
    ```

    ## How was this patch tested?

    AppVeyor will test.

    Closes #25081 from HyukjinKwon/SPARK-28309.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 appveyor.yml | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/appveyor.yml b/appveyor.yml
index bdb948b..8fb090c 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -42,9 +42,12 @@ install:
   # Install maven and dependencies
   - ps: .\dev\appveyor-install-dependencies.ps1
   # Required package for R unit tests
-  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"
+  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"
   # Here, we use the fixed version of testthat. For more details, please see SPARK-22817.
-  - cmd: R -e "devtools::install_version('testthat', version = '1.0.2', repos='https://cloud.r-project.org/')"
+  # As of devtools 2.1.0, it requires testthat higher then 2.1.1 as a dependency. SparkR test requires testthat 1.0.2.
+  # Therefore, we don't use devtools but installs it directly from the archive including its dependencies.
+  - cmd: R -e "install.packages(c('crayon', 'praise', 'R6'), repos='https://cloud.r-project.org/')"
+  - cmd: R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')"
   - cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')"

 build_script:
---
[spark] branch master updated (d41bd7c -> 75ea02b)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d41bd7c  [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator
 add 75ea02b  [SPARK-28250][SQL] QueryPlan#references should exclude producedAttributes

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/plans/QueryPlan.scala | 21 ++---
 .../plans/logical/basicLogicalOperators.scala       |  3 +--
 .../plans/logical/pythonLogicalOperators.scala      |  3 +--
 .../apache/spark/sql/execution/GenerateExec.scala   |  2 +-
 .../sql/execution/columnar/InMemoryRelation.scala   |  2 --
 .../org/apache/spark/sql/execution/objects.scala    |  2 --
 .../spark/sql/execution/python/EvalPythonExec.scala |  3 +-
 7 files changed, 14 insertions(+), 22 deletions(-)
[spark] branch branch-2.4 updated: [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 55f92a3  [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
55f92a3 is described below

commit 55f92a31d7c1a6f02a9b0fc2ace6c5a5e0871ec4
Author: wuyi
AuthorDate: Tue Jul 9 15:49:31 2019 +0900

    [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows

    ## What changes were proposed in this pull request?

    When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show "The process cannot access the file because it is being used by another process" and remain in LOST state at the end.

    The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala).

    After digging into the code, I found that Windows cmd `%RANDOM%` returns the same number if we call it instantly (e.g. < 500ms) after the last call. As a result, SparkLauncher would get the same output file (spark-class-launcher-output-%RANDOM%.txt) for different apps. The following app would then hit the issue when it tries to write the same file which has already been opened for writing by another app.

    We should make sure to generate a unique output file for SparkLauncher on Windows to avoid this issue.

    ## How was this patch tested?

    Tested manually on Windows.

    Closes #25076 from Ngone51/SPARK-28302.

    Authored-by: wuyi
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 925f620570a022ff8229bfde076e7dde6bf242df)
    Signed-off-by: HyukjinKwon
---
 bin/spark-class2.cmd | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index 5da7d7a..34d04c9 100644
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -63,7 +63,12 @@ if not "x%JAVA_HOME%"=="x" (
 rem The launcher library prints the command to be executed in a single line suitable for being
 rem executed by the batch interpreter. So read all the output of the launcher into a variable.
+:gen
 set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
+rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call,
+rem so we should make it sure to generate unique file to avoid process collision of writing into
+rem the same file concurrently.
+if exist %LAUNCHER_OUTPUT% goto :gen
 "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%
 for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT%) do (
   set SPARK_CMD=%%i
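The batch-file fix retries name generation until an unused path is found. The same idea in a Python sketch (cmd's `%RANDOM%` only yields values in 0-32767, hence the narrow namespace and the collisions under concurrency):

```python
import os
import random

def unique_launcher_output(tmpdir):
    # Keep regenerating the pseudo-random suffix until the path is unused,
    # mirroring the `:gen` / `if exist ... goto :gen` loop in spark-class2.cmd.
    while True:
        path = os.path.join(
            tmpdir, "spark-class-launcher-output-%d.txt" % random.randrange(32768))
        if not os.path.exists(path):
            return path
```

Note that this check-then-create pattern still has a small race window between the existence check and the subsequent open; Python's `tempfile.mkstemp` closes that window by creating atomically, but the batch interpreter has no comparable primitive, so the retry loop is the practical fix there.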
[spark] branch master updated: [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 925f620 [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows 925f620 is described below commit 925f620570a022ff8229bfde076e7dde6bf242df Author: wuyi AuthorDate: Tue Jul 9 15:49:31 2019 +0900 [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows ## What changes were proposed in this pull request? When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show that "The process cannot access the file because it is being used by another process" and remains in LOST state at the end. The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala). After digging into the code, I find that, Windows cmd `%RANDOM%` would return the same number if we call it instantly(e.g. < 500ms) after last call. As a result, SparkLauncher would get same output file(spark-class-launcher-output-%RANDOM%.txt) for apps. Then, the following app would hit the issue when it tries to write the same file which has already been opened for writing by another app. We should make sure to generate unique output file for SparkLauncher on Windows to avoid this issue. ## How was this patch tested? Tested manually on Windows. Closes #25076 from Ngone51/SPARK-28302. 
Authored-by: wuyi Signed-off-by: HyukjinKwon --- bin/spark-class2.cmd | 5 + 1 file changed, 5 insertions(+) diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd index 5da7d7a..34d04c9 100644 --- a/bin/spark-class2.cmd +++ b/bin/spark-class2.cmd @@ -63,7 +63,12 @@ if not "x%JAVA_HOME%"=="x" ( rem The launcher library prints the command to be executed in a single line suitable for being rem executed by the batch interpreter. So read all the output of the launcher into a variable. +:gen set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt +rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call, +rem so we should make it sure to generate unique file to avoid process collision of writing into +rem the same file concurrently. +if exist %LAUNCHER_OUTPUT% goto :gen "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT% for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT%) do ( set SPARK_CMD=%%i - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated: [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new 74caacf [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows 74caacf is described below commit 74caacf5c002c9ccec9a8f8b6f20cb3887469176 Author: wuyi AuthorDate: Tue Jul 9 15:49:31 2019 +0900 [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows ## What changes were proposed in this pull request? When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show that "The process cannot access the file because it is being used by another process" and remains in LOST state at the end. The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala). After digging into the code, I find that, Windows cmd `%RANDOM%` would return the same number if we call it instantly(e.g. < 500ms) after last call. As a result, SparkLauncher would get same output file(spark-class-launcher-output-%RANDOM%.txt) for apps. Then, the following app would hit the issue when it tries to write the same file which has already been opened for writing by another app. We should make sure to generate unique output file for SparkLauncher on Windows to avoid this issue. ## How was this patch tested? Tested manually on Windows. Closes #25076 from Ngone51/SPARK-28302. 
Authored-by: wuyi Signed-off-by: HyukjinKwon (cherry picked from commit 925f620570a022ff8229bfde076e7dde6bf242df) Signed-off-by: HyukjinKwon --- bin/spark-class2.cmd | 5 + 1 file changed, 5 insertions(+) diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd index 5da7d7a..34d04c9 100644 --- a/bin/spark-class2.cmd +++ b/bin/spark-class2.cmd @@ -63,7 +63,12 @@ if not "x%JAVA_HOME%"=="x" ( rem The launcher library prints the command to be executed in a single line suitable for being rem executed by the batch interpreter. So read all the output of the launcher into a variable. +:gen set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt +rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call, +rem so we should make it sure to generate unique file to avoid process collision of writing into +rem the same file concurrently. +if exist %LAUNCHER_OUTPUT% goto :gen "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT% for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT%) do ( set SPARK_CMD=%%i - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28233][BUILD] Upgrade maven-jar-plugin and maven-source-plugin
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 90bd017 [SPARK-28233][BUILD] Upgrade maven-jar-plugin and maven-source-plugin 90bd017 is described below commit 90bd017c107382b7accdadfb52b02df98e8ea7bb Author: Yuming Wang AuthorDate: Wed Jul 3 21:46:11 2019 +0900 [SPARK-28233][BUILD] Upgrade maven-jar-plugin and maven-source-plugin ## What changes were proposed in this pull request? This pr upgrade `maven-jar-plugin` to 3.1.2 and `maven-source-plugin` to 3.1.0 to avoid: - [MJAR-259](https://issues.apache.org/jira/browse/MJAR-259) – Archiving to jar is very slow - [MSOURCES-119](https://issues.apache.org/jira/browse/MSOURCES-119) – Archiving to jar is very slow Release notes: https://blogs.apache.org/maven/entry/apache-maven-source-plugin-version https://blogs.apache.org/maven/entry/apache-maven-jar-plugin-version2 https://blogs.apache.org/maven/entry/apache-maven-jar-plugin-version1 ## How was this patch tested? N/A Closes #25031 from wangyum/SPARK-28233. Authored-by: Yuming Wang Signed-off-by: HyukjinKwon --- pom.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pom.xml b/pom.xml index 087c2ca..88a3954 100644 --- a/pom.xml +++ b/pom.xml @@ -2431,7 +2431,7 @@ org.apache.maven.plugins maven-jar-plugin - 3.1.0 + 3.1.2 org.apache.maven.plugins @@ -2441,7 +2441,7 @@ org.apache.maven.plugins maven-source-plugin - 3.0.1 + 3.1.0 true @@ -2600,7 +2600,7 @@ org.apache.maven.plugins maven-jar-plugin -3.1.0 +3.1.2 test-jar - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 591de42 [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30 591de42 is described below commit 591de423517de379b0900fbf5751b48492d15729 Author: Liang-Chi Hsieh AuthorDate: Mon Jul 15 12:29:58 2019 +0900 [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30 ## What changes were proposed in this pull request? This upgrades Pyrolite to a newer version. Most updates [1] in the newer version are for dotnet. For Java, it includes a bug fix to Unpickler regarding cleaning up the Unpickler memo, and support for protocol 5. After upgrading, we can remove the fix from SPARK-27629 for the bug in Unpickler. [1] https://github.com/irmen/Pyrolite/compare/pyrolite-4.23...master ## How was this patch tested? Manually tested locally on Python 3.6 with existing tests. Closes #25143 from viirya/upgrade-pyrolite. 
Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- core/pom.xml | 2 +- core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala | 3 --- dev/deps/spark-deps-hadoop-2.7| 2 +- dev/deps/spark-deps-hadoop-3.2| 2 +- .../main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 4 python/pyspark/sql/tests/test_serde.py| 4 .../org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala | 4 7 files changed, 3 insertions(+), 18 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 8a872de..4446dbd 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -378,7 +378,7 @@ net.razorvine pyrolite - 4.23 + 4.30 net.razorvine diff --git a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala index 9462dfd..01e64b6 100644 --- a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala +++ b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala @@ -186,9 +186,6 @@ private[spark] object SerDeUtil extends Logging { val unpickle = new Unpickler iter.flatMap { row => val obj = unpickle.loads(row) -// `Opcodes.MEMOIZE` of Protocol 4 (Python 3.4+) will store objects in internal map -// of `Unpickler`. This map is cleared when calling `Unpickler.close()`. 
-unpickle.close() if (batched) { obj match { case array: Array[Any] => array.toSeq diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 2f660cc..79158bb 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -170,7 +170,7 @@ parquet-hadoop-bundle-1.6.0.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.23.jar +pyrolite-4.30.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar scala-parser-combinators_2.12-1.1.0.jar diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2 index e1e114f..5e03a59 100644 --- a/dev/deps/spark-deps-hadoop-3.2 +++ b/dev/deps/spark-deps-hadoop-3.2 @@ -189,7 +189,7 @@ parquet-hadoop-1.10.1.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.23.jar +pyrolite-4.30.jar re2j-1.1.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 4c478a5..4617073 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1357,10 +1357,6 @@ private[spark] abstract class SerDeBase { val unpickle = new Unpickler iter.flatMap { row => val obj = unpickle.loads(row) -// `Opcodes.MEMOIZE` of Protocol 4 (Python 3.4+) will store objects in internal map -// of `Unpickler`. This map is cleared when calling `Unpickler.close()`. Pyrolite -// doesn't clear it up, so we manually clear it. 
-unpickle.close() if (batched) { obj match { case list: JArrayList[_] => list.asScala diff --git a/python/pyspark/sql/tests/test_serde.py b/python/pyspark/sql/tests/test_serde.py index f9bed76..ea2a686 100644 --- a/python/pyspark/sql/tests/test_serde.py +++ b/python/pyspark/sql/tests/test_serde.py @@ -128,10 +128,6 @@ class SerdeTests(ReusedSQLTestCase): def test_int_array_serialization(self): # Note that this test seems dependent on parallelism. -
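The `unpickle.close()` calls removed above existed because protocol 4's `MEMOIZE` opcode makes the unpickler cache every object in an internal memo map, which Pyrolite 4.23 never cleared; 4.30 handles that itself. CPython's own `pickle` module shows the memo mechanism (this is an analogy to, not the code of, Pyrolite's Java `Unpickler`):

```python
import pickle
import pickletools

# Protocol 4 (Python 3.4+) emits a MEMOIZE opcode after each object so a
# later back-reference can reuse it; the unpickler keeps those objects in
# an internal memo map until that map is cleared.
payload = "x" * 100
data = pickle.dumps([payload, payload], protocol=4)

# The shared string is serialized once; the second list element is just a
# memo back-reference, so the stream is much smaller than two full copies.
assert len(data) < 2 * len(payload)

memoize_count = sum(
    1 for opcode, _, _ in pickletools.genops(data) if opcode.name == "MEMOIZE")
print(memoize_count)
```

Every memoized object stays referenced by the unpickler until the memo is cleared, which is exactly the leak SPARK-27629 worked around.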
[spark] branch master updated: [SPARK-28392][SQL][TESTS] Add traits for UDF and PostgreSQL tests to share initialization
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a7a02a8 [SPARK-28392][SQL][TESTS] Add traits for UDF and PostgreSQL tests to share initialization a7a02a8 is described below commit a7a02a86adafd3808051d843cf7e70176a7c4099 Author: HyukjinKwon AuthorDate: Mon Jul 15 16:20:09 2019 +0900 [SPARK-28392][SQL][TESTS] Add traits for UDF and PostgreSQL tests to share initialization ## What changes were proposed in this pull request? This PR adds some traits so that we can deduplicate initialization stuff for each type of test case. For instance, see [SPARK-28343](https://issues.apache.org/jira/browse/SPARK-28343). It's a little bit overkill but I think it will make adding test cases easier and cause less confusions. This PR adds both: ``` private trait PgSQLTest private trait UDFTest ``` To indicate and share the logics related to each combination of test types. ## How was this patch tested? Manually tested. Closes #25155 from HyukjinKwon/SPARK-28392. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../sql-tests/inputs/udf/pgSQL/udf-case.sql| 5 - .../sql-tests/results/udf/pgSQL/udf-case.sql.out | 190 ++--- .../org/apache/spark/sql/SQLQueryTestSuite.scala | 56 -- 3 files changed, 129 insertions(+), 122 deletions(-) diff --git a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql index b05c21d..a2aab79 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql @@ -6,14 +6,10 @@ -- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/case.sql -- Test the CASE statement -- --- This test suite contains two Cartesian products without using explicit CROSS JOIN syntax. 
--- Thus, we set spark.sql.crossJoin.enabled to true. - -- This test file was converted from pgSQL/case.sql. -- Note that currently registered UDF returns a string. So there are some differences, for instance -- in string cast within UDF in Scala and Python. -set spark.sql.crossJoin.enabled=true; CREATE TABLE CASE_TBL ( i integer, f double @@ -269,4 +265,3 @@ SELECT CASE DROP TABLE CASE_TBL; DROP TABLE CASE2_TBL; -set spark.sql.crossJoin.enabled=false; diff --git a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out index 55bef64..6bb7a78 100644 --- a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out @@ -1,19 +1,22 @@ -- Automatically generated by SQLQueryTestSuite --- Number of queries: 37 +-- Number of queries: 35 -- !query 0 -set spark.sql.crossJoin.enabled=true +CREATE TABLE CASE_TBL ( + i integer, + f double +) USING parquet -- !query 0 schema -struct +struct<> -- !query 0 output -spark.sql.crossJoin.enabledtrue + -- !query 1 -CREATE TABLE CASE_TBL ( +CREATE TABLE CASE2_TBL ( i integer, - f double + j integer ) USING parquet -- !query 1 schema struct<> @@ -22,10 +25,7 @@ struct<> -- !query 2 -CREATE TABLE CASE2_TBL ( - i integer, - j integer -) USING parquet +INSERT INTO CASE_TBL VALUES (1, 10.1) -- !query 2 schema struct<> -- !query 2 output @@ -33,7 +33,7 @@ struct<> -- !query 3 -INSERT INTO CASE_TBL VALUES (1, 10.1) +INSERT INTO CASE_TBL VALUES (2, 20.2) -- !query 3 schema struct<> -- !query 3 output @@ -41,7 +41,7 @@ struct<> -- !query 4 -INSERT INTO CASE_TBL VALUES (2, 20.2) +INSERT INTO CASE_TBL VALUES (3, -30.3) -- !query 4 schema struct<> -- !query 4 output @@ -49,7 +49,7 @@ struct<> -- !query 5 -INSERT INTO CASE_TBL VALUES (3, -30.3) +INSERT INTO CASE_TBL VALUES (4, NULL) -- !query 5 schema struct<> -- !query 5 output @@ -57,7 +57,7 @@ struct<> -- !query 6 
-INSERT INTO CASE_TBL VALUES (4, NULL) +INSERT INTO CASE2_TBL VALUES (1, -1) -- !query 6 schema struct<> -- !query 6 output @@ -65,7 +65,7 @@ struct<> -- !query 7 -INSERT INTO CASE2_TBL VALUES (1, -1) +INSERT INTO CASE2_TBL VALUES (2, -2) -- !query 7 schema struct<> -- !query 7 output @@ -73,7 +73,7 @@ struct<> -- !query 8 -INSERT INTO CASE2_TBL VALUES (2, -2) +INSERT INTO CASE2_TBL VALUES (3, -3) -- !query 8 schema struct<> -- !query 8 output @@ -81,7 +81,7 @@ struct<> -- !query 9 -INSERT INTO CASE2_TBL VALUES (3, -3) +INSERT INTO CASE2_TBL VALUES (2, -4) -- !query 9 schema struct<> -- !query 9 outp
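The idea behind the `PgSQLTest` and `UDFTest` traits above has a close analogue in Python's `unittest` mixins: each mixin owns one flavor of per-test setup, and a concrete suite composes them so combinations share initialization instead of duplicating it. All class and attribute names below are illustrative, not Spark's:

```python
import unittest

class PgSQLTestMixin:
    # Owns PostgreSQL-dialect setup, e.g. enabling cross joins for test
    # files that contain Cartesian products.
    def setUp(self):
        super().setUp()
        self.conf = {"spark.sql.crossJoin.enabled": "true"}

class UDFTestMixin:
    # Owns UDF registration; a stand-in list instead of a real registry.
    def setUp(self):
        super().setUp()
        self.registered_udfs = ["test_udf"]

class UdfPgSQLSuite(UDFTestMixin, PgSQLTestMixin, unittest.TestCase):
    # Composes both setups via cooperative super().setUp() calls.
    def test_shared_initialization(self):
        self.assertEqual(self.conf["spark.sql.crossJoin.enabled"], "true")
        self.assertIn("test_udf", self.registered_udfs)

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(UdfPgSQLSuite))
print(result.wasSuccessful())
```

Each mixin calls `super().setUp()` first, so the MRO threads every flavor's initialization together, which is the same deduplication the Scala traits achieve in `SQLQueryTestSuite`.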
[spark] branch master updated: [SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4cb1cd6 [SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex" 4cb1cd6 is described below commit 4cb1cd6ab7b7bffd045a786b0ddb7c7783afdf46 Author: shivusondur AuthorDate: Sun Apr 21 16:56:12 2019 +0900 [SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex" ## What changes were proposed in this pull request? Corrected the default value in the Documentation for "spark.redaction.regex" ## How was this patch tested? NA Closes #24428 from shivusondur/doc2. Authored-by: shivusondur Signed-off-by: HyukjinKwon --- docs/configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 5325f8a..2cb1a5f 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -472,7 +472,7 @@ Apart from these, the following properties are also available, and may be useful spark.redaction.regex - (?i)secret|password + (?i)secret|password|token Regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches a property key or - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
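The documented default above is applied against configuration *keys*: any property whose key matches the regex has its value masked before it reaches the UI or logs. A minimal sketch of that behavior (the `redact` helper and the `*********(redacted)` replacement string here are illustrative; Spark's implementation lives in `org.apache.spark.util.Utils`):

```python
import re

# Default from the corrected docs; case-insensitive substring match.
REDACTION_REGEX = re.compile(r"(?i)secret|password|token")

def redact(conf):
    """Mask values whose key matches the redaction regex."""
    return {k: ("*********(redacted)" if REDACTION_REGEX.search(k) else v)
            for k, v in conf.items()}

conf = {
    "spark.hadoop.fs.s3a.access.token": "abc123",
    "spark.executor.memory": "4g",
    "MY_PASSWORD": "hunter2",
}
print(redact(conf))
```

Because the match is a search, not a full-string match, `token` anywhere in a key (as in the S3A example) triggers redaction, which is precisely why the default gained `|token`.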
[spark] branch master updated: [SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d61b3bc [SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types d61b3bc is described below commit d61b3bc875941d5a815e0e68fe7aa986e372b4e8 Author: Maxim Gekk AuthorDate: Sun Apr 21 16:53:11 2019 +0900 [SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types ## What changes were proposed in this pull request? In the PR, I propose a more precise description of `TimestampType` and `DateType`, and how they store timestamps and dates internally. Closes #24424 from MaxGekk/timestamp-date-type-doc. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- .../scala/org/apache/spark/sql/types/DateType.scala | 20 .../org/apache/spark/sql/types/TimestampType.scala | 19 ++- 2 files changed, 26 insertions(+), 13 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala index 7491014..ba322fa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala @@ -23,19 +23,18 @@ import scala.reflect.runtime.universe.typeTag import org.apache.spark.annotation.Stable /** - * A date type, supporting "0001-01-01" through "9999-12-31". - * - * Please use the singleton `DataTypes.DateType`. - * - * Internally, this is represented as the number of days from 1970-01-01. + * The date type represents a valid date in the proleptic Gregorian calendar. + * Valid range is [0001-01-01, 9999-12-31]. * + * Please use the singleton `DataTypes.DateType` to refer the type. 
* @since 1.3.0 */ @Stable class DateType private() extends AtomicType { - // The companion object and this class is separated so the companion object also subclasses - // this type. Otherwise, the companion object would be of type "DateType$" in byte code. - // Defined with a private constructor so the companion object is the only possible instantiation. + /** + * Internally, a date is stored as a simple incrementing count of days + * where day 0 is 1970-01-01. Negative numbers represent earlier days. + */ private[sql] type InternalType = Int @transient private[sql] lazy val tag = typeTag[InternalType] @@ -51,6 +50,11 @@ class DateType private() extends AtomicType { } /** + * The companion case object and the DateType class is separated so the companion object + * also subclasses the class. Otherwise, the companion object would be of type "DateType$" + * in byte code. The DateType class is defined with a private constructor so its companion + * object is the only possible instantiation. + * * @since 1.3.0 */ @Stable diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala index a20f155..8dbe4dd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala @@ -23,16 +23,20 @@ import scala.reflect.runtime.universe.typeTag import org.apache.spark.annotation.Stable /** - * The data type representing `java.sql.Timestamp` values. - * Please use the singleton `DataTypes.TimestampType`. + * The timestamp type represents a time instant in microsecond precision. + * Valid range is [0001-01-01T00:00:00.000000Z, 9999-12-31T23:59:59.999999Z] where + * the left/right-bound is a date and time of the proleptic Gregorian + * calendar in UTC+00:00. + * + * Please use the singleton `DataTypes.TimestampType` to refer the type. 
* @since 1.3.0 */ @Stable class TimestampType private() extends AtomicType { - // The companion object and this class is separated so the companion object also subclasses - // this type. Otherwise, the companion object would be of type "TimestampType$" in byte code. - // Defined with a private constructor so the companion object is the only possible instantiation. + /** + * Internally, a timestamp is stored as the number of microseconds from + * the epoch of 1970-01-01T00:00:00.000000Z (UTC+00:00) + */ private[sql] type InternalType = Long @transient private[sql] lazy val tag = typeTag[InternalType] @@ -48,6 +52,11 @@ class TimestampType private() extends AtomicType { } /** + * The companion case object and its class is separated so the companion object also subclasses + * the TimestampType class. Otherwise, the companion object would be of type "TimestampType$" + * in byte code. Defined with a private constructor so the companion object i
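The internal representations documented in SPARK-27527 are easy to reproduce: a `DateType` value is a signed count of days since 1970-01-01, and a `TimestampType` value is a signed count of microseconds since 1970-01-01T00:00:00Z. A small sketch (helper names are illustrative, not Spark APIs):

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_days(d):
    # DateType internal Int: days since 1970-01-01, negative for earlier dates.
    return (d - EPOCH_DATE).days

def to_epoch_micros(ts):
    # TimestampType internal Long: microseconds since the epoch in UTC.
    delta = ts - EPOCH_TS
    return (delta.days * 86_400_000_000
            + delta.seconds * 1_000_000
            + delta.microseconds)

print(to_epoch_days(date(1970, 1, 2)))    # 1: the day after the epoch
print(to_epoch_days(date(1969, 12, 31)))  # -1: the day before the epoch
print(to_epoch_micros(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))
```

Note the Python `date`/`datetime` types follow the proleptic Gregorian calendar, matching the semantics the new scaladoc describes.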
[spark] branch master updated: [SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 777b797 [SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet 777b797 is described below commit 777b797867045c9717813e9dab2ab9012a1889fc Author: Maxim Gekk AuthorDate: Mon Apr 22 16:34:13 2019 +0900 [SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet ## What changes were proposed in this pull request? Added tests to check migration from `INT96` to `TIMESTAMP_MICROS` (`INT64`) for timestamps in parquet files. In particular: - Append `TIMESTAMP_MICROS` timestamps to **existing parquet** files with `INT96` timestamps - Append `TIMESTAMP_MICROS` timestamps to a table with `INT96` timestamps - Append `INT96` to `TIMESTAMP_MICROS` timestamps in **parquet files** - Append `INT96` to `TIMESTAMP_MICROS` timestamps in a **table** Closes #24417 from MaxGekk/parquet-timestamp-int64-tests. 
Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- .../datasources/parquet/ParquetQuerySuite.scala| 43 ++ 1 file changed, 43 insertions(+) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala index 8cc3bee..4959275 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.execution.datasources.parquet import java.io.File +import java.util.concurrent.TimeUnit import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.parquet.hadoop.ParquetOutputFormat @@ -881,6 +882,48 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } + test("Migration from INT96 to TIMESTAMP_MICROS timestamp type") { +def testMigration(fromTsType: String, toTsType: String): Unit = { + def checkAppend(write: DataFrameWriter[_] => Unit, readback: => DataFrame): Unit = { +def data(start: Int, end: Int): Seq[Row] = (start to end).map { i => + val ts = new java.sql.Timestamp(TimeUnit.SECONDS.toMillis(i)) + ts.setNanos(123456000) + Row(ts) +} +val schema = new StructType().add("time", TimestampType) +val df1 = spark.createDataFrame(sparkContext.parallelize(data(0, 1)), schema) +withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> fromTsType) { + write(df1.write) +} +val df2 = spark.createDataFrame(sparkContext.parallelize(data(2, 10)), schema) +withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> toTsType) { + write(df2.write.mode(SaveMode.Append)) +} +Seq("true", "false").foreach { vectorized => + withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) { +checkAnswer(readback, df1.unionAll(df2)) + } +} + } + + Seq(false, true).foreach { mergeSchema => +withTempPath { file 
=> + checkAppend(_.parquet(file.getCanonicalPath), +spark.read.option("mergeSchema", mergeSchema).parquet(file.getCanonicalPath)) +} + +withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> mergeSchema.toString) { + val tableName = "parquet_timestamp_migration" + withTable(tableName) { +checkAppend(_.saveAsTable(tableName), spark.table(tableName)) + } +} + } +} + +testMigration(fromTsType = "INT96", toTsType = "TIMESTAMP_MICROS") +testMigration(fromTsType = "TIMESTAMP_MICROS", toTsType = "INT96") + } } object TestingUDT { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
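The two parquet timestamp encodings being migrated between differ in layout: `TIMESTAMP_MICROS` is an `INT64` of microseconds since the epoch, while `INT96` packs nanoseconds-within-day together with a Julian day number. A simplified sketch of that conversion (the real `INT96` value is a single 12-byte little-endian field; the split into a tuple here is for illustration):

```python
JULIAN_DAY_OF_EPOCH = 2_440_588          # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000

def micros_to_int96(epoch_micros):
    """Split micros-since-epoch into INT96 parts:
    (Julian day, nanoseconds within that day)."""
    days, micros_of_day = divmod(epoch_micros, MICROS_PER_DAY)
    return JULIAN_DAY_OF_EPOCH + days, micros_of_day * 1_000

def int96_to_micros(julian_day, nanos_of_day):
    """Inverse conversion back to micros-since-epoch."""
    return ((julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY
            + nanos_of_day // 1_000)

ts = 86_400_000_000 + 123_456            # 1970-01-02T00:00:00.123456Z
jd, nanos = micros_to_int96(ts)
print(jd, nanos)                         # 2440589 123456000
```

Avoiding this Julian-day round trip on write is one reason the companion change (SPARK-27528) makes `TIMESTAMP_MICROS` the default output type.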
[spark] branch master updated: [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d36cce1 [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds d36cce1 is described below commit d36cce18e262dc9cbd687ef42f8b67a62f0a3e22 Author: Bryan Cutler AuthorDate: Mon Apr 22 19:30:31 2019 +0900 [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds ## What changes were proposed in this pull request? This increases the minimum supported version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed. ## How was this patch tested? Existing tests using: Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2 Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0 Closes #24298 from BryanCutler/arrow-bump-min-pyarrow-SPARK-27276. 
Authored-by: Bryan Cutler Signed-off-by: HyukjinKwon --- python/pyspark/serializers.py | 53 ++ python/pyspark/sql/dataframe.py| 8 +-- python/pyspark/sql/session.py | 7 +-- python/pyspark/sql/tests/test_arrow.py | 39 +++--- python/pyspark/sql/tests/test_pandas_udf.py| 48 ++--- .../sql/tests/test_pandas_udf_grouped_map.py | 28 -- python/pyspark/sql/tests/test_pandas_udf_scalar.py | 30 +++ python/pyspark/sql/types.py| 63 -- python/pyspark/sql/utils.py| 2 +- python/setup.py| 7 +-- 10 files changed, 67 insertions(+), 218 deletions(-) diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py index ed419db..6058e94 100644 --- a/python/pyspark/serializers.py +++ b/python/pyspark/serializers.py @@ -260,10 +260,14 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer): self._safecheck = safecheck self._assign_cols_by_name = assign_cols_by_name -def arrow_to_pandas(self, arrow_column, data_type): -from pyspark.sql.types import _arrow_column_to_pandas, _check_series_localize_timestamps +def arrow_to_pandas(self, arrow_column): +from pyspark.sql.types import _check_series_localize_timestamps + +# If the given column is a date type column, creates a series of datetime.date directly +# instead of creating datetime64[ns] as intermediate data to avoid overflow caused by +# datetime64[ns] type handling. 
+s = arrow_column.to_pandas(date_as_object=True) -s = _arrow_column_to_pandas(arrow_column, data_type) s = _check_series_localize_timestamps(s, self._timezone) return s @@ -275,8 +279,6 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer): :param series: A single pandas.Series, list of Series, or list of (series, arrow_type) :return: Arrow RecordBatch """ -import decimal -from distutils.version import LooseVersion import pandas as pd import pyarrow as pa from pyspark.sql.types import _check_series_convert_timestamps_internal @@ -289,24 +291,10 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer): def create_array(s, t): mask = s.isnull() # Ensure timestamp series are in expected form for Spark internal representation -# TODO: maybe don't need None check anymore as of Arrow 0.9.1 if t is not None and pa.types.is_timestamp(t): s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone) # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2 return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False) -elif t is not None and pa.types.is_string(t) and sys.version < '3': -# TODO: need decode before converting to Arrow in Python 2 -# TODO: don't need as of Arrow 0.9.1 -return pa.Array.from_pandas(s.apply( -lambda v: v.decode("utf-8") if isinstance(v, str) else v), mask=mask, type=t) -elif t is not None and pa.types.is_decimal(t) and \ -LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"): -# TODO: see ARROW-2432. Remove when the minimum PyArrow version becomes 0.10.0. -return pa.Array.from_pandas(s.apply( -lambda v: decimal.D
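The `date_as_object=True` comment above refers to an overflow hazard worth making concrete: pandas' `datetime64[ns]` dtype stores int64 nanoseconds since the epoch, so it cannot represent dates past the year 2262, while `datetime.date` objects can. The bound can be computed directly:

```python
from datetime import datetime, timedelta

# int64 nanoseconds since the epoch run out in April 2262, which is why
# converting far-future dates through datetime64[ns] as an intermediate
# representation can overflow, and why the serializer asks Arrow for
# datetime.date objects directly instead.
max_ns = 2**63 - 1
limit = datetime(1970, 1, 1) + timedelta(microseconds=max_ns // 1000)
print(limit)  # 2262-04-11 23:47:16.854775
```

Any date beyond that limit survives as a `datetime.date` but would overflow an int64 nanosecond counter.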
[spark] branch master updated: [SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 93a264d [SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks 93a264d is described below commit 93a264d05a55c2617d34e977dbaf182987187a27 Author: Maxim Gekk AuthorDate: Tue Apr 23 11:09:14 2019 +0900 [SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks ## What changes were proposed in this pull request? Added new JSON benchmarks related to date and timestamps operations: - Write date/timestamp to JSON files - `to_json()` and `from_json()` for dates and timestamps - Read date/timestamps from JSON files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing JSON benchmarks are ported on `NoOp` datasource. Closes #24430 from MaxGekk/json-datetime-benchmark. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- sql/core/benchmarks/JSONBenchmark-results.txt | 79 + .../execution/datasources/json/JsonBenchmark.scala | 126 - 2 files changed, 179 insertions(+), 26 deletions(-) diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt b/sql/core/benchmarks/JSONBenchmark-results.txt index 2b784c3..7846983 100644 --- a/sql/core/benchmarks/JSONBenchmark-results.txt +++ b/sql/core/benchmarks/JSONBenchmark-results.txt @@ -7,77 +7,106 @@ Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz JSON schema inferring:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 51280 51722 420 2.0 512.8 1.0X -UTF-8 is set 75009 77276 1963 1.3 750.1 0.7X +No encoding 50949 51086 150 2.0 509.5 1.0X +UTF-8 is set 72012 72147 120 1.4 720.1 0.7X Preparing data for benchmarking ... 
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz count a short column: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 39675 39738 83 2.5 396.7 1.0X -UTF-8 is set 62755 64399 1436 1.6 627.5 0.6X +No encoding 36799 36891 80 2.7 368.0 1.0X +UTF-8 is set 59796 59880 74 1.7 598.0 0.6X Preparing data for benchmarking ... Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 56429 56468 65 0.25642.9 1.0X -UTF-8 is set 81078 81454 374 0.18107.8 0.7X +No encoding 55803 55967 152 0.25580.3 1.0X +UTF-8 is set 80623 80825 178 0.18062.3 0.7X Preparing data for benchmarking ... Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz select wide row: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 95329 95557 265 0.0 190658.2 1.0X -UTF-8 is set 102827 102967 166 0.0 205654.2 0.9X +No encoding 84263 85750 1476 0.0 168526.2 1.0X +UTF-8 is set
[spark] branch master updated: [SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 55f26d8 [SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks 55f26d8 is described below commit 55f26d809008d26e9727874128aee0a61dcfea00 Author: Maxim Gekk AuthorDate: Tue Apr 23 11:08:02 2019 +0900 [SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks ## What changes were proposed in this pull request? Added new CSV benchmarks related to date and timestamps operations: - Write date/timestamp to CSV files - `to_csv()` and `from_csv()` for dates and timestamps - Read date/timestamps from CSV files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing CSV benchmarks are ported on `NoOp` datasource. Closes #24429 from MaxGekk/csv-timestamp-benchmark. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- sql/core/benchmarks/CSVBenchmark-results.txt | 73 +--- .../execution/datasources/csv/CSVBenchmark.scala | 201 ++--- 2 files changed, 229 insertions(+), 45 deletions(-) diff --git a/sql/core/benchmarks/CSVBenchmark-results.txt b/sql/core/benchmarks/CSVBenchmark-results.txt index 4fef15b..888c2ce 100644 --- a/sql/core/benchmarks/CSVBenchmark-results.txt +++ b/sql/core/benchmarks/CSVBenchmark-results.txt @@ -2,29 +2,58 @@ Benchmark to measure CSV read/write performance -Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic -Intel(R) Xeon(R) CPU @ 2.50GHz -Parsing quoted values: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - -One quoted string 49754 / 50158 0.0 995072.2 1.0X +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +Parsing quoted values:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative + +One quoted string 36998 37134 120 0.0 739953.1 1.0X -Java HotSpot(TM) 64-Bit Server VM 
1.8.0_191-b12 on Linux 3.16.0-31-generic -Intel(R) Xeon(R) CPU @ 2.50GHz -Wide rows with 1000 columns: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - -Select 1000 columns 149402 / 151785 0.0 149401.9 1.0X -Select 100 columns 42986 / 43985 0.0 42986.1 3.5X -Select one column 33764 / 34057 0.0 33763.6 4.4X -count() 9332 / 9508 0.1 9332.2 16.0X -Select 100 columns, one bad input field 50963 / 51512 0.0 50962.5 2.9X -Select 100 columns, corrupt record field69627 / 71029 0.0 69627.5 2.1X +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative + +Select 1000 columns 140620 141162 737 0.0 140620.5 1.0X +Select 100 columns35170 35287 183 0.0 35170.0 4.0X +Select one column 27711 27927 187 0.0 27710.9 5.1X +count()7707 7804 84 0.17707.4 18.2X +Select 100 columns, one bad input field 41762 41851 117 0.0 41761.8 3.4X +Select 100 columns, corrupt record field 48717 48761 44 0.0 48717.4 2.9X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic -Intel(R) Xeon(R) CPU @ 2.50GHz -Count a dataset with 10 columns: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - -Select 10 columns + count() 22588 / 22623 0.4 2258.8 1.0X -Select 1 column + count
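The "Best Time(ms) / Avg Time(ms) / Stdev(ms)" columns in the two benchmark result files above follow the usual run-several-times scheme, which can be sketched in a few lines (this mirrors the reporting format only, not Spark's `Benchmark` utility itself):

```python
import statistics
import time

def benchmark(f, iters=3):
    """Run f several times; return (best, avg, stdev) in milliseconds,
    the three columns reported in the JSON/CSV benchmark results."""
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        f()
        times.append((time.perf_counter() - start) * 1000.0)
    return min(times), statistics.mean(times), statistics.stdev(times)

best, avg, stdev = benchmark(lambda: sum(range(100_000)))
print(f"Best {best:.1f} ms, Avg {avg:.1f} ms, Stdev {stdev:.1f} ms")
```

Reporting the best rather than only the average reduces the influence of warm-up and background noise, which is why the result files lead with it.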
[spark] branch master updated: [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 43a73e3 [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default 43a73e3 is described below commit 43a73e387cb843486adcf5b8bbd8b99010ce6e02 Author: Maxim Gekk AuthorDate: Tue Apr 23 11:06:39 2019 +0900 [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default ## What changes were proposed in this pull request? In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type matches semantically to Catalyst's `TimestampType`, and stores microseconds since epoch in UTC time zone. This will allow to avoid conversions of microseconds to nanoseconds and to Julian calendar. Also this will reduce sizes of written parquet files. ## How was this patch tested? By existing test suites. Closes #24425 from MaxGekk/parquet-timestamp_micros. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- docs/sql-migration-guide-upgrade.md | 2 ++ .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala| 2 +- .../datasources/parquet/ParquetInteroperabilitySuite.scala| 8 ++-- .../test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala | 2 +- 4 files changed, 10 insertions(+), 4 deletions(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 90a7d8d..54512ae 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -124,6 +124,8 @@ license: | - In Spark version 2.4, when a spark session is created via `cloneSession()`, the newly created spark session inherits its configuration from its parent `SparkContext` even though the same configuration may exist with a different value in its parent spark session. 
Since Spark 3.0, the configurations of a parent `SparkSession` have a higher precedence over the parent `SparkContext`. + - Since Spark 3.0, parquet logical type `TIMESTAMP_MICROS` is used by default while saving `TIMESTAMP` columns. In Spark version 2.4 and earlier, `TIMESTAMP` columns are saved as `INT96` in parquet files. To set `INT96` to `spark.sql.parquet.outputTimestampType` restores the previous behavior. + ## Upgrading from Spark SQL 2.4 to 2.4.1 - The value of `spark.executor.heartbeatInterval`, when specified without units like "30" rather than "30s", was diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index b223a48..9ebd2c0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -405,7 +405,7 @@ object SQLConf { .stringConf .transform(_.toUpperCase(Locale.ROOT)) .checkValues(ParquetOutputTimestampType.values.map(_.toString)) -.createWithDefault(ParquetOutputTimestampType.INT96.toString) +.createWithDefault(ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) val PARQUET_INT64_AS_TIMESTAMP_MILLIS = buildConf("spark.sql.parquet.int64AsTimestampMillis") .doc(s"(Deprecated since Spark 2.3, please set ${PARQUET_OUTPUT_TIMESTAMP_TYPE.key}.) 
" + diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala index f06e186..09793bd 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala @@ -120,8 +120,12 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS ).map { s => java.sql.Timestamp.valueOf(s) } import testImplicits._ // match the column names of the file from impala - val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts") - df.write.parquet(tableDir.getAbsolutePath) + withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> +SQLConf.ParquetOutputTimestampType.INT96.toString) { +val df = spark.createDataset(ts).toDF().repartition(1) + .withColumnRenamed("value", "ts") +df.write.parquet(tableDir.getAbsolutePath) + } FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq")) Seq(false, true).foreach { int96TimestampConversion => diff --git a/sql/core/src/test/scala/org/ap
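The layout difference the commit message describes can be sketched in plain Python. This is an illustration only, not Spark's writer code: `to_timestamp_micros` and `to_int96` are hypothetical helper names, and the INT96 layout shown (nanoseconds within the day plus a Julian day number) is the legacy Impala/Hive convention the change moves away from.

```python
from datetime import datetime, timezone

EPOCH_JULIAN_DAY = 2440588  # Julian day number of 1970-01-01

def to_timestamp_micros(ts):
    # TIMESTAMP_MICROS: a single int64, microseconds since the Unix epoch in UTC.
    # This matches Catalyst's TimestampType directly, so no conversion is needed.
    return int(ts.timestamp()) * 1_000_000 + ts.microsecond

def to_int96(ts):
    # INT96: 8 bytes of nanoseconds within the day + 4 bytes of Julian day number,
    # i.e. the extra nanosecond/Julian-calendar conversions the change avoids.
    micros = to_timestamp_micros(ts)
    days, micros_in_day = divmod(micros, 86_400 * 1_000_000)
    return (micros_in_day * 1_000, EPOCH_JULIAN_DAY + days)

ts = datetime(2019, 4, 23, 11, 6, 39, tzinfo=timezone.utc)
print(to_timestamp_micros(ts))  # 1556017599000000
print(to_int96(ts))             # (39999000000000, 2458597)
```

The single-int64 form is also what makes the written files smaller than the 12-byte INT96 encoding.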
[spark] branch master updated: [SPARK-27470][PYSPARK] Update pyrolite to 4.23
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8718367 [SPARK-27470][PYSPARK] Update pyrolite to 4.23 8718367 is described below commit 8718367e2e739f1ed82997b9f4a1298b7a1c4e49 Author: Sean Owen AuthorDate: Tue Apr 16 19:41:40 2019 +0900 [SPARK-27470][PYSPARK] Update pyrolite to 4.23 ## What changes were proposed in this pull request? Update pyrolite to 4.23 to pick up bug and security fixes. ## How was this patch tested? Existing tests. Closes #24381 from srowen/SPARK-27470. Authored-by: Sean Owen Signed-off-by: HyukjinKwon --- core/pom.xml | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- dev/deps/spark-deps-hadoop-3.2 | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 45bda44..9d57028 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -347,7 +347,7 @@ net.razorvine pyrolite - 4.13 + 4.23 net.razorvine diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 00dc2ce..8ae59cc 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -170,7 +170,7 @@ parquet-hadoop-bundle-1.6.0.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.13.jar +pyrolite-4.23.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar scala-parser-combinators_2.12-1.1.0.jar diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2 index 97085d6..bbb0d73 100644 --- a/dev/deps/spark-deps-hadoop-3.2 +++ b/dev/deps/spark-deps-hadoop-3.2 @@ -191,7 +191,7 @@ parquet-hadoop-bundle-1.6.0.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.13.jar +pyrolite-4.23.jar re2j-1.1.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
[spark] branch master updated: [MINOR][BUILD] Update genjavadoc to 0.13
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 596a5ff [MINOR][BUILD] Update genjavadoc to 0.13 596a5ff is described below commit 596a5ff2737e531fbca2f31db1eb9aadd8f08882 Author: Sean Owen AuthorDate: Wed Apr 24 13:44:48 2019 +0900 [MINOR][BUILD] Update genjavadoc to 0.13 ## What changes were proposed in this pull request? Kind of related to https://github.com/gatorsmile/spark/pull/5 - let's update genjavadoc to see if it generates fewer spurious javadoc errors to begin with. ## How was this patch tested? Existing docs build Closes #24443 from srowen/genjavadoc013. Authored-by: Sean Owen Signed-off-by: HyukjinKwon --- .../org/apache/spark/rpc/RpcCallContext.scala | 2 +- .../spark/status/api/v1/ApiRootResource.scala | 6 +++--- .../org/apache/spark/util/SizeEstimator.scala | 2 +- .../streaming/kinesis/SparkAWSCredentials.scala| 4 ++-- .../main/scala/org/apache/spark/ml/ann/Layer.scala | 6 +++--- .../org/apache/spark/ml/attribute/attributes.scala | 2 +- .../org/apache/spark/ml/stat/Correlation.scala | 2 +- .../org/apache/spark/ml/tree/treeParams.scala | 25 +++--- .../mllib/stat/test/StreamingTestMethod.scala | 2 +- project/SparkBuild.scala | 2 +- .../apache/spark/sql/hive/client/HiveClient.scala | 6 +++--- 11 files changed, 30 insertions(+), 29 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala b/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala index 117f51c..f6b2059 100644 --- a/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala +++ b/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala @@ -24,7 +24,7 @@ package org.apache.spark.rpc private[spark] trait RpcCallContext { /** - * Reply a message to the sender. If the sender is [[RpcEndpoint]], its [[RpcEndpoint.receive]] + * Reply a message to the sender. 
If the sender is [[RpcEndpoint]], its `RpcEndpoint.receive` * will be called. */ def reply(response: Any): Unit diff --git a/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala b/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala index 84c2ad4..83f76db 100644 --- a/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala +++ b/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala @@ -77,7 +77,7 @@ private[spark] trait UIRoot { /** * Runs some code with the current SparkUI instance for the app / attempt. * - * @throws NoSuchElementException If the app / attempt pair does not exist. + * @throws java.util.NoSuchElementException If the app / attempt pair does not exist. */ def withSparkUI[T](appId: String, attemptId: Option[String])(fn: SparkUI => T): T @@ -85,8 +85,8 @@ private[spark] trait UIRoot { def getApplicationInfo(appId: String): Option[ApplicationInfo] /** - * Write the event logs for the given app to the [[ZipOutputStream]] instance. If attemptId is - * [[None]], event logs for all attempts of this application will be written out. + * Write the event logs for the given app to the `ZipOutputStream` instance. If attemptId is + * `None`, event logs for all attempts of this application will be written out. */ def writeEventLogs(appId: String, attemptId: Option[String], zipStream: ZipOutputStream): Unit = { Response.serverError() diff --git a/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala b/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala index e09f1fc..09c69f5 100644 --- a/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala +++ b/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala @@ -34,7 +34,7 @@ import org.apache.spark.util.collection.OpenHashSet /** * A trait that allows a class to give [[SizeEstimator]] more accurate size estimation. * When a class extends it, [[SizeEstimator]] will query the `estimatedSize` first. 
- * If `estimatedSize` does not return [[None]], [[SizeEstimator]] will use the returned size + * If `estimatedSize` does not return `None`, [[SizeEstimator]] will use the returned size * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work. * The difference between a [[KnownSizeEstimation]] and * [[org.apache.spark.util.collection.SizeTracker]] is that, a diff --git a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SparkAWSCredentials.scala b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SparkAWSCredentials.scala index dcb60b2..7488971 100644 --- a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SparkAWSCredentials.sc
[spark] branch master updated: [SPARK-27512][SQL] Avoid to replace ', ' in CSV's decimal type inference for backward compatibility
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a30983d [SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility a30983d is described below commit a30983db575de5c87b3a4698b223229327fd65cf Author: HyukjinKwon AuthorDate: Wed Apr 24 16:22:07 2019 +0900 [SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility ## What changes were proposed in this pull request? The code below currently infers as decimal but previously it was inferred as string. **In branch-2.4**, type inference path for decimal and parsing data are different. https://github.com/apache/spark/blob/2a8343121e62aabe5c69d1e20fbb2c01e2e520e7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L153 https://github.com/apache/spark/blob/c284c4e1f6f684ca8db1cc446fdcc43b46e3413c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L125 So the code below: ```scala scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema() ``` produced string as its type. ``` root |-- _c0: string (nullable = true) ``` **In the current master**, it now infers decimal as below: ``` root |-- _c0: decimal(2,0) (nullable = true) ``` It happened after https://github.com/apache/spark/pull/22979 because, now after this PR, we only have one way to parse decimal: https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L92 **After the fix:** ``` root |-- _c0: string (nullable = true) ``` This PR proposes to restore the previous behaviour back in `CSVInferSchema`. ## How was this patch tested? Manually tested and unit tests were added. 
Closes #24437 from HyukjinKwon/SPARK-27512. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala | 7 ++- .../org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala| 4 +++- .../org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala | 7 +++ 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala index ae9f057..03cc3cb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.catalyst.csv +import java.util.Locale + import scala.util.control.Exception.allCatch import org.apache.spark.rdd.RDD @@ -32,7 +34,10 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { options.zoneId, options.locale) - private val decimalParser = { + private val decimalParser = if (options.locale == Locale.US) { +// Special handling the default locale for backward compatibility +s: String => new java.math.BigDecimal(s) + } else { ExprUtils.getDecimalParser(options.locale) } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala index c2b525a..24d909e 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala @@ -185,6 +185,8 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper { assert(inferSchema.inferField(NullType, input) == expectedType) } -Seq("en-US", "ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0))) +// input like '1,0' is inferred as strings 
for backward compatibility. +Seq("en-US").foreach(checkDecimalInfer(_, StringType)) +Seq("ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0))) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index 6584a30..90deade 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/c
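The restored two-path behavior can be modeled in Python (a simplified sketch, not Spark's code: `parse_decimal_default` stands in for `new java.math.BigDecimal(s)` on the default en-US path, and `parse_decimal_locale` is a hypothetical simplification of the `DecimalFormat`-based parser from `ExprUtils.getDecimalParser`).

```python
from decimal import Decimal, InvalidOperation

def parse_decimal_default(s):
    # Default en-US path: strict parse, like new java.math.BigDecimal(s),
    # which rejects grouping separators such as ','.
    return Decimal(s)

def parse_decimal_locale(s, grouping=".", decimal_point=","):
    # Locale-aware path (simplified model for e.g. de-DE, where '.' groups
    # digits and ',' is the decimal point).
    return Decimal(s.replace(grouping, "").replace(decimal_point, "."))

def infer_field(s):
    # Sketch of the inference step: decimal if the default parser accepts the
    # token, otherwise fall back to string.
    try:
        parse_decimal_default(s)
        return "decimal"
    except InvalidOperation:
        return "string"

print(infer_field("1,2"))               # string -- backward-compatible again
print(infer_field("1.2"))               # decimal
print(parse_decimal_locale("1.234,5"))  # 1234.5 under the de-DE-style parser
```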
[spark] branch master updated: [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2234667 [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite 2234667 is described below commit 2234667b159bf19a68758da3ff20cfae3c058c25 Author: Wenchen Fan AuthorDate: Fri Apr 26 16:37:43 2019 +0900 [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/ ## How was this patch tested? manually. Closes #24454 from cloud-fan/test. Authored-by: Wenchen Fan Signed-off-by: HyukjinKwon --- .../sql/hive/HiveExternalCatalogVersionsSuite.scala | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index 0a05ec5..ec10295 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -22,6 +22,7 @@ import java.nio.charset.StandardCharsets import java.nio.file.{Files, Paths} import scala.sys.process._ +import scala.util.control.NonFatal import org.apache.hadoop.conf.Configuration @@ -169,6 +170,10 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { """.stripMargin.getBytes("utf8")) // scalastyle:on line.size.limit +if (PROCESS_TABLES.testingVersions.isEmpty) { + fail("Fail to get the latest Spark versions to test.") +} + PROCESS_TABLES.testingVersions.zipWithIndex.foreach { case (version, index) => val sparkHome = new File(sparkTestingDir,
s"spark-$version") if (!sparkHome.exists()) { @@ -206,7 +211,19 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.3.3", "2.4.2") + val testingVersions: Seq[String] = { +import scala.io.Source +try { + Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString +.split("\n") +.filter(_.contains("""<li><a href="spark-""")) +.map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) +.filter(_ < org.apache.spark.SPARK_VERSION) +} catch { + // do not throw exception during object initialization. + case NonFatal(_) => Nil +} + } protected var spark: SparkSession = _ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
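The list-then-filter logic of the suite can be sketched in Python against a canned directory page. The HTML below is an illustrative snapshot of the dist.apache.org listing, and `SPARK_VERSION` is an assumed value, not read from a real build.

```python
import re

# Illustrative snapshot of https://dist.apache.org/repos/dist/release/spark/
listing = """
<li><a href="spark-2.3.3/">spark-2.3.3/</a></li>
<li><a href="spark-2.4.2/">spark-2.4.2/</a></li>
<li><a href="KEYS">KEYS</a></li>
"""

SPARK_VERSION = "3.0.0-SNAPSHOT"  # assumed version of the build under test

def downloadable_versions(html):
    # Pull the spark-X.Y.Z directory links, then keep only versions older than
    # the one being built (mirroring the suite's final filter).
    versions = re.findall(r'<a href="spark-(\d\.\d\.\d)/">', html)
    return [v for v in versions if v < SPARK_VERSION]

print(downloadable_versions(listing))  # ['2.3.3', '2.4.2']
```

Returning an empty list on any fetch/parse failure (rather than raising) matches the suite's choice to avoid exceptions during object initialization and fail the test explicitly instead.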
[spark] branch master updated: [SPARK-28129][SQL][TEST] Port float8.sql
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f74ad3d [SPARK-28129][SQL][TEST] Port float8.sql f74ad3d is described below commit f74ad3d7004e833b6dbc07d6281407ab89ef2d32 Author: Yuming Wang AuthorDate: Tue Jul 16 19:31:20 2019 +0900 [SPARK-28129][SQL][TEST] Port float8.sql ## What changes were proposed in this pull request? This PR is to port float8.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/float8.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/float8.out When porting the test cases, found six PostgreSQL specific features that do not exist in Spark SQL: [SPARK-28060](https://issues.apache.org/jira/browse/SPARK-28060): Double type can not accept some special inputs [SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Spark SQL does not support prefix operators `@` and `|/` [SPARK-28061](https://issues.apache.org/jira/browse/SPARK-28061): Support for converting float to binary format [SPARK-23906](https://issues.apache.org/jira/browse/SPARK-23906): Support Truncate number [SPARK-28134](https://issues.apache.org/jira/browse/SPARK-28134): Missing Trigonometric Functions Also, found two bugs: [SPARK-28024](https://issues.apache.org/jira/browse/SPARK-28024): Incorrect value when out of range [SPARK-28135](https://issues.apache.org/jira/browse/SPARK-28135): ceil/ceiling/floor/power returns incorrect values Also, found four inconsistent behaviors: [SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert bad inputs to NULL [SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL returns NULL when dividing by zero [SPARK-28007](https://issues.apache.org/jira/browse/SPARK-28007): Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres ## How was this patch tested? N/A Closes #24931 from wangyum/SPARK-28129. Authored-by: Yuming Wang Signed-off-by: HyukjinKwon --- .../resources/sql-tests/inputs/pgSQL/float8.sql| 500 .../sql-tests/results/pgSQL/float8.sql.out | 839 + 2 files changed, 1339 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float8.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float8.sql new file mode 100644 index 000..6f8e3b5 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float8.sql @@ -0,0 +1,500 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- FLOAT8 +-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/float8.sql + +CREATE TABLE FLOAT8_TBL(f1 double) USING parquet; + +INSERT INTO FLOAT8_TBL VALUES ('0.0 '); +INSERT INTO FLOAT8_TBL VALUES ('1004.30 '); +INSERT INTO FLOAT8_TBL VALUES (' -34.84'); +INSERT INTO FLOAT8_TBL VALUES ('1.2345678901234e+200'); +INSERT INTO FLOAT8_TBL VALUES ('1.2345678901234e-200'); + +-- [SPARK-28024] Incorrect numeric values when out of range +-- test for underflow and overflow handling +SELECT double('10e400'); +SELECT double('-10e400'); +SELECT double('10e-400'); +SELECT double('-10e-400'); + +-- [SPARK-28061] Support for converting float to binary format +-- test smallest normalized input +-- SELECT float8send('2.2250738585072014E-308'::float8); + +-- [SPARK-27923] Spark SQL insert there bad inputs to NULL +-- bad input +-- INSERT INTO FLOAT8_TBL VALUES (''); +-- INSERT INTO FLOAT8_TBL VALUES (' '); +-- INSERT INTO FLOAT8_TBL VALUES ('xyz'); +-- INSERT INTO FLOAT8_TBL VALUES ('5.0.0'); +-- INSERT INTO FLOAT8_TBL VALUES ('5 . 0'); +-- INSERT INTO FLOAT8_TBL VALUES ('5. 
0'); +-- INSERT INTO FLOAT8_TBL VALUES ('- 3'); +-- INSERT INTO FLOAT8_TBL VALUES ('123 5'); + +-- special inputs +SELECT double('NaN'); +-- [SPARK-28060] Double type can not accept some special inputs +SELECT double('nan'); +SELECT double(' NAN '); +SELECT double('infinity'); +SELECT double(' -INFINiTY '); +-- [SPARK-27923] Spark SQL insert there bad special inputs to NULL +-- bad special inputs +SELECT double('N A N'); +SELECT double('NaN x'); +SELECT double(' INFINITYx'); + +SELECT double('Infinity') + 100.0; +-- [SPARK-27768] Infinity, -Infinity, NaN should be recognized in a case insensitive manner +SELECT double('Infinity') / double('Infinity'); +SELECT double('NaN') / double('NaN'); +-- [SPARK-28315] Decimal can not accept NaN as input +SELECT double(decimal('nan
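Several of the linked tickets concern lenient parsing of special floating-point inputs. A rough Python model of the cast behavior under discussion (hedged: Spark's actual rules differ per ticket, e.g. case sensitivity before SPARK-27768, and out-of-range handling per SPARK-28024):

```python
import math

def cast_to_double(s):
    # Sketch: trim, try to parse, and return None (Spark's NULL) for bad input
    # instead of raising -- the SPARK-27923 behavior the test comments describe.
    try:
        return float(s.strip())
    except ValueError:
        return None

print(cast_to_double(" -INFINiTY "))  # -inf (Python's float() ignores case)
print(cast_to_double("N A N"))        # None -> NULL, not an error
print(cast_to_double("10e400"))       # inf: overflow saturates here, which is
                                      # the out-of-range case SPARK-28024 covers
print(float("Infinity") / float("Infinity"))  # nan
```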
[spark] branch master updated: [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d83f84a [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles d83f84a is described below commit d83f84a1229138b7935b4b18da80a96d3c5c3dde Author: Josh Rosen AuthorDate: Wed Jun 26 09:11:28 2019 +0900 [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles ## What changes were proposed in this pull request? Spark's `InMemoryFileIndex` contains two places where `FileNotFound` exceptions are caught and logged as warnings (during [directory listing](https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274) and [block location lookup](https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources [...] I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-27676)). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's [...] 
There may be some cases where users _do_ want to ignore missing files, but I think that should be an opt-in behavior via the existing `spark.sql.files.ignoreMissingFiles` flag (the current behavior is itself race-prone because a file might be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors (which are handled in a way that _does_ respect `ignoreMissingFiles`)). This PR updates `InMemoryFileIndex` to guard the log-and-ignore-FileNotFoundException behind the existing `spark.sql.files.ignoreMissingFiles` flag. **Note**: this is a change of default behavior, so I think it needs to be mentioned in release notes. ## How was this patch tested? New unit tests to simulate file-deletion race conditions, tested with both values of the `ignoreMissingFiles` flag. Closes #24668 from JoshRosen/SPARK-27676. Lead-authored-by: Josh Rosen Co-authored-by: Josh Rosen Signed-off-by: HyukjinKwon --- docs/sql-migration-guide-upgrade.md| 2 + .../spark/sql/execution/command/CommandUtils.scala | 2 +- .../execution/datasources/InMemoryFileIndex.scala | 71 ++-- .../sql/execution/datasources/FileIndexSuite.scala | 126 - 4 files changed, 188 insertions(+), 13 deletions(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index b062a04..c920f06 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -145,6 +145,8 @@ license: | - Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is obtained, then `exists` will return `null` instead of `false`. For example, `exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous behaviour can be restored by setting `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic` to `false`.
they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous versions, these missing files or subdirectories would be i [...] + ## Upgrading from Spark SQL 2.4 to 2.4.1 - The value of `spark.executor.heartbeatInterval`, when specified without units like "30" rather than "30s", was diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala index cac2519..b644e6d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala @@ -67,7 +67,7 @@ object CommandUtils extends Logging { override def accept(path: Path): Boolean = isDataPath(path, stagingDir) } val fileStatusSeq = InMemoryFileIndex.bulkListLeaf
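The race and the flag's effect can be reproduced with a small Python sketch. This is illustrative only: `list_leaf_files` and the `on_visit` deletion hook are hypothetical stand-ins for `InMemoryFileIndex`'s recursive listing and the new unit tests' file-deletion trick.

```python
import os, shutil, tempfile

def list_leaf_files(path, ignore_missing_files=False, on_visit=lambda p: None):
    on_visit(path)  # test hook: lets us delete a directory mid-listing
    try:
        entries = sorted(os.listdir(path))
    except FileNotFoundError:
        if ignore_missing_files:
            return []  # old default: log a warning and skip
        raise          # new default: fail fast on the inconsistency
    files = []
    for entry in entries:
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            files += list_leaf_files(full, ignore_missing_files, on_visit)
        else:
            files.append(full)
    return files

root = tempfile.mkdtemp()
sub = os.path.join(root, "sub")

def racy_delete(p):
    # Simulate a concurrent writer deleting `sub` after it was seen in the
    # parent listing but before being listed itself.
    if p == sub:
        shutil.rmtree(sub, ignore_errors=True)

os.makedirs(sub)
print(list_leaf_files(root, ignore_missing_files=True, on_visit=racy_delete))  # []

os.makedirs(sub)
try:
    list_leaf_files(root, ignore_missing_files=False, on_visit=racy_delete)
except FileNotFoundError:
    print("listing failed fast")
```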
[spark] branch master updated: [SPARK-28054][SQL][FOLLOW-UP] Fix error when insert Hive partitioned table dynamically where partition name is upper case
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f148674 [SPARK-28054][SQL][FOLLOW-UP] Fix error when insert Hive partitioned table dynamically where partition name is upper case f148674 is described below commit f1486742fa032d838c0730fcb968e42ac145acc8 Author: Liang-Chi Hsieh AuthorDate: Tue Jul 2 14:57:24 2019 +0900 [SPARK-28054][SQL][FOLLOW-UP] Fix error when insert Hive partitioned table dynamically where partition name is upper case ## What changes were proposed in this pull request? This is a small follow-up for SPARK-28054 to fix wrong indent and use `withSQLConf` as suggested by gatorsmile. ## How was this patch tested? Existing tests. Closes #24971 from viirya/SPARK-28054-followup. Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- .../spark/sql/hive/execution/SaveAsHiveFile.scala | 2 +- .../spark/sql/hive/execution/HiveQuerySuite.scala | 21 +++-- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala index 234acb7..62d3bad 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala @@ -89,7 +89,7 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { // we also need to lowercase the column names in written partition paths. 
// scalastyle:off caselocale val hiveCompatiblePartitionColumns = partitionAttributes.map { attr => - attr.withName(attr.name.toLowerCase) + attr.withName(attr.name.toLowerCase) } // scalastyle:on caselocale diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala index 13a533c..6986963 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala @@ -1190,19 +1190,20 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd } test("SPARK-28054: Unable to insert partitioned table when partition name is upper case") { -withTable("spark_28054_test") { - sql("set hive.exec.dynamic.partition.mode=nonstrict") - sql("CREATE TABLE spark_28054_test (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING)") +withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") { + withTable("spark_28054_test") { +sql("CREATE TABLE spark_28054_test (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING)") - sql("INSERT INTO TABLE spark_28054_test PARTITION(DS) SELECT 'k' KEY, 'v' VALUE, '1' DS") +sql("INSERT INTO TABLE spark_28054_test PARTITION(DS) SELECT 'k' KEY, 'v' VALUE, '1' DS") - assertResult(Array(Row("k", "v", "1"))) { -sql("SELECT * from spark_28054_test").collect() - } +assertResult(Array(Row("k", "v", "1"))) { + sql("SELECT * from spark_28054_test").collect() +} - sql("INSERT INTO TABLE spark_28054_test PARTITION(ds) SELECT 'k' key, 'v' value, '2' ds") - assertResult(Array(Row("k", "v", "1"), Row("k", "v", "2"))) { -sql("SELECT * from spark_28054_test").collect() +sql("INSERT INTO TABLE spark_28054_test PARTITION(ds) SELECT 'k' key, 'v' value, '2' ds") +assertResult(Array(Row("k", "v", "1"), Row("k", "v", "2"))) { + sql("SELECT * from spark_28054_test").collect() +} } } } - To unsubscribe, e-mail: 
commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (0a4f985c -> 02f4763)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 0a4f985c [SPARK-23098][SQL] Migrate Kafka Batch source to v2. add 02f4763 [SPARK-28198][PYTHON] Add mapPartitionsInPandas to allow an iterator of DataFrames No new revisions were added by this update. Summary of changes: .../org/apache/spark/api/python/PythonRunner.scala | 2 + python/pyspark/rdd.py | 1 + python/pyspark/sql/dataframe.py| 48 +++- python/pyspark/sql/tests/test_pandas_udf_iter.py | 135 + python/pyspark/worker.py | 62 ++ .../plans/logical/pythonLogicalOperators.scala | 12 ++ .../main/scala/org/apache/spark/sql/Dataset.scala | 21 +++- .../spark/sql/execution/SparkStrategies.scala | 2 + .../python/MapPartitionsInPandasExec.scala | 95 +++ 9 files changed, 353 insertions(+), 25 deletions(-) create mode 100644 python/pyspark/sql/tests/test_pandas_udf_iter.py create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapPartitionsInPandasExec.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
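The new API's contract — a function that consumes an iterator of pandas DataFrames per partition and yields an iterator of DataFrames back — can be modeled with plain Python generators. In this dependency-free sketch, lists of dicts stand in for pandas DataFrames, and the function names are illustrative, not the PySpark API.

```python
def map_partitions_in_batches(partitions, func):
    # Per partition, hand `func` an iterator of batches and stream its output
    # batches back -- the shape of the mapPartitionsInPandas contract.
    for partition in partitions:
        yield list(func(iter(partition)))

def drop_nonpositive(batches):
    for batch in batches:  # one "DataFrame" at a time, never the whole partition
        yield [row for row in batch if row["id"] > 0]

partitions = [
    [[{"id": -1}, {"id": 1}]],   # partition 0: one batch
    [[{"id": 2}], [{"id": 0}]],  # partition 1: two batches
]
print(list(map_partitions_in_batches(partitions, drop_nonpositive)))
# [[[{'id': 1}]], [[{'id': 2}], []]]
```

Streaming batch-by-batch is the point of the iterator form: the UDF never has to materialize an entire partition in memory at once.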
[spark] branch master updated: [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 048224c [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation 048224c is described below commit 048224ce9a3bdb304ba24852ecc66c7f14c25c11 Author: Marco Gaido AuthorDate: Mon Jul 1 11:40:12 2019 +0900 [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation ## What changes were proposed in this pull request? The documentation in `linalg.py` is not consistent. This PR uniforms the documentation. ## How was this patch tested? NA Closes #25011 from mgaido91/SPARK-28170. Authored-by: Marco Gaido Signed-off-by: HyukjinKwon --- python/pyspark/ml/linalg/__init__.py | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/python/pyspark/ml/linalg/__init__.py b/python/pyspark/ml/linalg/__init__.py index f99161c..f6ddc09 100644 --- a/python/pyspark/ml/linalg/__init__.py +++ b/python/pyspark/ml/linalg/__init__.py @@ -386,14 +386,14 @@ class DenseVector(Vector): def toArray(self): """ -Returns an numpy.ndarray +Returns the underlying numpy.ndarray """ return self.array @property def values(self): """ -Returns a list of values +Returns the underlying numpy.ndarray """ return self.array @@ -681,7 +681,7 @@ class SparseVector(Vector): def toArray(self): """ -Returns a copy of this SparseVector as a 1-dimensional NumPy array. +Returns a copy of this SparseVector as a 1-dimensional numpy.ndarray. """ arr = np.zeros((self.size,), dtype=np.float64) arr[self.indices] = self.values @@ -862,7 +862,7 @@ class Matrix(object): def toArray(self): """ -Returns its elements in a NumPy ndarray. +Returns its elements in a numpy.ndarray. 
""" raise NotImplementedError @@ -937,7 +937,7 @@ class DenseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray >>> m = DenseMatrix(2, 2, range(4)) >>> m.toArray() @@ -1121,7 +1121,7 @@ class SparseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray """ A = np.zeros((self.numRows, self.numCols), dtype=np.float64, order='F') for k in xrange(self.colPtrs.size - 1): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new d57b392 [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation d57b392 is described below commit d57b392961167bb8dfd888d79efa947277d62ec9 Author: Marco Gaido AuthorDate: Mon Jul 1 11:40:12 2019 +0900 [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation ## What changes were proposed in this pull request? The documentation in `linalg.py` is not consistent. This PR uniforms the documentation. ## How was this patch tested? NA Closes #25011 from mgaido91/SPARK-28170. Authored-by: Marco Gaido Signed-off-by: HyukjinKwon (cherry picked from commit 048224ce9a3bdb304ba24852ecc66c7f14c25c11) Signed-off-by: HyukjinKwon --- python/pyspark/ml/linalg/__init__.py | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/python/pyspark/ml/linalg/__init__.py b/python/pyspark/ml/linalg/__init__.py index 9da9836..9261d65 100644 --- a/python/pyspark/ml/linalg/__init__.py +++ b/python/pyspark/ml/linalg/__init__.py @@ -386,14 +386,14 @@ class DenseVector(Vector): def toArray(self): """ -Returns an numpy.ndarray +Returns the underlying numpy.ndarray """ return self.array @property def values(self): """ -Returns a list of values +Returns the underlying numpy.ndarray """ return self.array @@ -681,7 +681,7 @@ class SparseVector(Vector): def toArray(self): """ -Returns a copy of this SparseVector as a 1-dimensional NumPy array. +Returns a copy of this SparseVector as a 1-dimensional numpy.ndarray. """ arr = np.zeros((self.size,), dtype=np.float64) arr[self.indices] = self.values @@ -862,7 +862,7 @@ class Matrix(object): def toArray(self): """ -Returns its elements in a NumPy ndarray. +Returns its elements in a numpy.ndarray. 
""" raise NotImplementedError @@ -937,7 +937,7 @@ class DenseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray >>> m = DenseMatrix(2, 2, range(4)) >>> m.toArray() @@ -1121,7 +1121,7 @@ class SparseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray """ A = np.zeros((self.numRows, self.numCols), dtype=np.float64, order='F') for k in xrange(self.colPtrs.size - 1): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7083ec0 [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API 7083ec0 is described below commit 7083ec051ed47b9c6500f2107650814f0ff9206f Author: Liang-Chi Hsieh AuthorDate: Mon Jul 1 13:21:06 2019 +0900 [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API ## What changes were proposed in this pull request? New R api of Arrow has removed `as_tibble` as of https://github.com/apache/arrow/commit/2ef96c8623cbad1770f82e97df733bd881ab967b. Arrow optimization for DataFrame in R doesn't work due to the change. This can be tested as below, after installing latest Arrow: ``` ./bin/sparkR --conf spark.sql.execution.arrow.sparkr.enabled=true ``` ``` > collect(createDataFrame(mtcars)) ``` Before this PR: ``` > collect(createDataFrame(mtcars)) Error in get("as_tibble", envir = asNamespace("arrow")) : object 'as_tibble' not found ``` After: ``` > collect(createDataFrame(mtcars)) mpg cyl disp hp dratwt qsec vs am gear carb 1 21.0 6 160.0 110 3.90 2.620 16.46 0 144 2 21.0 6 160.0 110 3.90 2.875 17.02 0 144 3 22.8 4 108.0 93 3.85 2.320 18.61 1 141 ... ``` ## How was this patch tested? Manual test. Closes #25012 from viirya/SPARK-28215. 
Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- R/pkg/R/DataFrame.R | 10 -- R/pkg/R/deserialize.R | 13 ++--- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 439cad0..6f3c7c1 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1203,7 +1203,8 @@ setMethod("collect", requireNamespace1 <- requireNamespace if (requireNamespace1("arrow", quietly = TRUE)) { read_arrow <- get("read_arrow", envir = asNamespace("arrow"), inherits = FALSE) -as_tibble <- get("as_tibble", envir = asNamespace("arrow")) +# Arrow drops `as_tibble` since 0.14.0, see ARROW-5190. +useAsTibble <- exists("as_tibble", envir = asNamespace("arrow")) portAuth <- callJMethod(x@sdf, "collectAsArrowToR") port <- portAuth[[1]] @@ -1213,7 +1214,12 @@ setMethod("collect", output <- tryCatch({ doServerAuth(conn, authSecret) arrowTable <- read_arrow(readRaw(conn)) - as.data.frame(as_tibble(arrowTable), stringsAsFactors = stringsAsFactors) + if (useAsTibble) { +as_tibble <- get("as_tibble", envir = asNamespace("arrow")) +as.data.frame(as_tibble(arrowTable), stringsAsFactors = stringsAsFactors) + } else { +as.data.frame(arrowTable, stringsAsFactors = stringsAsFactors) + } }, finally = { close(conn) }) diff --git a/R/pkg/R/deserialize.R b/R/pkg/R/deserialize.R index 191c51e..b38d245 100644 --- a/R/pkg/R/deserialize.R +++ b/R/pkg/R/deserialize.R @@ -237,7 +237,9 @@ readDeserializeInArrow <- function(inputCon) { if (requireNamespace1("arrow", quietly = TRUE)) { RecordBatchStreamReader <- get( "RecordBatchStreamReader", envir = asNamespace("arrow"), inherits = FALSE) -as_tibble <- get("as_tibble", envir = asNamespace("arrow")) +# Arrow drops `as_tibble` since 0.14.0, see ARROW-5190. +useAsTibble <- exists("as_tibble", envir = asNamespace("arrow")) + # Currently, there looks no way to read batch by batch by socket connection in R side, # See ARROW-4512. 
Therefore, it reads the whole Arrow streaming-formatted binary at once @@ -246,8 +248,13 @@ readDeserializeInArrow <- function(inputCon) { arrowData <- readBin(inputCon, raw(), as.integer(dataLen), endian = "big") batches <- RecordBatchStreamReader(arrowData)$batches() -# Read all groupped batches. Tibble -> data.frame is cheap. -lapply(batches, function(batch) as.data.frame(as_tibble(batch))) +if (useAsTibble) { + as_tibble <- get("as_tibble", envir = asNamespace("arrow")) + # Read all groupped batches. Tibble -> data.frame is cheap. + lapply(batches, function(batch) as.data.frame(as_tibble(batch))) +} else { + lapply(batches, function(batch) as.data.frame(batch)) +} } else { stop("'arrow' package should be installed.") }
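The fix above uses a feature-detection pattern: probe the `arrow` namespace for `as_tibble` with `exists()` and fall back to direct conversion when a newer release has removed it. A minimal sketch of the same pattern in Python (the names `collect_table`, `old_arrow`, and `new_arrow` are hypothetical stand-ins, not the real Arrow API):

```python
# Probe for an optional symbol before using it, and fall back to the
# direct conversion path when the library release no longer ships it.
from types import SimpleNamespace

def collect_table(arrow_ns, table):
    if hasattr(arrow_ns, "as_tibble"):
        # Old releases: go through the as_tibble helper first.
        return arrow_ns.as_tibble(table)
    # New releases (as_tibble removed, cf. ARROW-5190): convert directly.
    return list(table)

# Two fake namespaces standing in for old and new library versions.
old_arrow = SimpleNamespace(as_tibble=lambda t: list(reversed(t)))
new_arrow = SimpleNamespace()

print(collect_table(old_arrow, [1, 2, 3]))  # [3, 2, 1]
print(collect_table(new_arrow, [1, 2, 3]))  # [1, 2, 3]
```

Either way the caller gets a converted result, so the code keeps working across library versions without pinning one.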
[spark] branch master updated: [SPARK-28204][SQL][TESTS] Make separate two test cases for column pruning in binary files
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new facf9c3 [SPARK-28204][SQL][TESTS] Make separate two test cases for column pruning in binary files facf9c3 is described below commit facf9c30a283ec682b5adb2e7afdbf5d011e3808 Author: HyukjinKwon AuthorDate: Sat Jun 29 14:05:23 2019 +0900 [SPARK-28204][SQL][TESTS] Make separate two test cases for column pruning in binary files ## What changes were proposed in this pull request? SPARK-27534 missed to address my own comments at https://github.com/WeichenXu123/spark/pull/8 It's better to push this in since the codes are already cleaned up. ## How was this patch tested? Unittests fixed Closes #25003 from HyukjinKwon/SPARK-27534. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../binaryfile/BinaryFileFormatSuite.scala | 88 +++--- 1 file changed, 43 insertions(+), 45 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala index 9e2969b..a66b34f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala @@ -290,56 +290,54 @@ class BinaryFileFormatSuite extends QueryTest with SharedSQLContext with SQLTest ), true) } + private def readBinaryFile(file: File, requiredSchema: StructType): Row = { +val format = new BinaryFileFormat +val reader = format.buildReaderWithPartitionValues( + sparkSession = spark, + dataSchema = schema, + partitionSchema = StructType(Nil), + requiredSchema = requiredSchema, + filters = Seq.empty, + options = Map.empty, + hadoopConf = 
spark.sessionState.newHadoopConf() +) +val partitionedFile = mock(classOf[PartitionedFile]) +when(partitionedFile.filePath).thenReturn(file.getPath) +val encoder = RowEncoder(requiredSchema).resolveAndBind() +encoder.fromRow(reader(partitionedFile).next()) + } + test("column pruning") { -def getRequiredSchema(fieldNames: String*): StructType = { - StructType(fieldNames.map { -case f if schema.fieldNames.contains(f) => schema(f) -case other => StructField(other, NullType) - }) -} -def read(file: File, requiredSchema: StructType): Row = { - val format = new BinaryFileFormat - val reader = format.buildReaderWithPartitionValues( -sparkSession = spark, -dataSchema = schema, -partitionSchema = StructType(Nil), -requiredSchema = requiredSchema, -filters = Seq.empty, -options = Map.empty, -hadoopConf = spark.sessionState.newHadoopConf() - ) - val partitionedFile = mock(classOf[PartitionedFile]) - when(partitionedFile.filePath).thenReturn(file.getPath) - val encoder = RowEncoder(requiredSchema).resolveAndBind() - encoder.fromRow(reader(partitionedFile).next()) -} -val file = new File(Utils.createTempDir(), "data") -val content = "123".getBytes -Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE) - -read(file, getRequiredSchema(MODIFICATION_TIME, CONTENT, LENGTH, PATH)) match { - case Row(t, c, len, p) => -assert(t === new Timestamp(file.lastModified())) -assert(c === content) -assert(len === content.length) -assert(p.asInstanceOf[String].endsWith(file.getAbsolutePath)) +withTempPath { file => + val content = "123".getBytes + Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE) + + val actual = readBinaryFile(file, StructType(schema.takeRight(3))) + val expected = Row(new Timestamp(file.lastModified()), content.length, content) + + assert(actual === expected) } -file.setReadable(false) -withClue("cannot read content") { + } + + test("column pruning - non-readable file") { +withTempPath { file => + val 
content = "abc".getBytes + Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE) + file.setReadable(false) + + // If content is selected, it throws an exception because it's not readable. intercept[IOException] { -read(file, getRequiredSchema(CONTENT)) +readBinaryFile(file, StructType(schema(CONTENT) :: Nil)) } -} -assert(read(file, getRequiredSchema(LENGTH)) === Row(content.length), - "
[spark] branch master updated: [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e6a0385 [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause e6a0385 is described below commit e6a0385289f2d2fec05d3fb5f798903de292c381 Author: Liang-Chi Hsieh AuthorDate: Wed Aug 14 00:32:33 2019 +0900 [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause ## What changes were proposed in this pull request? A GROUPED_AGG pandas python udf can't work, if without group by clause, like `select udf(id) from table`. This doesn't match with aggregate function like sum, count..., and also dataset API like `df.agg(udf(df['id']))`. When we parse a udf (or an aggregate function) like that from SQL syntax, it is known as a function in a project. `GlobalAggregates` rule in analysis makes such project as aggregate, by looking for aggregate expressions. At the moment, we should also look for GROUPED_AGG pandas python udf. ## How was this patch tested? Added tests. Closes #25352 from viirya/SPARK-28422. 
Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- .../sql/tests/test_pandas_udf_grouped_agg.py | 15 ++ .../spark/sql/catalyst/analysis/Analyzer.scala | 5 - .../apache/spark/sql/catalyst/plans/PlanTest.scala | 2 ++ .../python/BatchEvalPythonExecSuite.scala | 24 +- 4 files changed, 44 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py b/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py index 041b2b5..6d460df 100644 --- a/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py +++ b/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py @@ -474,6 +474,21 @@ class GroupedAggPandasUDFTests(ReusedSQLTestCase): result = df.groupBy('id').agg(f(df['x']).alias('sum')).collect() self.assertEqual(result, expected) +def test_grouped_without_group_by_clause(self): +@pandas_udf('double', PandasUDFType.GROUPED_AGG) +def max_udf(v): +return v.max() + +df = self.spark.range(0, 100) +self.spark.udf.register('max_udf', max_udf) + +with self.tempView("table"): +df.createTempView('table') + +agg1 = df.agg(max_udf(df['id'])) +agg2 = self.spark.sql("select max_udf(id) from table") +assert_frame_equal(agg1.toPandas(), agg2.toPandas()) + if __name__ == "__main__": from pyspark.sql.tests.test_pandas_udf_grouped_agg import * diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index f8eef0c..5a04d57 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1799,15 +1799,18 @@ class Analyzer( def containsAggregates(exprs: Seq[Expression]): Boolean = { // Collect all Windowed Aggregate Expressions. 
- val windowedAggExprs = exprs.flatMap { expr => + val windowedAggExprs: Set[Expression] = exprs.flatMap { expr => expr.collect { case WindowExpression(ae: AggregateExpression, _) => ae + case WindowExpression(e: PythonUDF, _) if PythonUDF.isGroupedAggPandasUDF(e) => e } }.toSet // Find the first Aggregate Expression that is not Windowed. exprs.exists(_.collectFirst { case ae: AggregateExpression if !windowedAggExprs.contains(ae) => ae +case e: PythonUDF if PythonUDF.isGroupedAggPandasUDF(e) && + !windowedAggExprs.contains(e) => e }.isDefined) } } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala index 6e2a842..08f1f87 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala @@ -81,6 +81,8 @@ trait PlanTestBase extends PredicateHelper with SQLHelper { self: Suite => ae.copy(resultId = ExprId(0)) case lv: NamedLambdaVariable => lv.copy(exprId = ExprId(0), value = null) + case udf: PythonUDF => +udf.copy(resultId = ExprId(0)) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExecSuite.scala index 289cc66..ac5752b 100644 ---
[spark] branch master updated: [SPARK-28556][SQL] QueryExecutionListener should also notify Error
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 196a4d7 [SPARK-28556][SQL] QueryExecutionListener should also notify Error 196a4d7 is described below commit 196a4d7117b122fdb1fc95241613d012e8292734 Author: Shixiong Zhu AuthorDate: Tue Jul 30 11:47:36 2019 +0900 [SPARK-28556][SQL] QueryExecutionListener should also notify Error ## What changes were proposed in this pull request? Right now `Error` is not sent to `QueryExecutionListener.onFailure`. If there is any `Error` (such as `AssertionError`) when running a query, `QueryExecutionListener.onFailure` cannot be triggered. This PR changes `onFailure` to accept a `Throwable` instead. ## How was this patch tested? Jenkins Closes #25292 from zsxwing/fix-QueryExecutionListener. Authored-by: Shixiong Zhu Signed-off-by: HyukjinKwon --- project/MimaExcludes.scala | 6 +- .../apache/spark/sql/execution/SQLExecution.scala | 4 ++-- .../spark/sql/execution/ui/SQLListener.scala | 2 +- .../spark/sql/util/QueryExecutionListener.scala| 4 ++-- .../org/apache/spark/sql/SessionStateSuite.scala | 2 +- .../spark/sql/TestQueryExecutionListener.scala | 2 +- .../test/scala/org/apache/spark/sql/UDFSuite.scala | 2 +- .../sources/v2/FileDataSourceV2FallBackSuite.scala | 6 +++--- .../sql/test/DataFrameReaderWriterSuite.scala | 2 +- .../spark/sql/util/DataFrameCallbackSuite.scala| 24 +++--- .../sql/util/ExecutionListenerManagerSuite.scala | 2 +- 11 files changed, 30 insertions(+), 26 deletions(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 5978f88..51d5861 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -376,7 +376,11 @@ object MimaExcludes { // [SPARK-28199][SS] Remove deprecated ProcessingTime ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.ProcessingTime"), - 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.ProcessingTime$") + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.ProcessingTime$"), + +// [SPARK-28556][SQL] QueryExecutionListener should also notify Error + ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.util.QueryExecutionListener.onFailure"), + ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.util.QueryExecutionListener.onFailure") ) // Exclude rules for 2.4.x diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala index ca66337..6046805 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala @@ -85,7 +85,7 @@ object SQLExecution { }.getOrElse(callSite.shortForm) withSQLConfPropagated(sparkSession) { -var ex: Option[Exception] = None +var ex: Option[Throwable] = None val startTime = System.nanoTime() try { sc.listenerBus.post(SparkListenerSQLExecutionStart( @@ -99,7 +99,7 @@ object SQLExecution { time = System.currentTimeMillis())) body } catch { - case e: Exception => + case e: Throwable => ex = Some(e) throw e } finally { diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala index 67d1f27..81cbc7f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala @@ -60,7 +60,7 @@ case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long) @JsonIgnore private[sql] var qe: QueryExecution = null // The exception object that caused this execution to fail. None if the execution doesn't fail. 
- @JsonIgnore private[sql] var executionFailure: Option[Exception] = None + @JsonIgnore private[sql] var executionFailure: Option[Throwable] = None } /** diff --git a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala index 77ae047..2da8469 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala @@ -58,12 +58,12 @@ trait QueryExecutionListener { * @param funcName the name
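The failure mode this commit fixes can be modelled in Python terms (a sketch with hypothetical names, not Spark code): a listener wired to catch `Exception` is never notified for failures outside that hierarchy, just as Scala's `case e: Exception` misses `Error`. Here `SystemExit`, which subclasses only `BaseException`, plays the role of a JVM `Error`.

```python
notified = []

def run_with_listener(body, catch):
    # `catch` models the type the listener hooks; the fix widens it
    # from the Exception analogue to the root of the hierarchy.
    try:
        body()
    except catch as e:
        notified.append(type(e).__name__)

def failing_query():
    raise SystemExit("assertion-like error")

try:
    run_with_listener(failing_query, Exception)    # listener misses it
except SystemExit:
    pass                                           # error escapes unobserved
run_with_listener(failing_query, BaseException)    # listener sees it
print(notified)  # ['SystemExit']
```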
[spark] branch branch-2.3 updated: [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new e686178 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 e686178 is described below commit e686178ffe7ceb33ee42558ee0b4fe28c417124d Author: WeichenXu AuthorDate: Sat Aug 3 10:31:15 2019 +0900 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit b3394db1930b3c9f55438cb27bb2c584bf041f8e) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index 5ba121f..e1c780f 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1862,6 +1862,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) time.sleep(1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b3394db [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 b3394db is described below commit b3394db1930b3c9f55438cb27bb2c584bf041f8e Author: WeichenXu AuthorDate: Sat Aug 3 10:31:15 2019 +0900 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 ## What changes were proposed in this pull request? This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. ## How was this patch tested? Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu Signed-off-by: HyukjinKwon --- python/pyspark/tests/test_daemon.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/pyspark/tests/test_daemon.py b/python/pyspark/tests/test_daemon.py index 2cdc16c..898fb39 100644 --- a/python/pyspark/tests/test_daemon.py +++ b/python/pyspark/tests/test_daemon.py @@ -47,6 +47,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) time.sleep(1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
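The race the added `time.sleep(1)` papers over can be sketched with threads standing in for the daemon's worker processes (a hypothetical setup, not the pyspark daemon itself): asserting clean shutdown before the spawned worker has exited makes the check flaky.

```python
import threading
import time

workers_alive = []

def daemon_accept():
    # The daemon spawns a worker per accepted connection.
    def worker():
        workers_alive.append(1)
        time.sleep(0.05)       # worker does its job...
        workers_alive.pop()    # ...then exits
    t = threading.Thread(target=worker)
    t.start()
    return t

t = daemon_accept()
# Checking workers_alive immediately here could still see the worker
# (the flaky case); the fix is to pause before the shutdown check.
time.sleep(0.2)
t.join()
print(workers_alive)  # []
```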
[spark] branch branch-2.4 updated: [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 6c61321 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 6c61321 is described below commit 6c613210cd1075f771376e7ecc6dfb08bd620716 Author: WeichenXu AuthorDate: Sat Aug 3 10:31:15 2019 +0900 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit b3394db1930b3c9f55438cb27bb2c584bf041f8e) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index a2d825b..26c9126 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1939,6 +1939,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) time.sleep(1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new dc09a02 [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 dc09a02 is described below commit dc09a02c142d3787e728c8b25eb8417649d98e9f Author: WeichenXu AuthorDate: Fri Aug 2 22:07:06 2019 +0900 [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25315 from WeichenXu123/fix_py37_daemon. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index a2d825b..fef2959 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1916,7 +1916,7 @@ class OutputFormatTests(ReusedPySparkTestCase): class DaemonTests(unittest.TestCase): def connect(self, port): -from socket import socket, AF_INET, SOCK_STREAM +from socket import socket, AF_INET, SOCK_STREAM# request shutdown sock = socket(AF_INET, SOCK_STREAM) sock.connect(('127.0.0.1', port)) # send a split index of -1 to shutdown the worker @@ -1939,9 +1939,12 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. 
+time.sleep(1) + # request shutdown terminator(daemon) -time.sleep(1) +daemon.wait(5) # daemon should no longer accept connections try:
[spark] branch branch-2.4 updated: Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new a065a50 Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7" a065a50 is described below commit a065a503bcf1ec5b7d49c575af7bc6867c734d90 Author: HyukjinKwon AuthorDate: Fri Aug 2 22:12:04 2019 +0900 Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7" This reverts commit dc09a02c142d3787e728c8b25eb8417649d98e9f. --- python/pyspark/tests.py | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index fef2959..a2d825b 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1916,7 +1916,7 @@ class OutputFormatTests(ReusedPySparkTestCase): class DaemonTests(unittest.TestCase): def connect(self, port): -from socket import socket, AF_INET, SOCK_STREAM# request shutdown +from socket import socket, AF_INET, SOCK_STREAM sock = socket(AF_INET, SOCK_STREAM) sock.connect(('127.0.0.1', port)) # send a split index of -1 to shutdown the worker @@ -1939,12 +1939,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) -# wait worker process spawned from daemon exit. -time.sleep(1) - # request shutdown terminator(daemon) -daemon.wait(5) +time.sleep(1) # daemon should no longer accept connections try: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 20e46ef [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 20e46ef is described below commit 20e46ef6e3e49e754062717c2cb249c6eb99e86a Author: WeichenXu AuthorDate: Fri Aug 2 22:07:06 2019 +0900 [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25315 from WeichenXu123/fix_py37_daemon. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index a2d825b..fc0ed41 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1939,9 +1939,12 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) -time.sleep(1) +daemon.wait(5) # daemon should no longer accept connections try: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated: [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 21ef0fd  [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7

21ef0fd is described below

commit 21ef0fdf7244a169ac0c3e701cb21c35f1038a5d
Author: WeichenXu
AuthorDate: Fri Aug 2 22:07:06 2019 +0900

    [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7

    Fix the flaky test DaemonTests.do_termination_test, which fails on Python 3.7, by adding a sleep after the test connects to the daemon.

    Run the test:
    ```
    python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests"
    ```

    **Before**
    Fails on test "test_termination_sigterm", and we can see the daemon process does not exit.

    **After**
    Test passed.

    Closes #25315 from WeichenXu123/fix_py37_daemon.

    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
    (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d)
    Signed-off-by: HyukjinKwon
---
 python/pyspark/tests.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 5ba121f..5fc6887 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1862,9 +1862,12 @@ class DaemonTests(unittest.TestCase):
         # daemon should accept connections
         self.assertTrue(self.connect(port))

+        # wait worker process spawned from daemon exit.
+        time.sleep(1)
+
         # request shutdown
         terminator(daemon)
-        time.sleep(1)
+        daemon.wait(5)

         # daemon should no longer accept connections
         try:
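The fix in the commits above replaces a fixed `time.sleep(1)` after termination with a bounded `daemon.wait(5)`, so the assertion only runs once the daemon has really exited. A minimal stand-alone sketch of that shutdown pattern (using `subprocess` and a dummy child process, not the actual PySpark daemon) looks like this:

```python
import subprocess
import sys
import time

def shutdown_daemon(proc, wait_timeout=5):
    """Ask a daemon-style process to stop, then block until it exits.

    Mirrors the fixed test flow: instead of sleeping a fixed second after
    sending the termination signal, wait on the process with a timeout
    (like daemon.wait(5)) so later assertions see a fully stopped daemon.
    """
    proc.terminate()  # request shutdown (SIGTERM)
    try:
        proc.wait(timeout=wait_timeout)  # bounded wait instead of sleep(1)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort if the daemon ignores SIGTERM
        proc.wait()
    return proc.returncode

# Usage: a stand-in "daemon" that would otherwise run for a minute.
daemon = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.2)  # give the child a moment to start (the patch's sleep(1) idea)
rc = shutdown_daemon(daemon)
print(rc is not None)  # True: the process has a recorded exit status
```

The `sleep` before termination corresponds to the patch's "wait worker process spawned from daemon exit" step; the timed `wait` afterwards is what removes the race on Python 3.7.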
[spark] branch master updated (babdba0 -> ef14237)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from babdba0  [SPARK-28728][BUILD] Bump Jackson Databind to 2.9.9.3
  add ef14237  [SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11

No new revisions were added by this update.

Summary of changes:
 python/pyspark/ml/tests/test_algorithms.py | 2 +-
 python/pyspark/mllib/clustering.py         | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)
[spark] branch master updated (31ef268 -> 37eedf6)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 31ef268  [SPARK-28639][CORE][DOC] Configuration doc for Barrier Execution Mode
  add 37eedf6  [SPARK-28652][TESTS][K8S] Add python version check for executor

No new revisions were added by this update.

Summary of changes:
 .../spark/deploy/k8s/integrationtest/PythonTestsSuite.scala     | 6 ++++--
 resource-managers/kubernetes/integration-tests/tests/pyfiles.py | 8 ++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)
[spark] branch master updated (3a4afce -> 3ec24fd)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 3a4afce  [SPARK-28687][SQL] Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`
  add 3ec24fd  [SPARK-28203][CORE][PYTHON] PythonRDD should respect SparkContext's hadoop configuration

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/api/python/PythonHadoopUtil.scala |  2 +-
 .../org/apache/spark/api/python/PythonRDD.scala    |  6 +-
 .../apache/spark/api/python/PythonRDDSuite.scala   | 87 +-
 3 files changed, 88 insertions(+), 7 deletions(-)
[spark] branch master updated (f0834d3 -> 4ddad79)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from f0834d3  Revert "[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server"
  add 4ddad79  [SPARK-28598][SQL] Few date time manipulation functions does not provide versions supporting Column as input through the Dataframe API

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala    | 46 --
 .../org/apache/spark/sql/DateFunctionsSuite.scala | 11 ++
 2 files changed, 53 insertions(+), 4 deletions(-)
[spark] branch master updated (4ddad79 -> c96b615)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 4ddad79  [SPARK-28598][SQL] Few date time manipulation functions does not provide versions supporting Column as input through the Dataframe API
  add c96b615  [SPARK-28390][SQL][PYTHON][TESTS][FOLLOW-UP] Update the TODO with actual blocking JIRA IDs

No new revisions were added by this update.

Summary of changes:
 .../src/test/resources/sql-tests/inputs/udf/pgSQL/udf-select_having.sql | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28578][INFRA] Improve Github pull request template
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0ea8db9  [SPARK-28578][INFRA] Improve Github pull request template

0ea8db9 is described below

commit 0ea8db9fd3d882140d8fa305dd69fc94db62cf8f
Author: HyukjinKwon
AuthorDate: Fri Aug 16 09:45:15 2019 +0900

    [SPARK-28578][INFRA] Improve Github pull request template

    ### What changes were proposed in this pull request?

    This PR proposes to improve the Github template for better and faster review iterations and better interactions between PR authors and reviewers.

    As suggested in the [dev mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-New-sections-in-Github-Pull-Request-description-template-td27527.html), this PR referred to [Kubernetes' PR template](https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md). Therefore, these fields are newly added:

    ```
    ### Why are the changes needed?

    ### Does this PR introduce any user-facing change?
    ```

    and some comments were added.

    ### Why are the changes needed?

    Currently, many PR descriptions are poorly formatted, which causes overhead between PR authors and reviewers. Poorly formatted PR descriptions cause multiple problems:

    - Some PRs still have a single-line description for 500+ changed lines of code in a critical path.
    - Some PRs do not describe behaviour changes, and reviewers need to find and document them.
    - Some PRs are hard to review without an outline, but an outline is sometimes not given.
    - Spark is getting old, and sometimes we need to track the history deeply. Due to a poorly formatted PR description, finding the root cause of a bug can require reading whole diffs across whole commit histories.
    - Reviews take a while, but the number of PRs still grows.

    This PR targets to alleviate the problems and the situation.

    ### Does this PR introduce any user-facing change?

    Yes, it changes the PR template shown when PRs are opened. This PR uses the template it proposes.

    ### How was this patch tested?

    Manually tested via the Github preview feature.

    Closes #25310 from HyukjinKwon/SPARK-28578.

    Lead-authored-by: HyukjinKwon
    Co-authored-by: Hyukjin Kwon
    Signed-off-by: HyukjinKwon
---
 .github/PULL_REQUEST_TEMPLATE | 44 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/.github/PULL_REQUEST_TEMPLATE b/.github/PULL_REQUEST_TEMPLATE
index e7ed23d..be57f00 100644
--- a/.github/PULL_REQUEST_TEMPLATE
+++ b/.github/PULL_REQUEST_TEMPLATE
@@ -1,10 +1,42 @@
-## What changes were proposed in this pull request?
+
-(Please fill in changes proposed in this fix)
+### What changes were proposed in this pull request?
+
-## How was this patch tested?
-(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
-(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
+### Why are the changes needed?
+
-Please review https://spark.apache.org/contributing.html before opening a pull request.
+
+### Does this PR introduce any user-facing change?
+
+
+### How was this patch tested?
+
[spark] branch master updated (25857c6 -> ec84415)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 25857c6  [SPARK-28647][WEBUI] Recover additional metric feature and remove additional-metrics.js
  add ec84415  [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'

No new revisions were added by this update.

Summary of changes:
 .../sql-tests/inputs/udf/udf-group-by.sql      |  28 +--
 .../sql-tests/results/udf/udf-group-by.sql.out | 273 +++--
 2 files changed, 154 insertions(+), 147 deletions(-)
[spark] branch master updated: Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1de4a22  Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"

1de4a22 is described below

commit 1de4a22c52779bbdf68e40167a91e8606225f6b7
Author: HyukjinKwon
AuthorDate: Mon Aug 19 20:31:39 2019 +0900

    Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"

    This reverts commit 1819a6f22eee5314197aab4c169c74bd6ff6c17c.
---
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index 3b03833..140d19e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2280,7 +2280,7 @@
         <groupId>net.alchim31.maven</groupId>
         <artifactId>scala-maven-plugin</artifactId>
-        <version>4.1.1</version>
+        <version>3.4.4</version>
         <executions>
           <execution>
             <id>eclipse-add-source</id>
[spark] branch master updated: [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2fd83c2  [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions

2fd83c2 is described below

commit 2fd83c28203ef9c300a3feaaecc8edb5546814cf
Author: HyukjinKwon
AuthorDate: Mon Aug 19 20:15:17 2019 +0900

    [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions

    ### What changes were proposed in this pull request?

    This PR proposes to set both a minimum and a maximum Java version specification (see https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-portable-packages). There seems to be no standard way to specify both, given the documentation and other packages (see https://gist.github.com/glin/bd36cf1eb0c7f8b1f511e70e2fb20f8d). I found two ways in existing packages on CRAN:

    ```
    Package (<= 1 & > 2)
    Package (<= 1, > 2)
    ```

    The latter seems closer to other standard notations such as `R (>= 2.14.0), R (>= r56550)`, so I have chosen the latter way.

    ### Why are the changes needed?

    The package might otherwise be rejected by CRAN. See https://github.com/apache/spark/pull/25472#issuecomment-522405742

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    JDK 8

    ```bash
    ./build/mvn -DskipTests -Psparkr clean package
    ./R/run-tests.sh
    ...
    basic tests for CRAN: .
    ...
    ```

    JDK 11

    ```bash
    ./build/mvn -DskipTests -Psparkr -Phadoop-3.2 clean package
    ./R/run-tests.sh
    ...
    basic tests for CRAN: .
    ...
    ```

    Closes #25490 from HyukjinKwon/SPARK-28756.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 R/pkg/DESCRIPTION |  2 +-
 R/pkg/R/client.R  | 13 ++++++++-----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 6a83e00..f478086 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -13,7 +13,7 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
 License: Apache License (== 2.0)
 URL: https://www.apache.org/ https://spark.apache.org/
 BugReports: https://spark.apache.org/contributing.html
-SystemRequirements: Java (>= 8)
+SystemRequirements: Java (>= 8, < 12)
 Depends:
     R (>= 3.1),
     methods
diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R
index 3299346..2ff68ab 100644
--- a/R/pkg/R/client.R
+++ b/R/pkg/R/client.R
@@ -64,7 +64,9 @@ checkJavaVersion <- function() {
   javaBin <- "java"
   javaHome <- Sys.getenv("JAVA_HOME")
   javaReqs <- utils::packageDescription(utils::packageName(), fields = c("SystemRequirements"))
-  sparkJavaVersion <- as.numeric(tail(strsplit(javaReqs, "[(=)]")[[1]], n = 1L))
+  sparkJavaVersions <- strsplit(javaReqs, "[(,)]")[[1]]
+  minJavaVersion <- as.numeric(strsplit(sparkJavaVersions[[2]], ">= ")[[1]][[2]])
+  maxJavaVersion <- as.numeric(strsplit(sparkJavaVersions[[3]], "< ")[[1]][[2]])
   if (javaHome != "") {
     javaBin <- file.path(javaHome, "bin", javaBin)
   }
@@ -99,10 +101,11 @@ checkJavaVersion <- function() {
   } else {
     javaVersionNum <- as.integer(versions[1])
   }
-  if (javaVersionNum < sparkJavaVersion) {
-    stop(paste("Java version", sparkJavaVersion,
-               ", or greater, is required for this package; found version:",
-               javaVersionStr))
+  if (javaVersionNum < minJavaVersion || javaVersionNum >= maxJavaVersion) {
+    stop(paste0("Java version, greater than or equal to ", minJavaVersion,
+                " and less than ", maxJavaVersion,
+                ", is required for this package; found version: ",
+                javaVersionStr))
   }
   return(javaVersionNum)
 }
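The patched `client.R` splits the `SystemRequirements` field `Java (>= 8, < 12)` on `[(,)]` and peels out the two bounds, then checks the detected Java version against the half-open range. The same parsing logic can be sketched in Python (this is an illustrative port, not the R code that ships in the package):

```python
import re

def parse_java_requirement(req):
    """Parse a DESCRIPTION-style requirement such as 'Java (>= 8, < 12)'.

    Returns (min_version, max_version) as floats, mirroring what the
    patched checkJavaVersion() extracts from the SystemRequirements field.
    """
    m = re.search(r"\(>=\s*([\d.]+)\s*,\s*<\s*([\d.]+)\)", req)
    if not m:
        raise ValueError("unsupported requirement string: %r" % req)
    return float(m.group(1)), float(m.group(2))

def java_version_ok(version, req="Java (>= 8, < 12)"):
    """Half-open range check: min <= version < max, as in client.R."""
    lo, hi = parse_java_requirement(req)
    return lo <= version < hi

print(parse_java_requirement("Java (>= 8, < 12)"))  # (8.0, 12.0)
print(java_version_ok(11), java_version_ok(12))     # True False
```

Note the asymmetry the commit introduces: the minimum is inclusive (`>=`) while the maximum is exclusive (`<`), so Java 12 is rejected while Java 8 and 11 pass.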
[spark] branch master updated (97dc4c0 -> ec14b6e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 97dc4c0  [SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession
  add ec14b6e  [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

No new revisions were added by this update.

Summary of changes:
 .../{pgSQL/join.sql => udf/pgSQL/udf-join.sql}         | 450
 .../{pgSQL/join.sql.out => udf/pgSQL/udf-join.sql.out} | 578 ++---
 2 files changed, 515 insertions(+), 513 deletions(-)
 copy sql/core/src/test/resources/sql-tests/inputs/{pgSQL/join.sql => udf/pgSQL/udf-join.sql} (80%)
 copy sql/core/src/test/resources/sql-tests/results/{pgSQL/join.sql.out => udf/pgSQL/udf-join.sql.out} (71%)
[spark] branch master updated (7701d29 -> ab1819d)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 7701d29  [SPARK-28877][PYSPARK][test-hadoop3.2][test-java11] Make jaxb-runtime compile-time dependency
  add ab1819d  [SPARK-28527][SQL][TEST][FOLLOW-UP] Ignores Thrift server ThriftServerQueryTestSuite

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated (02a0cde -> 4b16cf1)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 02a0cde  [SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
  add 4b16cf1  [SPARK-27988][SQL][TEST] Port AGGREGATES.sql [Part 3]

No new revisions were added by this update.

Summary of changes:
 .../sql-tests/inputs/pgSQL/aggregates_part3.sql  | 270 +
 .../results/pgSQL/aggregates_part3.sql.out       |  22 ++
 2 files changed, 292 insertions(+)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part3.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
[spark] branch master updated (00cb2f9 -> c02c86e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 00cb2f9  [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize
  add c02c86e  [SPARK-28691][EXAMPLES] Add Java/Scala DirectKerberizedKafkaWordCount examples

No new revisions were added by this update.

Summary of changes:
 ...ava => JavaDirectKerberizedKafkaWordCount.java} | 66 ++
 ...scala => DirectKerberizedKafkaWordCount.scala}  | 54 +++---
 2 files changed, 101 insertions(+), 19 deletions(-)
 copy examples/src/main/java/org/apache/spark/examples/streaming/{JavaDirectKafkaWordCount.java => JavaDirectKerberizedKafkaWordCount.java} (54%)
 copy examples/src/main/scala/org/apache/spark/examples/streaming/{DirectKafkaWordCount.scala => DirectKerberizedKafkaWordCount.scala} (54%)
[spark] branch master updated (90b10b4 -> 8848af2)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 90b10b4  [HOT-FIX] fix compilation
  add 8848af2  [SPARK-28881][PYTHON][TESTS][FOLLOW-UP] Use SparkSession(SparkContext(...)) to prevent for Spark conf to affect other tests

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_arrow.py | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)
svn commit: r35482 - /dev/spark/v2.4.4-rc3-bin/ /release/spark/spark-2.4.4/
Author: gurwls223
Date: Sat Aug 31 05:15:37 2019
New Revision: 35482

Log: (empty)

Added:
    release/spark/spark-2.4.4/
      - copied from r35481, dev/spark/v2.4.4-rc3-bin/
Removed:
    dev/spark/v2.4.4-rc3-bin/
[spark] branch master updated (d5688dc -> 5cf2602)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from d5688dc  [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
  add 5cf2602  [SPARK-28946][R][DOCS] Add some more information about building SparkR on Windows

No new revisions were added by this update.

Summary of changes:
 R/WINDOWS.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)
[spark] branch master updated: [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new df39855  [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases

df39855 is described below

commit df39855db826fd4bead85a2ca01eda15c101bbbe
Author: Sean Owen
AuthorDate: Wed Sep 4 13:11:09 2019 +0900

    [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases

    ### What changes were proposed in this pull request?

    Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.

    ### Why are the changes needed?

    If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Manually tested different paths and failures by commenting parts of the script in and out and modifying it directly.

    Closes #25667 from srowen/SPARK-28963.

    Authored-by: Sean Owen
    Signed-off-by: HyukjinKwon
---
 build/mvn | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/build/mvn b/build/mvn
index 75feb2f..f68377b 100755
--- a/build/mvn
+++ b/build/mvn
@@ -80,6 +80,15 @@ install_mvn() {
   fi
   if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
     local APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download='}
+
+    if [ $(command -v curl) ]; then
+      local TEST_MIRROR_URL="${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz"
+      if ! curl -L --output /dev/null --silent --head --fail "$TEST_MIRROR_URL" ; then
+        # Fall back to archive.apache.org for older Maven
+        echo "Falling back to archive.apache.org to download Maven"
+        APACHE_MIRROR="https://archive.apache.org/dist"
+      fi
+    fi

     install_app \
       "${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries" \
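The `build/mvn` change probes the primary mirror with a `curl --head --fail` request and switches the base URL to archive.apache.org if the probe fails. The selection logic, abstracted away from `curl`, can be sketched like this (the mirror URLs are taken from the commit; the `reachable` callback is a stand-in for the HTTP HEAD probe):

```python
def choose_mirror(url_path, mirrors, reachable):
    """Return the download URL from the first mirror that serves the file.

    A sketch of the build/mvn fallback: try the mirrors in order and, if
    none answers the probe, still fall back to the last entry (the
    archive), matching the script's behaviour of using curl only as a
    best-effort check.
    """
    for base in mirrors:
        url = base.rstrip("/") + "/" + url_path.lstrip("/")
        if reachable(url):
            return url
    # No mirror answered the probe: use the archive anyway.
    return mirrors[-1].rstrip("/") + "/" + url_path.lstrip("/")

mirrors = [
    "https://www.apache.org/dyn/closer.lua?action=download=",  # primary, as in the script
    "https://archive.apache.org/dist",                         # archive fallback
]
path = "maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz"

# Simulate the primary mirror having dropped the old release:
picked = choose_mirror(path, mirrors, lambda u: u.startswith("https://archive"))
print(picked.startswith("https://archive.apache.org/dist"))  # True
```

In the real script the probe only runs when `curl` is on the `PATH`; without `curl`, the primary mirror is used unprobed, which is why the fallback is best-effort rather than guaranteed.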
[spark] branch master updated (4c8f114 -> a838699)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 4c8f114  [SPARK-27489][WEBUI] UI updates to show executor resource information
  add a838699  [SPARK-28694][EXAMPLES] Add Java/Scala StructuredKerberizedKafkaWordCount examples

No new revisions were added by this update.

Summary of changes:
 ...=> JavaStructuredKerberizedKafkaWordCount.java} | 57 ++
 ...la => StructuredKerberizedKafkaWordCount.scala} | 56 +
 2 files changed, 94 insertions(+), 19 deletions(-)
 copy examples/src/main/java/org/apache/spark/examples/sql/streaming/{JavaStructuredKafkaWordCount.java => JavaStructuredKerberizedKafkaWordCount.java} (53%)
 copy examples/src/main/scala/org/apache/spark/examples/sql/streaming/{StructuredKafkaWordCount.scala => StructuredKerberizedKafkaWordCount.scala} (56%)
[spark] branch master updated (f8f7c52 -> 31b59bd)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from f8f7c52  [SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core
  add 31b59bd  [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python if not set

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/api/python/PythonRunner.scala | 7 +++++++
 core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala | 9 +++++++++
 2 files changed, 16 insertions(+)
[spark] branch master updated (98e1a4c -> 1fd7f29)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 98e1a4c  [SPARK-28319][SQL] Implement SHOW TABLES for Data Source V2 Tables
  add 1fd7f29  [SPARK-28857][INFRA] Clean up the comments of PR template during merging

No new revisions were added by this update.

Summary of changes:
 dev/merge_spark_pr.py | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
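SPARK-28857 makes the merge script strip the PR template's comment placeholders out of the description before it becomes a commit message. The digest does not include the script's diff, so the following is only an illustrative sketch of that kind of cleanup, with a hypothetical `clean_pr_body` helper and made-up sample body:

```python
import re

def clean_pr_body(body):
    """Drop the HTML comment placeholders a PR template leaves behind.

    Illustrative only: the exact regex and helper name used by
    dev/merge_spark_pr.py are not shown in this digest.
    """
    without_comments = re.sub(r"<!--.*?-->", "", body, flags=re.DOTALL)
    # Collapse runs of blank lines left where comments were removed.
    collapsed = re.sub(r"\n{3,}", "\n\n", without_comments)
    return collapsed.strip()

body = """### What changes were proposed in this pull request?
<!-- Please clarify what changes you are proposing. -->
Adds a new config.

### How was this patch tested?
<!-- If tests were added, say they were added here. -->
Unit tests."""

print(clean_pr_body(body))
```

The non-greedy `.*?` with `re.DOTALL` matters here: template comments span multiple lines, and a greedy match would swallow the real answers between two comments.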
[spark] branch master updated (9f8c7a2 -> 13fd32c)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 9f8c7a2  [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming
  add 13fd32c  [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11 support for pull request builds

No new revisions were added by this update.

Summary of changes:
 dev/run-tests-jenkins.py                                          | 2 +-
 dev/run-tests.py                                                  | 7 +++++++
 .../kubernetes/integration-tests/dev/dev-run-integration-tests.sh | 6 ++++++
 3 files changed, 14 insertions(+), 1 deletion(-)
[spark] branch master updated (c61270f -> 6e12b58)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from c61270f  [SPARK-27395][SQL] Improve EXPLAIN command
  add 6e12b58  [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server

No new revisions were added by this update.

Summary of changes:
 project/SparkBuild.scala                           |   3 +-
 .../org/apache/spark/sql/SQLQueryTestSuite.scala   |  86 ++---
 sql/hive-thriftserver/pom.xml                      |   7 +
 .../sql/hive/thriftserver/HiveThriftServer2.scala  |   3 +-
 .../thriftserver/ThriftServerQueryTestSuite.scala  | 358 +
 5 files changed, 416 insertions(+), 41 deletions(-)
 create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala
[spark] branch master updated (19f882c -> bd3915e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 19f882c  [SPARK-28933][ML] Reduce unnecessary shuffle in ALS when initializing factors
  add bd3915e  Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/PartitionTransforms.scala |  77
 .../spark/sql/catalyst/analysis/Analyzer.scala     |   6 +-
 .../plans/logical/basicLogicalOperators.scala      |  47 +-
 .../datasources/v2/DataSourceV2Implicits.scala     |   9 -
 .../apache/spark/sql/connector/InMemoryTable.scala |   5 +-
 .../org/apache/spark/sql/DataFrameWriter.scala     |  11 +-
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 365
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  28 --
 .../datasources/v2/DataSourceV2Strategy.scala      |  20 +-
 .../datasources/v2/V2WriteSupportCheck.scala       |   6 +-
 .../scala/org/apache/spark/sql/functions.scala     |  64 ---
 .../sql/sources/v2/DataFrameWriterV2Suite.scala    | 508
 12 files changed, 36 insertions(+), 1110 deletions(-)
 delete mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/PartitionTransforms.scala
 delete mode 100644 sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala
 delete mode 100644 sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataFrameWriterV2Suite.scala
[spark] branch master updated (bd3915e -> e1946a5)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from bd3915e  Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"
  add e1946a5  [SPARK-28705][SQL][TEST] Drop tables after being used in AnalysisExternalCatalogSuite

No new revisions were added by this update.

Summary of changes:
 .../analysis/AnalysisExternalCatalogSuite.scala | 48 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 21 deletions(-)
[spark] branch master updated: [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 00cb2f9  [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize

00cb2f9 is described below

commit 00cb2f99ccbd7c0fdba19ba63c4ec73ca97dea66
Author: HyukjinKwon
AuthorDate: Tue Aug 27 17:30:06 2019 +0900

    [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize

    ### What changes were proposed in this pull request?

    This PR proposes to add a test case for:

    ```bash
    ./bin/pyspark --conf spark.driver.maxResultSize=1m
    ```

    ```python
    spark.conf.set("spark.sql.execution.arrow.enabled", True)
    spark.range(1000).toPandas()
    ```

    ```
    Empty DataFrame
    Columns: [id]
    Index: []
    ```

    which can result in partial results (see https://github.com/apache/spark/pull/25593#issuecomment-525153808). This regression was found between Spark 2.3 and Spark 2.4, and accidentally fixed.

    ### Why are the changes needed?

    To prevent the same regression in the future.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Test was added.

    Closes #25594 from HyukjinKwon/SPARK-28881.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 python/pyspark/sql/tests/test_arrow.py | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/test_arrow.py b/python/pyspark/sql/tests/test_arrow.py
index f533083..50c82b0 100644
--- a/python/pyspark/sql/tests/test_arrow.py
+++ b/python/pyspark/sql/tests/test_arrow.py
@@ -22,7 +22,7 @@ import time
 import unittest
 import warnings

-from pyspark.sql import Row
+from pyspark.sql import Row, SparkSession
 from pyspark.sql.functions import udf
 from pyspark.sql.types import *
 from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, have_pyarrow, \
@@ -421,6 +421,35 @@ class ArrowTests(ReusedSQLTestCase):
                 run_test(*case)


+@unittest.skipIf(
+    not have_pandas or not have_pyarrow,
+    pandas_requirement_message or pyarrow_requirement_message)
+class MaxResultArrowTests(unittest.TestCase):
+    # These tests are separate as 'spark.driver.maxResultSize' configuration
+    # is a static configuration to Spark context.
+
+    @classmethod
+    def setUpClass(cls):
+        cls.spark = SparkSession.builder \
+            .master("local[4]") \
+            .appName(cls.__name__) \
+            .config("spark.driver.maxResultSize", "10k") \
+            .getOrCreate()
+
+        # Explicitly enable Arrow and disable fallback.
+        cls.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
+        cls.spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
+
+    @classmethod
+    def tearDownClass(cls):
+        if hasattr(cls, "spark"):
+            cls.spark.stop()
+
+    def test_exception_by_max_results(self):
+        with self.assertRaisesRegexp(Exception, "is bigger than"):
+            self.spark.range(0, 1, 1, 100).toPandas()
+
+
 class EncryptionArrowTests(ArrowTests):

     @classmethod
[spark] branch master updated (7452786 -> 137b20b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7452786 [SPARK-28789][DOCS][SQL] Document ALTER DATABASE command add 137b20b [SPARK-28818][SQL] Respect source column nullability in the arrays created by `freqItems()` No new revisions were added by this update. Summary of changes: .../spark/sql/execution/stat/FrequentItems.scala | 19 .../org/apache/spark/sql/DataFrameStatSuite.scala | 26 +- 2 files changed, 35 insertions(+), 10 deletions(-)
[spark] branch master updated (c18f849 -> 7ce0f2b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c18f849 [SPARK-24663][STREAMING][TESTS] StreamingContextSuite: Wait until slow receiver has been initialized, but with hard timeout add 7ce0f2b [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_serde.py | 4 python/pyspark/sql/types.py| 2 +- 2 files changed, 5 insertions(+), 1 deletion(-)
[spark] branch master updated (8f632d7 -> 850833f)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8f632d7 [MINOR][DOCS] Fix few typos in the java docs add 850833f [SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/internal/SQLConf.scala | 3 ++- .../org/apache/spark/sql/internal/SQLConfSuite.scala | 19 +++ 2 files changed, 21 insertions(+), 1 deletion(-)
[spark] branch master updated: [MINOR][DOCS] Fix few typos in the java docs
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8f632d7 [MINOR][DOCS] Fix few typos in the java docs 8f632d7 is described below commit 8f632d70455156010f0e87288541304ad2164a52 Author: dengziming AuthorDate: Thu Sep 12 09:30:03 2019 +0900 [MINOR][DOCS] Fix few typos in the java docs JIRA :https://issues.apache.org/jira/browse/SPARK-29050 'a hdfs' change into 'an hdfs' 'an unique' change into 'a unique' 'an url' change into 'a url' 'a error' change into 'an error' Closes #25756 from dengziming/feature_fix_typos. Authored-by: dengziming Signed-off-by: HyukjinKwon --- R/pkg/R/context.R | 4 ++-- core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala | 2 +- core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala | 2 +- core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala| 2 +- docs/spark-standalone.md | 2 +- .../org/apache/spark/streaming/kinesis/KinesisCheckpointer.scala | 2 +- python/pyspark/context.py | 2 +- .../sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala | 4 ++-- .../scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala | 4 ++-- sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala | 2 +- .../src/test/resources/ql/src/test/queries/clientpositive/load_fs2.q | 2 +- .../main/scala/org/apache/spark/streaming/dstream/InputDStream.scala | 4 ++-- 13 files changed, 17 insertions(+), 17 deletions(-) diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 51ae2d2..93ba130 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -301,7 +301,7 @@ broadcastRDD <- function(sc, object) { #' Set the checkpoint directory #' #' Set the directory under which RDDs are going to be checkpointed. 
The -#' directory must be a HDFS path if running on a cluster. +#' directory must be an HDFS path if running on a cluster. #' #' @param sc Spark Context to use #' @param dirName Directory path @@ -446,7 +446,7 @@ setLogLevel <- function(level) { #' Set checkpoint directory #' #' Set the directory under which SparkDataFrame are going to be checkpointed. The directory must be -#' a HDFS path if running on a cluster. +#' an HDFS path if running on a cluster. #' #' @rdname setCheckpointDir #' @param directory Directory path to checkpoint to diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala index 330c2f6..3485128 100644 --- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala +++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala @@ -609,7 +609,7 @@ class JavaSparkContext(val sc: SparkContext) extends Closeable { /** * Set the directory under which RDDs are going to be checkpointed. The directory must - * be a HDFS path if running on a cluster. + * be an HDFS path if running on a cluster. */ def setCheckpointDir(dir: String) { sc.setCheckpointDir(dir) diff --git a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala index c96640a..b552444 100644 --- a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala +++ b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala @@ -124,7 +124,7 @@ private[spark] class MetricsSystem private ( * If either ID is not available, this defaults to just using . * * @param source Metric source to be named by this method. - * @return An unique metric name for each combination of + * @return A unique metric name for each combination of * application, executor/driver and metric source. 
*/ private[spark] def buildRegistryName(source: Source): String = { diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala index d188bdd..49e32d0 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala @@ -27,7 +27,7 @@ import org.apache.spark.util.Utils /** * :: DeveloperApi :: - * This class represent an unique identifier for a BlockManager. + * This class represent a unique identifier for a BlockManager. * * The first 2 constructors of this class are made private to ensure that BlockManagerId objects * can be created only using the apply method in the companion object. This allows de-duplication diff --git a/core/src/test/sca
[spark] branch branch-2.4 updated: [MINOR][DOCS] Fix few typos in the java docs
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new ecb2052 [MINOR][DOCS] Fix few typos in the java docs ecb2052 is described below commit ecb2052bf0cf7dea749cca10d864f7383eeb1224 Author: dengziming AuthorDate: Thu Sep 12 09:30:03 2019 +0900 [MINOR][DOCS] Fix few typos in the java docs JIRA :https://issues.apache.org/jira/browse/SPARK-29050 'a hdfs' change into 'an hdfs' 'an unique' change into 'a unique' 'an url' change into 'a url' 'a error' change into 'an error' Closes #25756 from dengziming/feature_fix_typos. Authored-by: dengziming Signed-off-by: HyukjinKwon (cherry picked from commit 8f632d70455156010f0e87288541304ad2164a52) Signed-off-by: HyukjinKwon --- R/pkg/R/context.R | 4 ++-- core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala | 2 +- core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala | 2 +- core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala| 2 +- docs/spark-standalone.md | 2 +- .../org/apache/spark/streaming/kinesis/KinesisCheckpointer.scala | 2 +- python/pyspark/context.py | 2 +- .../sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala | 4 ++-- .../scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala | 4 ++-- sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala | 2 +- .../src/test/resources/ql/src/test/queries/clientpositive/load_fs2.q | 2 +- .../main/scala/org/apache/spark/streaming/dstream/InputDStream.scala | 4 ++-- 13 files changed, 17 insertions(+), 17 deletions(-) diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index b49f7c3..f1a6b84 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -297,7 +297,7 @@ broadcastRDD <- function(sc, object) { #' Set the checkpoint directory #' #' Set 
the directory under which RDDs are going to be checkpointed. The -#' directory must be a HDFS path if running on a cluster. +#' directory must be an HDFS path if running on a cluster. #' #' @param sc Spark Context to use #' @param dirName Directory path @@ -442,7 +442,7 @@ setLogLevel <- function(level) { #' Set checkpoint directory #' #' Set the directory under which SparkDataFrame are going to be checkpointed. The directory must be -#' a HDFS path if running on a cluster. +#' an HDFS path if running on a cluster. #' #' @rdname setCheckpointDir #' @param directory Directory path to checkpoint to diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala index 09c8384..09e9910 100644 --- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala +++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala @@ -713,7 +713,7 @@ class JavaSparkContext(val sc: SparkContext) /** * Set the directory under which RDDs are going to be checkpointed. The directory must - * be a HDFS path if running on a cluster. + * be an HDFS path if running on a cluster. */ def setCheckpointDir(dir: String) { sc.setCheckpointDir(dir) diff --git a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala index 3457a26..657d75c 100644 --- a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala +++ b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala @@ -122,7 +122,7 @@ private[spark] class MetricsSystem private ( * If either ID is not available, this defaults to just using . * * @param source Metric source to be named by this method. - * @return An unique metric name for each combination of + * @return A unique metric name for each combination of * application, executor/driver and metric source. 
*/ private[spark] def buildRegistryName(source: Source): String = { diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala index d4a59c3..83cd4f0 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala @@ -27,7 +27,7 @@ import org.apache.spark.util.Utils /** * :: DeveloperApi :: - * This class represent an unique identifier for a BlockManager. + * This class represent a unique identifier for a BlockManager. * * The first 2 constructors of this class are made private to ensure that BlockManagerId objects * can be created only using the apply
[spark] branch master updated (7ce0f2b -> eec728a)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7ce0f2b [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type add eec728a [SPARK-29057][SQL] remove InsertIntoTable No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 4 +-- .../sql/catalyst/analysis/CheckAnalysis.scala | 9 +++--- .../plans/logical/basicLogicalOperators.scala | 36 -- .../org/apache/spark/sql/DataFrameWriter.scala | 9 +++--- .../execution/datasources/DataSourceStrategy.scala | 11 --- .../datasources/FallBackFileSourceV2.scala | 7 +++-- .../spark/sql/execution/datasources/rules.scala| 25 +++ .../spark/sql/util/DataFrameCallbackSuite.scala| 7 +++-- .../org/apache/spark/sql/hive/HiveStrategies.scala | 15 - .../org/apache/spark/sql/hive/InsertSuite.scala| 1 - 10 files changed, 46 insertions(+), 78 deletions(-)
[spark] branch branch-2.4 updated (0a4b356 -> 483dcf5)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git. from 0a4b356 Revert "[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles()" add 483dcf5 [SPARK-28912][BRANCH-2.4] Fixed MatchError in getCheckpointFiles() No new revisions were added by this update. Summary of changes: .../org/apache/spark/streaming/Checkpoint.scala | 4 ++-- .../org/apache/spark/streaming/CheckpointSuite.scala | 20 2 files changed, 22 insertions(+), 2 deletions(-)
[spark] branch master updated (54d3f6e -> fa75db2)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 54d3f6e [SPARK-28982][SQL] Implementation Spark's own GetTypeInfoOperation add fa75db2 [SPARK-29026][SQL] Improve error message in `schemaFor` in trait without companion object constructor No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/ScalaReflection.scala | 13 - .../spark/sql/catalyst/ScalaReflectionSuite.scala | 22 ++ 2 files changed, 34 insertions(+), 1 deletion(-)
[spark] branch master updated (c862835 -> db996cc)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c862835 [SPARK-28996][SQL][TESTS] Add tests regarding username of HiveClient add db996cc [SPARK-29074][SQL] Optimize `date_format` for foldable `fmt` No new revisions were added by this update. Summary of changes: .../catalyst/expressions/datetimeExpressions.scala | 32 -- sql/core/benchmarks/DateTimeBenchmark-results.txt | 4 +-- 2 files changed, 26 insertions(+), 10 deletions(-)
[spark] branch master updated (104b9b6 -> 34915b2)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 104b9b6 [SPARK-28483][FOLLOW-UP] Fix flaky test in BarrierTaskContextSuite add 34915b2 [SPARK-29104][CORE][TESTS] Fix PipedRDDSuite to use `eventually` to check thread termination No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/rdd/PipedRDDSuite.scala| 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-)
[spark] branch master updated (723faad -> a754674)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 723faad [SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles() add a754674 [SPARK-28000][SQL][TEST] Port comments.sql No new revisions were added by this update. Summary of changes: .../resources/sql-tests/inputs/pgSQL/comments.sql | 48 + .../sql-tests/results/pgSQL/comments.sql.out | 196 + 2 files changed, 244 insertions(+) create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/comments.sql create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/comments.sql.out
[spark] branch master updated (aa805ec -> 86fc890)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from aa805ec [SPARK-23265][ML] Update multi-column error handling logic in QuantileDiscretizer add 86fc890 [SPARK-28988][SQL][TESTS] Fix invalid tests in CliSuite No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/hive/thriftserver/CliSuite.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-)
[spark] branch master updated (4559a82 -> eef5e6d)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 4559a82 [SPARK-28930][SQL] Last Access Time value shall display 'UNKNOWN' in all clients add eef5e6d [SPARK-29113][DOC] Fix some annotation errors and remove meaningless annotations in project No new revisions were added by this update. Summary of changes: .../main/java/org/apache/spark/io/NioBufferedFileInputStream.java | 1 - core/src/main/java/org/apache/spark/memory/MemoryConsumer.java | 1 - .../java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java | 1 - .../spark/util/collection/unsafe/sort/UnsafeExternalSorter.java| 2 -- .../scala/org/apache/spark/deploy/history/ApplicationCache.scala | 7 +++ .../main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala | 1 - core/src/main/scala/org/apache/spark/storage/BlockManager.scala| 1 - .../apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala | 1 - .../main/scala/org/apache/spark/sql/execution/ExplainUtils.scala | 4 ++-- 9 files changed, 5 insertions(+), 14 deletions(-)
[spark] branch master updated (05988b2 -> 3ece8ee)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 05988b2 [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs add 3ece8ee [SPARK-29124][CORE] Use MurmurHash3 `bytesHash(data, seed)` instead of `bytesHash(data)` No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (3ece8ee -> 4559a82)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3ece8ee [SPARK-29124][CORE] Use MurmurHash3 `bytesHash(data, seed)` instead of `bytesHash(data)` add 4559a82 [SPARK-28930][SQL] Last Access Time value shall display 'UNKNOWN' in all clients No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/catalyst/catalog/interface.scala| 7 +-- .../scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-)
[spark] branch master updated (12e1583 -> c2734ab)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 12e1583 [SPARK-28927][ML] Rethrow block mismatch exception in ALS when input data is nondeterministic add c2734ab [SPARK-29012][SQL] Support special timestamp values No new revisions were added by this update. Summary of changes: .../catalyst/expressions/datetimeExpressions.scala | 4 +- .../spark/sql/catalyst/util/DateTimeUtils.scala| 67 .../sql/catalyst/util/TimestampFormatter.scala | 21 ++- .../sql/catalyst/util/DateTimeUtilsSuite.scala | 36 +++- .../spark/sql/util/TimestampFormatterSuite.scala | 28 ++- .../resources/sql-tests/inputs/pgSQL/timestamp.sql | 29 ++-- .../sql-tests/results/pgSQL/timestamp.sql.out | 190 - .../org/apache/spark/sql/CsvFunctionsSuite.scala | 10 ++ .../org/apache/spark/sql/JsonFunctionsSuite.scala | 10 ++ 9 files changed, 317 insertions(+), 78 deletions(-)
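SPARK-29012 above adds PostgreSQL-style special timestamp keywords. As a rough idea of what "special timestamp values" means, here is a hedged Python sketch — not Spark's Scala `DateTimeUtils` code; `resolve_special_timestamp` and the exact keyword set are illustrative assumptions based on PostgreSQL's special datetime inputs.

```python
# Hypothetical sketch: map special timestamp keywords ('epoch', 'now',
# 'today', 'yesterday', 'tomorrow') to concrete datetimes; return None
# for anything that should fall through to the normal parser.
from datetime import datetime, timedelta, timezone


def resolve_special_timestamp(value, now=None):
    """Resolve a special timestamp keyword, case- and whitespace-insensitively."""
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    specials = {
        "epoch": datetime(1970, 1, 1, tzinfo=timezone.utc),
        "now": now,
        "today": midnight,
        "yesterday": midnight - timedelta(days=1),
        "tomorrow": midnight + timedelta(days=1),
    }
    return specials.get(value.strip().lower())
```

Note that 'today'/'yesterday'/'tomorrow' resolve relative to midnight of the current day, while 'now' keeps the full time-of-day.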
[spark] branch master updated (c2734ab -> 203bf9e)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c2734ab [SPARK-29012][SQL] Support special timestamp values add 203bf9e [SPARK-19926][PYSPARK] make captured exception from JVM side user friendly No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_utils.py | 7 +++ python/pyspark/sql/utils.py| 8 +++- 2 files changed, 14 insertions(+), 1 deletion(-)
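The idea behind SPARK-19926 above — surface the short JVM-side error description to Python users while keeping the long Java stack trace out of the default message — can be sketched as follows. This is a simplified, hypothetical sketch, not the actual `pyspark.sql.utils` code; the class and function names here are illustrative.

```python
# Hypothetical sketch: split a raw JVM error string at the first Java
# stack frame ("\n\tat ...") so str(exception) shows only the short
# description, with the full trace kept on an attribute for debugging.
class AnalysisException(Exception):
    def __init__(self, desc, java_stacktrace=None):
        self.desc = desc
        self.java_stacktrace = java_stacktrace
        super().__init__(desc)

    def __str__(self):
        # User-friendly by default; the JVM trace is still reachable.
        return self.desc


def convert_jvm_exception(message):
    """Wrap a raw JVM error message in a concise Python exception."""
    desc, sep, trace = message.partition("\n\tat ")
    return AnalysisException(desc, (sep + trace) if sep else None)
```

A wrapper like this is typically installed around Py4J calls so that analysis errors raise the concise Python exception instead of dumping the full Java traceback.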
[spark] branch master updated (7d4eb38 -> 1b99d0c)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7d4eb38 [SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tap in Spark documentation add 1b99d0c [SPARK-29069][SQL] ResolveInsertInto should not do table lookup No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 58 ++ 1 file changed, 27 insertions(+), 31 deletions(-)
[spark] branch master updated: [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new be04c97 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base be04c97 is described below commit be04c972623df0c44d92ba55a9efadf59a27089e Author: HyukjinKwon AuthorDate: Thu Sep 5 18:34:44 2019 +0900 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base ### What changes were proposed in this pull request? This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base. Diff comparing to 'pgSQL/aggregates_part3.sql' ```diff ``` ### Why are the changes needed? To improve test coverage in UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested via: ```bash build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql" ``` as guided in https://issues.apache.org/jira/browse/SPARK-27921 Closes #25677 from HyukjinKwon/SPARK-28971. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../inputs/udf/pgSQL/udf-aggregates_part4.sql | 421 + .../results/udf/pgSQL/udf-aggregates_part4.sql.out | 5 + 2 files changed, 426 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part4.sql b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part4.sql new file mode 100644 index 000..7c3 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part4.sql @@ -0,0 +1,421 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- AGGREGATES [Part 4] +-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L607-L997 + +-- This test file was converted from pgSQL/aggregates_part4.sql. 
+ +-- [SPARK-27980] Ordered-Set Aggregate Functions +-- ordered-set aggregates + +-- select p, percentile_cont(p) within group (order by x::float8) +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select p, percentile_cont(p order by p) within group (order by x) -- error +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select p, sum() within group (order by x::float8) -- error +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select p, percentile_cont(p,p) -- error +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select percentile_cont(0.5) within group (order by b) from aggtest; +-- select percentile_cont(0.5) within group (order by b), sum(b) from aggtest; +-- select percentile_cont(0.5) within group (order by thousand) from tenk1; +-- select percentile_disc(0.5) within group (order by thousand) from tenk1; +-- [SPARK-28661] Hypothetical-Set Aggregate Functions +-- select rank(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4)) v(x); +-- select cume_dist(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4)) v(x); +-- select percent_rank(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4),(5)) v(x); +-- select dense_rank(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4)) v(x); + +-- [SPARK-27980] Ordered-Set Aggregate Functions +-- select percentile_disc(array[0,0.1,0.25,0.5,0.75,0.9,1]) within group (order by thousand) +-- from tenk1; +-- select percentile_cont(array[0,0.25,0.5,0.75,1]) within group (order by thousand) +-- from tenk1; +-- select percentile_disc(array[[null,1,0.5],[0.75,0.25,null]]) 
within group (order by thousand) +-- from tenk1; +-- select percentile_cont(array[0,1,0.25,0.75,0.5,1,0.3,0.32,0.35,0.38,0.4]) within group (order by x) +-- from generate_series(1,6) x; + +-- [SPARK-27980] Ordered-Set Aggregate Functions +-- [SPARK-28382] Array Functions: unnest +-- select ten, mode() within group (order by string4) from tenk1 group by ten; + +-- select percentile_disc(array[0.25,0.5,0.75]) within group (order by x) +-- from unnest('{fred,jim,fred,jack,jill,fred,jill,jim,jim,sheila,jim,sheila}'::text[]) u(x); + +-- [SPARK-28669] System Information Functions +-- check collation propagates up in suitable cases: +-- select pg_collation_for(percentile_disc(1) within group (order by x collate "POSIX")) +-- from (values (
[spark] branch master updated (be04c97 -> 103d50b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from be04c97 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base add 103d50b [SPARK-28272][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base No new revisions were added by this update. Summary of changes: .../aggregates_part3.sql => udf/pgSQL/udf-aggregates_part3.sql} | 8 +--- .../pgSQL/udf-aggregates_part3.sql.out} | 8 2 files changed, 9 insertions(+), 7 deletions(-) copy sql/core/src/test/resources/sql-tests/inputs/{pgSQL/aggregates_part3.sql => udf/pgSQL/udf-aggregates_part3.sql} (98%) copy sql/core/src/test/resources/sql-tests/results/{pgSQL/aggregates_part3.sql.out => udf/pgSQL/udf-aggregates_part3.sql.out} (74%)
[spark] branch master updated (f8bc91f -> 36559b6)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from f8bc91f [SPARK-28782][SQL] Generator support in aggregate expressions add 36559b6 [SPARK-28977][DOCS][SQL] Fix DataFrameReader.json docs to doc that partition column can be numeric, date or timestamp type No new revisions were added by this update. Summary of changes: R/pkg/R/SQLContext.R | 3 ++- python/pyspark/sql/readwriter.py | 3 ++- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 3 ++- 3 files changed, 6 insertions(+), 3 deletions(-)
[spark] branch branch-2.4 updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 992b1bb [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress 992b1bb is described below commit 992b1bb3697f6ec9fb5fc2853c33b591715dfd76 Author: gengjiaan AuthorDate: Wed Jul 31 12:17:44 2019 +0900 [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress ## What changes were proposed in this pull request? The latest docs http://spark.apache.org/docs/latest/configuration.html contains some description as below: spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. -- | -- | -- But the class `org.apache.spark.internal.config.UI` define the config `spark.ui.showConsoleProgress` as below: ``` val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") .doc("When true, show the progress bar in the console.") .booleanConf .createWithDefault(false) ``` So I think there are exists some little mistake and lead to confuse reader. ## How was this patch tested? No need UT. Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress. 
Authored-by: gengjiaan Signed-off-by: HyukjinKwon (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02) Signed-off-by: HyukjinKwon --- docs/configuration.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 73bae77..d481986 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -891,11 +891,13 @@ Apart from these, the following properties are also available, and may be useful spark.ui.showConsoleProgress - true + false Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. + +Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.
[spark] branch master updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new dba4375  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
dba4375 is described below

commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So the documented default ("true") does not match the actual default ("false"), which may confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 10886241..93a4fcc 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1062,11 +1062,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple progress
   bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 78d1bb1  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
78d1bb1 is described below

commit 78d1bb188efa55038f63ece70aaa0a5ebaa75f5f
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So the documented default ("true") does not match the actual default ("false"), which may confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
    (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02)
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 42e6d87..822f903 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -867,11 +867,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple progress
   bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
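As an aside for readers of the doc text above: the "progress bar in the console" shows, per running stage, how many tasks are complete and how many are running. The following is a rough, self-contained Python sketch of such a bar; it is an illustration only, not Spark's actual `ConsoleProgressBar` implementation, and the exact format string is an approximation.

```python
def render_progress_bar(stage_id, completed, running, total, width=60):
    """Render one stage's progress line, loosely modeled on the style
    [Stage 0:=============>              (50 + 1) / 100]."""
    header = "[Stage %d:" % stage_id
    tailer = " (%d + %d) / %d]" % (completed, running, total)
    bar_width = width - len(header) - len(tailer)
    # Fill the bar proportionally to completed tasks.
    filled = completed * bar_width // max(total, 1)
    bar = "=" * filled + (">" if 0 < filled < bar_width else "")
    return header + bar.ljust(bar_width) + tailer

print(render_progress_bar(0, 50, 1, 100))
```

In the real implementation the line is periodically redrawn in place (with a carriage return) on stderr while the stage runs; this sketch only produces one snapshot.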
[spark] branch master updated: [SPARK-28038][SQL][TEST] Port text.sql
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 261e113  [SPARK-28038][SQL][TEST] Port text.sql
261e113 is described below

commit 261e113449cf1ae84feb073caff01cc27cb5d10f
Author: Yuming Wang
AuthorDate: Wed Jul 31 11:36:26 2019 +0900

    [SPARK-28038][SQL][TEST] Port text.sql

    ## What changes were proposed in this pull request?

    This PR ports text.sql from the PostgreSQL regression tests:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/text.sql

    The expected results can be found here:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/text.out

    When porting the test cases, I found one PostgreSQL-specific feature that does not exist in Spark SQL:
    [SPARK-28037](https://issues.apache.org/jira/browse/SPARK-28037): Add built-in String Functions: quote_literal

    I also found three inconsistent behaviors:
    [SPARK-27930](https://issues.apache.org/jira/browse/SPARK-27930): Spark SQL's format_string cannot fully support PostgreSQL's format
    [SPARK-28036](https://issues.apache.org/jira/browse/SPARK-28036): Built-in udf left/right has inconsistent behavior
    [SPARK-28033](https://issues.apache.org/jira/browse/SPARK-28033): String concatenation should have lower priority than other operators

    ## How was this patch tested?

    N/A

    Closes #24862 from wangyum/SPARK-28038.
    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../test/resources/sql-tests/inputs/pgSQL/text.sql | 137
 .../resources/sql-tests/results/pgSQL/text.sql.out | 375 +
 2 files changed, 512 insertions(+)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql
new file mode 100644
index 000..04d3acc
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql
@@ -0,0 +1,137 @@
+--
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+--
+-- TEXT
+-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/text.sql
+
+SELECT string('this is a text string') = string('this is a text string') AS true;
+
+SELECT string('this is a text string') = string('this is a text strin') AS `false`;
+
+CREATE TABLE TEXT_TBL (f1 string) USING parquet;
+
+INSERT INTO TEXT_TBL VALUES ('doh!');
+INSERT INTO TEXT_TBL VALUES ('hi de ho neighbor');
+
+SELECT '' AS two, * FROM TEXT_TBL;
+
+-- As of 8.3 we have removed most implicit casts to text, so that for example
+-- this no longer works:
+-- Spark SQL implicit cast integer to string
+select length(42);
+
+-- But as a special exception for usability's sake, we still allow implicit
+-- casting to text in concatenations, so long as the other input is text or
+-- an unknown literal. So these work:
+-- [SPARK-28033] String concatenation low priority than other arithmeticBinary
+select string('four: ') || 2+2;
+select 'four: ' || 2+2;
+
+-- but not this:
+-- Spark SQL implicit cast both side to string
+select 3 || 4.0;
+
+/*
+ * various string functions
+ */
+select concat('one');
+-- Spark SQL does not support YYYYMMDD, we replace it to yyyyMMdd
+select concat(1,2,3,'hello',true, false, to_date('20100309','yyyyMMdd'));
+select concat_ws('#','one');
+select concat_ws('#',1,2,3,'hello',true, false, to_date('20100309','yyyyMMdd'));
+select concat_ws(',',10,20,null,30);
+select concat_ws('',10,20,null,30);
+select concat_ws(NULL,10,20,null,30) is null;
+select reverse('abcde');
+-- [SPARK-28036] Built-in udf left/right has inconsistent behavior
+-- [SPARK-28479] Parser error when enabling ANSI mode
+set spark.sql.parser.ansi.enabled=false;
+select i, left('ahoj', i), right('ahoj', i) from range(-5, 6) t(i) order by i;
+set spark.sql.parser.ansi.enabled=true;
+-- [SPARK-28037] Add built-in String Functions: quote_literal
+-- select quote_literal('');
+-- select quote_literal('abc''');
+-- select quote_literal(e'\\');
+
+-- Skip these tests because Spark does not support variadic labeled argument
+-- check variadic labeled argument
+-- select concat(variadic array[1,2,3]);
+-- select concat_ws(',', variadic array[1,2,3]);
+-- select concat_ws(',', variadic NULL::int[]);
+-- select concat(variadic NULL::int[]) is NULL;
+-- select concat(variadic '{}'::int[]) = '';
+--should fail
+-- select concat_ws(',', variadic 10);
+
+-- [SPARK-27930] Replace format to format_string
+/*
+ * format
+ */
+select format_string(NULL);
+select format_string('Hello');
+select format_string('Hello %s', 'World');
+select format_string('Hello %%');
+select format_string('Hello ');
+-- should fail
+select format_string('Hello %s %s', 'World');
+select format_string('Hello %s');
+select format_string('Hello %x', 20
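Several of the ported queries above exercise `concat_ws` NULL semantics: a NULL separator yields a NULL result, while NULL arguments are skipped rather than rendered. A small Python model of that contract, for illustration only (SQL renders booleans and dates differently from Python's `str`, so only the NULL handling carries over):

```python
def concat_ws(sep, *args):
    """SQL-style concat_ws: a None separator yields None;
    None arguments are skipped, not turned into empty strings."""
    if sep is None:
        return None
    return sep.join(str(a) for a in args if a is not None)

# Mirrors 'select concat_ws(',', 10, 20, null, 30);' from the ported file.
print(concat_ws(',', 10, 20, None, 30))
```

This matches why `concat_ws(NULL, 10, 20, null, 30) is null` is expected to be true in the test file above.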
[spark] branch master updated: [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a745381  [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0
a745381 is described below

commit a745381b9d3dd290057ef3089de7fdb9264f1f8b
Author: WeichenXu
AuthorDate: Wed Jul 31 14:26:18 2019 +0900

    [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0

    ## What changes were proposed in this pull request?

    I removed the deprecated `ImageSchema.readImages` and moved some useful methods from class `ImageSchema` into class `ImageFileFormat`. In PySpark, I renamed the `ImageSchema` class to `ImageUtils` and kept some useful Python methods in it.

    ## How was this patch tested?

    UT.

    Please review https://spark.apache.org/contributing.html before opening a pull request.

    Closes #25245 from WeichenXu123/remove_image_schema.

    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
---
 .../org/apache/spark/ml/image/ImageSchema.scala    |  72 -
 .../apache/spark/ml/image/ImageSchemaSuite.scala   | 171 -
 .../ml/source/image/ImageFileFormatSuite.scala     |  18 +++
 project/MimaExcludes.scala                         |   6 +-
 python/pyspark/ml/image.py                         |  38 -
 python/pyspark/ml/tests/test_image.py              |  29 ++--
 6 files changed, 42 insertions(+), 292 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index a7ddf2f..0313626 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -191,76 +191,4 @@ object ImageSchema {
       Some(Row(Row(origin, height, width, nChannels, mode, decoded)))
     }
   }
-
-  /**
-   * Read the directory of images from the local or remote source
-   *
-   * @note If multiple jobs are run in parallel with different sampleRatio or recursive flag,
-   *       there may be a race condition where one job overwrites the hadoop configs of another.
-   * @note If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
-   *       potentially non-deterministic.
-   *
-   * @param path Path to the image directory
-   * @return DataFrame with a single column "image" of images;
-   *         see ImageSchema for the details
-   */
-  @deprecated("use `spark.read.format(\"image\").load(path)` and this `readImages` will be " +
-    "removed in 3.0.0.", "2.4.0")
-  def readImages(path: String): DataFrame = readImages(path, null, false, -1, false, 1.0, 0)
-
-  /**
-   * Read the directory of images from the local or remote source
-   *
-   * @note If multiple jobs are run in parallel with different sampleRatio or recursive flag,
-   *       there may be a race condition where one job overwrites the hadoop configs of another.
-   * @note If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
-   *       potentially non-deterministic.
-   *
-   * @param path Path to the image directory
-   * @param sparkSession Spark Session, if omitted gets or creates the session
-   * @param recursive Recursive path search flag
-   * @param numPartitions Number of the DataFrame partitions,
-   *                      if omitted uses defaultParallelism instead
-   * @param dropImageFailures Drop the files that are not valid images from the result
-   * @param sampleRatio Fraction of the files loaded
-   * @return DataFrame with a single column "image" of images;
-   *         see ImageSchema for the details
-   */
-  @deprecated("use `spark.read.format(\"image\").load(path)` and this `readImages` will be " +
-    "removed in 3.0.0.", "2.4.0")
-  def readImages(
-      path: String,
-      sparkSession: SparkSession,
-      recursive: Boolean,
-      numPartitions: Int,
-      dropImageFailures: Boolean,
-      sampleRatio: Double,
-      seed: Long): DataFrame = {
-    require(sampleRatio <= 1.0 && sampleRatio >= 0, "sampleRatio should be between 0 and 1")
-
-    val session = if (sparkSession != null) sparkSession else SparkSession.builder().getOrCreate
-    val partitions =
-      if (numPartitions > 0) {
-        numPartitions
-      } else {
-        session.sparkContext.defaultParallelism
-      }
-
-    RecursiveFlag.withRecursiveFlag(recursive, session) {
-      SamplePathFilter.withPathFilter(sampleRatio, session, seed) {
-        val binResult = session.sparkContext.binaryFiles(path, partitions)
-        val streams = if (numPartitions == -1) binResult else binResult.repartition(partitions)
-        val convert = (origi
[spark] branch master updated: [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3b14088  [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon
3b14088 is described below

commit 3b140885410362fced9d98fca61d6a357de604af
Author: WeichenXu
AuthorDate: Wed Jul 31 09:10:24 2019 +0900

    [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon

    ## What changes were proposed in this pull request?

    The PySpark worker daemon reads the worker PIDs to kill from stdin:
    https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127

    However, the worker process is forked from the worker daemon process, and we didn't close stdin in the child after the fork. This means the child and the user program can read stdin as well, which blocks the daemon from receiving the PID to kill. This can cause issues because the task reaper might detect that the task was not terminated and eventually kill the JVM.

    This PR fixes the problem by redirecting the standard input of the forked child to devnull.

    ## How was this patch tested?

    Manually tested. In `pyspark`, run:

    ```
    import subprocess
    def task(_):
        subprocess.check_output(["cat"])

    sc.parallelize(range(1), 1).mapPartitions(task).count()
    ```

    Before: the job gets stuck; pressing Ctrl+C exits the job, but the Python worker process does not exit.
    After: the job finishes correctly, "cat" prints nothing (because the dummy stdin is "/dev/null"), and the Python worker process exits normally.

    Please review https://spark.apache.org/contributing.html before opening a pull request.

    Closes #25138 from WeichenXu123/SPARK-26175.
    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
---
 python/pyspark/daemon.py             | 15 +++
 python/pyspark/sql/tests/test_udf.py | 12
 2 files changed, 27 insertions(+)

diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 6f42ad3..97b6b25 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -160,6 +160,21 @@ def manager():
             if pid == 0:
                 # in child process
                 listen_sock.close()
+
+                # It should close the standard input in the child process so that
+                # Python native function executions stay intact.
+                #
+                # Note that if we just close the standard input (file descriptor 0),
+                # the lowest file descriptor (file descriptor 0) will be allocated,
+                # later when other file descriptors should happen to open.
+                #
+                # Therefore, here we redirects it to '/dev/null' by duplicating
+                # another file descriptor for '/dev/null' to the standard input (0).
+                # See SPARK-26175.
+                devnull = open(os.devnull, 'r')
+                os.dup2(devnull.fileno(), 0)
+                devnull.close()
+
                 try:
                     # Acknowledge that the fork was successful
                     outfile = sock.makefile(mode="wb")
diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py
index 803d471..1999311 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -616,6 +616,18 @@ class UDFTests(ReusedSQLTestCase):
         self.spark.range(1).select(f()).collect()

+    def test_worker_original_stdin_closed(self):
+        # Test if it closes the original standard input of worker inherited from the daemon,
+        # and replaces it with '/dev/null'. See SPARK-26175.
+        def task(iterator):
+            import sys
+            res = sys.stdin.read()
+            # Because the standard input is '/dev/null', it reaches to EOF.
+            assert res == '', "Expect read EOF from stdin."
+            return iterator
+
+        self.sc.parallelize(range(1), 1).mapPartitions(task).count()
+

 class UDFInitializationTests(unittest.TestCase):
     def tearDown(self):

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
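The core of the patch — replacing file descriptor 0 rather than closing it — can be reproduced with nothing but the `os` module. Closing fd 0 outright would let the next `open()` in the process silently receive descriptor 0 and become "stdin"; `dup2` to `/dev/null` avoids that:

```python
import os

# Redirect fd 0 to /dev/null, as the patched daemon does after fork.
# Simply closing fd 0 would be dangerous: the next file the process
# opens would be assigned descriptor 0 and silently become "stdin".
devnull = open(os.devnull, "r")
os.dup2(devnull.fileno(), 0)
devnull.close()

# Any read from fd 0 now hits EOF immediately.
assert os.read(0, 1024) == b""
```

Note this permanently redirects the current process's stdin, which is exactly what the daemon intends for its forked child.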
[spark] branch master updated: [MINOR][DOC] Fix a typo 'lister' -> 'listener'
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e58dd4a [MINOR][DOC] Fix a typo 'lister' -> 'listener' e58dd4a is described below commit e58dd4af60dbb0e3378184c56ae9ded3db288852 Author: Yishuang Lu AuthorDate: Thu Aug 8 11:12:18 2019 +0900 [MINOR][DOC] Fix a typo 'lister' -> 'listener' ## What changes were proposed in this pull request? Fix the typo in java doc. ## How was this patch tested? N/A Signed-off-by: Yishuang Lu Closes #25377 from lys0716/dev. Authored-by: Yishuang Lu Signed-off-by: HyukjinKwon --- .../spark/sql/execution/datasources/parquet/ParquetFileFormat.scala | 4 ++-- .../datasources/v2/parquet/ParquetPartitionReaderFactory.scala| 4 ++-- .../apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index 9caa34b..815b62d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -315,7 +315,7 @@ class ParquetFileFormat val vectorizedReader = new VectorizedParquetRecordReader( convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) val iter = new RecordReaderIterator(vectorizedReader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. 
taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) vectorizedReader.initialize(split, hadoopAttemptContext) logDebug(s"Appending $partitionSchema ${file.partitionValues}") @@ -337,7 +337,7 @@ class ParquetFileFormat new ParquetRecordReader[UnsafeRow](readSupport) } val iter = new RecordReaderIterator(reader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) reader.initialize(split, hadoopAttemptContext) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala index 4a281ba..a0f19c3 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala @@ -197,7 +197,7 @@ case class ParquetPartitionReaderFactory( new ParquetRecordReader[UnsafeRow](readSupport) } val iter = new RecordReaderIterator(reader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) reader } @@ -219,7 +219,7 @@ case class ParquetPartitionReaderFactory( val vectorizedReader = new VectorizedParquetRecordReader( convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) val iter = new RecordReaderIterator(vectorizedReader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. 
taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) logDebug(s"Appending $partitionSchema $partitionValues") vectorizedReader diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala index da2f221..7801d96 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala @@ -33,7 +33,7 @@ class StreamingQueryListenersConfSuite extends StreamTest with BeforeAndAfter { super.sparkConf.set(STREAMING_QUERY_LISTENERS.key, "org.apache.spark.sql.streaming.TestListener") - test("test if the configured query lister i
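The comment being fixed above ("Register a task completion listener before `initialization`", SPARK-23457) encodes an ordering rule: registration must precede initialization because initialization itself can throw, and the iterator should still be closed. A minimal Python model of that ordering — toy names, not Spark's API:

```python
class RecordIterator:
    """Toy stand-in for Spark's RecordReaderIterator."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def run_task(fail_init):
    """Register the close callback BEFORE 'initialization' (the
    SPARK-23457 pattern), so the iterator is closed even when
    initialization fails."""
    it = RecordIterator()
    completion_listeners = [it.close]  # registered first
    try:
        if fail_init:
            raise IOError("initialize() failed")  # simulated init failure
    except IOError:
        pass  # a real task would propagate the failure after cleanup
    finally:
        for callback in completion_listeners:  # task-completion callbacks
            callback()
    return it
```

Registering after initialization would leak the reader whenever `initialize` throws; here the iterator ends up closed in both the success and the failure path.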
[spark] branch master updated: [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bda5b51 [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` bda5b51 is described below commit bda5b51576e525724315d4892e34c8fa7e27f0c7 Author: Anton Yanchenko AuthorDate: Thu Aug 8 11:47:25 2019 +0900 [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` ## What changes were proposed in this pull request? Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`. ## How was this patch tested? Doctests / unittests / manual tests. Unpatched version: ``` In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() Out[23]: [Row(x=None)] ``` Patched: ``` In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() --- ValueErrorTraceback (most recent call last) in > 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 689 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 690 else: --> 691 rdd, schema = self._createFromLocal(map(prepare, data), schema) 692 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 693 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema) 405 # make sure data could consumed multiple times 406 if not isinstance(data, list): --> 407 data = list(data) 408 409 if schema is None or isinstance(schema, (list, tuple)): /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj) 671 672 def prepare(obj): --> 673 
verify_func(obj) 674 return obj 675 elif isinstance(schema, DataType): /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj) 1397 if isinstance(obj, dict): 1398 for f, verifier in verifiers: -> 1399 verifier(obj.get(f)) 1400 elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): 1401 # the order in obj could be different than dataType.fields /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj) 1356 if obj < -9223372036854775808 or obj > 9223372036854775807: 1357 raise ValueError( -> 1358 new_msg("object of LongType out of range, got: %s" % obj)) 1359 1360 verify_value = verify_long ValueError: field x: object of LongType out of range, got: 18446744073709551616 ``` Closes #25117 from simplylizz/master. Authored-by: Anton Yanchenko Signed-off-by: HyukjinKwon --- docs/sql-migration-guide-upgrade.md| 2 ++ python/pyspark/sql/tests/test_types.py | 3 ++- python/pyspark/sql/types.py| 14 ++ 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index efd88d0..b2bd8ce 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -149,6 +149,8 @@ license: | - Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ig
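The check the patch adds is a plain signed-64-bit range test. Reproduced standalone, mirroring the bounds and the error message visible in the traceback above (the real code lives in `pyspark.sql.types._make_type_verifier`):

```python
LONG_MIN = -9223372036854775808   # -(2**63)
LONG_MAX = 9223372036854775807    # 2**63 - 1

def verify_long(obj):
    """Reject values outside the signed 64-bit LongType range instead of
    silently producing NULL, as the unpatched version did."""
    if obj < LONG_MIN or obj > LONG_MAX:
        raise ValueError("object of LongType out of range, got: %s" % obj)
```

With this in place, `1 << 64` fails fast with a `ValueError` at `createDataFrame` time rather than surfacing later as `Row(x=None)`.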
[spark] branch master updated: [SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3586cdd  [SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal
3586cdd is described below

commit 3586cdd24d9f5cb7d3f642a3da6a26ced1f88cea
Author: Yuming Wang
AuthorDate: Thu Aug 8 10:42:24 2019 +0900

    [SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal

    ## What changes were proposed in this pull request?

    This PR makes `spark.sql.function.preferIntegralDivision` an internal configuration because it is only used for PostgreSQL test cases. More details: https://github.com/apache/spark/pull/25158#discussion_r309764541

    ## How was this patch tested?

    N/A

    Closes #25376 from wangyum/SPARK-28395-2.

    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 80a7d4e..f779bc8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1532,8 +1532,9 @@ object SQLConf {
     .createWithDefault(false)

   val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")
+    .internal()
     .doc("When true, will perform integral division with the / operator " +
-      "if both sides are integral types.")
+      "if both sides are integral types. This is for PostgreSQL test cases only.")
     .booleanConf
     .createWithDefault(false)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
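For readers unfamiliar with the flag: it switches `/` on two integral operands from true division to PostgreSQL-style integral division, which truncates toward zero. A standalone Python illustration of the two behaviors — note that Python's own `//` floors (truncates toward negative infinity) rather than truncating toward zero, so negative operands need explicit handling:

```python
def divide(a, b, prefer_integral_division=False):
    """Sketch of what the config toggles: true division by default,
    integral (truncate-toward-zero) division for int operands when
    the flag is on. Names here are illustrative, not Spark internals."""
    if prefer_integral_division and isinstance(a, int) and isinstance(b, int):
        q = abs(a) // abs(b)
        return -q if (a < 0) != (b < 0) else q
    return a / b
```

So `7 / 2` stays `3.5` by default but becomes `3` with the flag on, and `-7 / 2` becomes `-3` (toward zero), not `-4` (Python's floor).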
[spark] branch master updated: [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c4acfe7  [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type
c4acfe7 is described below

commit c4acfe7761bde41e9d26c5ab3a670ab86165cf9b
Author: Yuming Wang
AuthorDate: Thu Aug 8 17:01:25 2019 +0900

    [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type

    ## What changes were proposed in this pull request?

    This PR fixes the issue that the Hive 0.12 JDBC client cannot handle the binary type:

    ```sql
    Connected to: Hive (version 3.0.0-SNAPSHOT)
    Driver: Hive (version 0.12.0)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 0.12.0 by Apache Hive
    0: jdbc:hive2://localhost:1> SELECT cast('ABC' as binary);
    Error: java.lang.ClassCastException: [B incompatible with java.lang.String (state=,code=0)
    ```

    Server log:
    ```
    19/08/07 10:10:04 WARN ThriftCLIService: Error fetching results:
    java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible with java.lang.String
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(AccessController.java:770)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy26.fetchResults(Unknown Source)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:819)
    Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String
        at org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198)
        at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
        at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$getNextRowSet$1(SparkExecuteStatementOperation.scala:151)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$Lambda$1923.9113BFE0.apply(Unknown Source)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withSchedulerPool(SparkExecuteStatementOperation.scala:299)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:113)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        ... 18 more
    ```

    ## How was this patch tested?

    unit tests

    Closes #25379 from wangyum/SPARK-28474.

    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../SparkThriftServerProtocolVersionsSuite.scala           | 14 --
 .../main/java/org/apache/hive/service/cli/ColumnValue.java |  5 -
 .../main/java/org/apache/hive/service/cli/ColumnValue.java |  5 -
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/sql/hive-thriftserver/
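The root cause visible in the `Caused by` frame above is a value dispatcher (`ColumnValue.toTColumnValue`) that assumed every column value was a `String`, while BINARY columns arrive as a raw byte array (`[B`). A hypothetical Python analogue of the missing branch — names and the chosen rendering are illustrative only, not Hive's actual fix:

```python
def to_column_value(value):
    """Sketch of a per-value dispatcher that handles binary explicitly.
    Blindly applying the string path to bytes is the Python analogue of
    the '[B incompatible with java.lang.String' cast failure above."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        # Binary needs its own branch; decoding is one possible rendering.
        return bytes(value).decode("utf-8", errors="replace")
    return str(value)
```

The point is the dispatch, not the rendering: once bytes take a dedicated branch, `SELECT cast('ABC' as binary)` can be marshalled to the old client instead of throwing.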
[spark] branch master updated (c4acfe7 -> 1941d35)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c4acfe7 [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type add 1941d35 [SPARK-28644][SQL] Port HIVE-10646: ColumnValue does not handle NULL_TYPE No new revisions were added by this update. Summary of changes: .../sql/hive/thriftserver/SparkThriftServerProtocolVersionsSuite.scala | 3 +-- .../v1.2.1/src/main/java/org/apache/hive/service/cli/ColumnValue.java | 2 ++ 2 files changed, 3 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org