[spark] branch branch-2.3 updated: [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 8821a8c  [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
8821a8c is described below

commit 8821a8cb8bd56e9960a67961fc604a42625438f9
Author: Dongjoon Hyun
AuthorDate: Fri Jul 12 18:40:07 2019 +0900

    [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed

    ## What changes were proposed in this pull request?

    `SizeBasedRollingPolicy.shouldRollover` returns false when the size is equal to `rolloverSizeBytes`.

    ```scala
      /** Should rollover if the next set of bytes is going to exceed the size limit */
      def shouldRollover(bytesToBeWritten: Long): Boolean = {
        logDebug(s"$bytesToBeWritten + $bytesWrittenSinceRollover > $rolloverSizeBytes")
        bytesToBeWritten + bytesWrittenSinceRollover > rolloverSizeBytes
      }
    ```

    - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/

    ```
    org.scalatest.exceptions.TestFailedException: 1000 was not less than 1000
    ```

    ## How was this patch tested?

    Pass the Jenkins with the updated test.

    Closes #25125 from dongjoon-hyun/SPARK-28357.

    Authored-by: Dongjoon Hyun
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 1c29212394adcbde2de4f4dfdc43a1cf32671ae1)
    Signed-off-by: HyukjinKwon
---
 core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala b/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala
index 52cd537..1a3e880 100644
--- a/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala
@@ -128,7 +128,7 @@ class FileAppenderSuite extends SparkFunSuite with BeforeAndAfter with Logging {
     val files = testRolling(appender, testOutputStream, textToAppend, 0, isCompressed = true)
     files.foreach { file =>
       logInfo(file.toString + ": " + file.length + " bytes")
-      assert(file.length < rolloverSize)
+      assert(file.length <= rolloverSize)
     }
   }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
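The strict `>` comparison in `shouldRollover` means a rolled file can legitimately end up exactly `rolloverSizeBytes` long, which is why the test assertion had to become `<=`. A minimal Python sketch of the same check (illustrative names, not Spark's API):

```python
def should_rollover(bytes_to_write, bytes_written_since_rollover, rollover_size):
    """Roll over only when the next write would EXCEED the size limit.

    Because the comparison is strict (>), a file may end up exactly
    rollover_size bytes long -- hence `assert(file.length <= rolloverSize)`.
    """
    return bytes_to_write + bytes_written_since_rollover > rollover_size

print(should_rollover(100, 900, 1000))  # False: 1000 is not > 1000
print(should_rollover(101, 900, 1000))  # True: 1001 > 1000
```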
[spark] branch master updated (fe22faa -> 1c29212)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from fe22faa  [SPARK-28034][SQL][TEST] Port with.sql
 add 1c29212  [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed

No new revisions were added by this update.

Summary of changes:
 core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (1c29212 -> 13ae9eb)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 1c29212  [SPARK-28357][CORE][TEST] Fix Flaky Test - FileAppenderSuite.rollingfile appender - size-based rolling compressed
 add 13ae9eb  [SPARK-28354][INFRA] Use JIRA user name instead of JIRA user key

No new revisions were added by this update.

Summary of changes:
 dev/merge_spark_pr.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28270][TEST-MAVEN][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 27e41d6  [SPARK-28270][TEST-MAVEN][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql
27e41d6 is described below

commit 27e41d65f158e8f0b04e73df195d3651c7eae485
Author: HyukjinKwon
AuthorDate: Fri Jul 12 14:33:16 2019 +0900

    [SPARK-28270][TEST-MAVEN][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql

    ## What changes were proposed in this pull request?

    It can still be flaky on certain environments due to the float limitation described at https://github.com/apache/spark/pull/25110 . See https://github.com/apache/spark/pull/25110#discussion_r302735905

    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6584/testReport/org.apache.spark.sql/SQLQueryTestSuite/udf_pgSQL_udf_aggregates_part1_sql___Regular_Python_UDF/

    ```
    Expected "7000[6] 1", but got "7000[5] 1"
    Result did not match for query #33
    SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
    FROM (VALUES (70005), (70007)) v(x)
    ```

    Here's what's going on: https://github.com/apache/spark/pull/25110#discussion_r302791930

    ```
    scala> Seq("70004.999", "70006.999").toDF().selectExpr("CAST(avg(value) AS long)").show()
    +--------------------------+
    |CAST(avg(value) AS BIGINT)|
    +--------------------------+
    |                     70005|
    +--------------------------+
    ```

    Therefore, this PR simply avoids the cast in that specific test. This is a temporary fix; we need a more robust way to avoid such cases.

    ## How was this patch tested?

    It passes with Maven in my local before/after this PR. I believe the problem is similarly related to the Python or OS installed on the machine. I should test this against the PR builder with `test-maven` to be sure.

    Closes #25128 from HyukjinKwon/SPARK-28270-2.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 .../resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql | 2 +-
 .../sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out      | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql
index 5b97d3d..3e87733 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part1.sql
@@ -78,7 +78,7 @@ FROM (VALUES ('-Infinity'), ('Infinity')) v(x);
 -- test accuracy with a large input offset
 SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS int), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
 FROM (VALUES (10003), (10004), (10006), (10007)) v(x);
-SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
+SELECT CAST(avg(udf(x)) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))
 FROM (VALUES (70005), (70007)) v(x);

 -- SQL2003 binary aggregates [SPARK-23907]
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 98e04b4..5c08245 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
@@ -271,10 +271,10 @@ struct
+struct
 -- !query 33 output
 70006 1
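The precision effect behind the flake can be reproduced with plain Python: averaging values that were rounded through single-precision floats shifts the truncated result by one. A sketch of the failure mode (not Spark's exact code path):

```python
import struct

def round_trip_float32(x):
    # Round a Python double through a 4-byte IEEE-754 single-precision float
    return struct.unpack('f', struct.pack('f', x))[0]

vals = [70004.999, 70006.999]

avg_double = sum(vals) / len(vals)                                      # ~70005.999
avg_via_float32 = sum(round_trip_float32(v) for v in vals) / len(vals)  # 70006.0

print(int(avg_double))       # 70005
print(int(avg_via_float32))  # 70006 -- the "Expected 70006, got 70005" mismatch
```

Both 70004.999 and 70006.999 round up to whole numbers in single precision (the float32 spacing near 70000 is 2^-7 ≈ 0.0078), so the single-precision average lands exactly on 70006 while the double-precision average truncates to 70005.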
[spark] branch master updated: [SPARK-28378][PYTHON] Remove usage of cgi.escape
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 707411f  [SPARK-28378][PYTHON] Remove usage of cgi.escape
707411f is described below

commit 707411f4793b7d45e19bc5f368e1ec9fc9827737
Author: Liang-Chi Hsieh
AuthorDate: Sun Jul 14 15:26:00 2019 +0900

    [SPARK-28378][PYTHON] Remove usage of cgi.escape

    ## What changes were proposed in this pull request?

    `cgi.escape` is deprecated [1], and removed at 3.8 [2]. We had better replace it.

    [1] https://docs.python.org/3/library/cgi.html#cgi.escape
    [2] https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals

    ## How was this patch tested?

    Existing tests.

    Closes #25142 from viirya/remove-cgi-escape.

    Authored-by: Liang-Chi Hsieh
    Signed-off-by: HyukjinKwon
---
 python/pyspark/sql/dataframe.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index f8be8ee..3984712 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -22,8 +22,10 @@ if sys.version >= '3':
     basestring = unicode = str
     long = int
     from functools import reduce
+    from html import escape as html_escape
 else:
     from itertools import imap as map
+    from cgi import escape as html_escape

 import warnings

@@ -375,7 +377,6 @@ class DataFrame(object):
         by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are
         using support eager evaluation with HTML.
         """
-        import cgi
         if not self._support_repr_html:
             self._support_repr_html = True
         if self.sql_ctx._conf.isReplEagerEvalEnabled():
@@ -390,11 +391,11 @@ class DataFrame(object):
             html = "\n"
             # generate table head
-            html += "%s\n" % "".join(map(lambda x: cgi.escape(x), head))
+            html += "%s\n" % "".join(map(lambda x: html_escape(x), head))
             # generate table rows
             for row in row_data:
                 html += "%s\n" % "".join(
-                    map(lambda x: cgi.escape(x), row))
+                    map(lambda x: html_escape(x), row))
             html += "\n"
             if has_more_data:
                 html += "only showing top %d %s\n" % (
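For the HTML-table use case above, the replacement behaves the same on the characters that matter; the one known difference is that `html.escape` escapes quotes by default, while `cgi.escape` only did so when `quote=True` was passed. A quick check on Python 3:

```python
from html import escape as html_escape  # Python 3 replacement for cgi.escape

print(html_escape("<td>a & b</td>"))
# &lt;td&gt;a &amp; b&lt;/td&gt;

print(html_escape('say "hi"'))
# say &quot;hi&quot;  (cgi.escape left quotes alone unless quote=True was passed)
```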
[spark] branch branch-2.4 updated: [SPARK-28378][PYTHON] Remove usage of cgi.escape
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 35d5886 [SPARK-28378][PYTHON] Remove usage of cgi.escape 35d5886 is described below commit 35d5886dd4b7335f59fa92fa9993b578cfbf67d6 Author: Liang-Chi Hsieh AuthorDate: Sun Jul 14 15:26:00 2019 +0900 [SPARK-28378][PYTHON] Remove usage of cgi.escape ## What changes were proposed in this pull request? `cgi.escape` is deprecated [1], and removed at 3.8 [2]. We better to replace it. [1] https://docs.python.org/3/library/cgi.html#cgi.escape. [2] https://docs.python.org/3.8/whatsnew/3.8.html#api-and-feature-removals ## How was this patch tested? Existing tests. Closes #25142 from viirya/remove-cgi-escape. Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- python/pyspark/sql/dataframe.py | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index 1affc9b..45b90b5 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -22,8 +22,10 @@ if sys.version >= '3': basestring = unicode = str long = int from functools import reduce +from html import escape as html_escape else: from itertools import imap as map +from cgi import escape as html_escape import warnings @@ -393,7 +395,6 @@ class DataFrame(object): by 'spark.sql.repl.eagerEval.enabled', this only called by REPL you are using support eager evaluation with HTML. 
""" -import cgi if not self._support_repr_html: self._support_repr_html = True if self.sql_ctx._conf.isReplEagerEvalEnabled(): @@ -408,11 +409,11 @@ class DataFrame(object): html = "\n" # generate table head -html += "%s\n" % "".join(map(lambda x: cgi.escape(x), head)) +html += "%s\n" % "".join(map(lambda x: html_escape(x), head)) # generate table rows for row in row_data: html += "%s\n" % "".join( -map(lambda x: cgi.escape(x), row)) +map(lambda x: html_escape(x), row)) html += "\n" if has_more_data: html += "only showing top %d %s\n" % ( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d41bd7c [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator d41bd7c is described below commit d41bd7c8910a960dec5b8605f1cb5d607a4ce958 Author: Wenchen Fan AuthorDate: Tue Jul 9 11:52:12 2019 +0900 [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator ## What changes were proposed in this pull request? This PR reverts the partial bug fix in `ShuffleBlockFetcherIterator` which was introduced by https://github.com/apache/spark/pull/23638 . The reasons: 1. It's a potential bug. After fixing `PipelinedRDD` in #23638 , the original problem was resolved. 2. The fix is incomplete according to [the discussion](https://github.com/apache/spark/pull/23638#discussion_r251869084) We should fix the potential bug completely later. ## How was this patch tested? existing tests Closes #25049 from cloud-fan/revert. Authored-by: Wenchen Fan Signed-off-by: HyukjinKwon --- .../storage/ShuffleBlockFetcherIterator.scala | 16 ++ .../storage/ShuffleBlockFetcherIteratorSuite.scala | 58 -- 2 files changed, 3 insertions(+), 71 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala b/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala index a4d91a7..a283757 100644 --- a/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala +++ b/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala @@ -142,14 +142,7 @@ final class ShuffleBlockFetcherIterator( /** * Whether the iterator is still active. If isZombie is true, the callback interface will no - * longer place fetched blocks into [[results]] and the iterator is marked as fully consumed. 
- * - * When the iterator is inactive, [[hasNext]] and [[next]] calls will honor that as there are - * cases the iterator is still being consumed. For example, ShuffledRDD + PipedRDD if the - * subprocess command is failed. The task will be marked as failed, then the iterator will be - * cleaned up at task completion, the [[next]] call (called in the stdin writer thread of - * PipedRDD if not exited yet) may hang at [[results.take]]. The defensive check in [[hasNext]] - * and [[next]] reduces the possibility of such race conditions. + * longer place fetched blocks into [[results]]. */ @GuardedBy("this") private[this] var isZombie = false @@ -388,7 +381,7 @@ final class ShuffleBlockFetcherIterator( logDebug(s"Got local blocks in ${Utils.getUsedTimeNs(startTimeNs)}") } - override def hasNext: Boolean = !isZombie && (numBlocksProcessed < numBlocksToFetch) + override def hasNext: Boolean = numBlocksProcessed < numBlocksToFetch /** * Fetches the next (BlockId, InputStream). If a task fails, the ManagedBuffers @@ -412,7 +405,7 @@ final class ShuffleBlockFetcherIterator( // then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch // is also corrupt, so the previous stage could be retried. // For local shuffle block, throw FailureFetchResult for the first IOException. -while (!isZombie && result == null) { +while (result == null) { val startFetchWait = System.nanoTime() result = results.take() val fetchWaitTime = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startFetchWait) @@ -505,9 +498,6 @@ final class ShuffleBlockFetcherIterator( fetchUpToMaxBytes() } -if (result == null) { // the iterator is already closed/cleaned up. 
- throw new NoSuchElementException() -} currentResult = result.asInstanceOf[SuccessFetchResult] (currentResult.blockId, new BufferReleasingInputStream( diff --git a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala index 3ab2f0b..ed40244 100644 --- a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala +++ b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala @@ -215,64 +215,6 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT verify(blocks(ShuffleBlockId(0, 2, 0)), times(0)).release() } - test("iterator is all consumed if task completes early") { -val blockManager = mock(classOf[BlockManager]) -val localBmId = BlockManagerId("test-client", "test-client", 1) -doReturn(localBmId).when(blockManager).blockManagerId - -// Make su
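For context, the reverted defensive check guarded a blocking-queue consumer with a "zombie" flag so a cleanup from another thread could not leave a consumer hanging on `results.take()`. A minimal Python sketch of that pattern (illustrative only; names do not match Spark's internals):

```python
import queue
import threading

class FetchIterator:
    """A consumer over a blocking queue that can be shut down from another thread."""

    def __init__(self, num_expected):
        self.results = queue.Queue()
        self.num_processed = 0
        self.num_expected = num_expected
        self.is_zombie = False
        self.lock = threading.Lock()

    def cleanup(self):
        # Called on task completion; wake any blocked consumer so it cannot
        # hang forever on results.get() -- the race the reverted fix targeted.
        with self.lock:
            self.is_zombie = True
        self.results.put(None)  # sentinel to unblock a waiting get()

    def has_next(self):
        with self.lock:
            return not self.is_zombie and self.num_processed < self.num_expected

    def next(self):
        item = self.results.get()  # blocks like results.take()
        if item is None:
            raise StopIteration("iterator was cleaned up")
        self.num_processed += 1
        return item
```

The revert removed this guarding from Spark because the root cause was fixed in `PipelinedRDD` (#23638) and the partial guard was itself considered a potential bug.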
[spark] branch master updated: [SPARK-28309][R][INFRA] Fix AppVeyor to run SparkR tests by avoiding to use devtools for testthat
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f15102b  [SPARK-28309][R][INFRA] Fix AppVeyor to run SparkR tests by avoiding to use devtools for testthat
f15102b is described below

commit f15102b1702b64a54233ae31357e32335722f4e5
Author: HyukjinKwon
AuthorDate: Tue Jul 9 12:06:46 2019 +0900

    [SPARK-28309][R][INFRA] Fix AppVeyor to run SparkR tests by avoiding to use devtools for testthat

    ## What changes were proposed in this pull request?

    It looks like `devtools` 2.1.0 was released, and our AppVeyor build now uses the latest one. The problem is that it added `testthat` 2.1.1+ as a dependency - https://github.com/r-lib/devtools/blob/master/DESCRIPTION#L35

    Usually the old version should be removed and reinstalled properly when we install other packages; however, this seems to be failing in AppVeyor due to the previous installation, for an unknown reason.

    ```
    [00:01:41] > devtools::install_version('testthat', version = '1.0.2', repos='https://cloud.r-project.org/')
    [00:01:44] Downloading package from url: https://cloud.r-project.org//src/contrib/Archive/testthat/testthat_1.0.2.tar.gz
    ...
    [00:02:25] WARNING: moving package to final location failed, copying instead
    [00:02:25] Warning in file.copy(instdir, dirname(final_instdir), recursive = TRUE, :
    [00:02:25]   problem copying c:\RLibrary\00LOCK-testthat\00new\testthat\libs\i386\testthat.dll to c:\RLibrary\testthat\libs\i386\testthat.dll: Permission denied
    [00:02:25] ** testing if installed package can be loaded from final location
    [00:02:25] *** arch - i386
    [00:02:26] Error: package or namespace load failed for 'testthat' in FUN(X[[i]], ...):
    [00:02:26]  no such symbol find_label_ in package c:/RLibrary/testthat/libs/i386/testthat.dll
    [00:02:26] Error: loading failed
    [00:02:26] Execution halted
    [00:02:26] *** arch - x64
    [00:02:26] ERROR: loading failed for 'i386'
    [00:02:26] * removing 'c:/RLibrary/testthat'
    [00:02:26] * restoring previous 'c:/RLibrary/testthat'
    [00:02:26] Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
    [00:02:26]   problem copying c:\RLibrary\00LOCK-testthat\testthat\libs\i386\testthat.dll to c:\RLibrary\testthat\libs\i386\testthat.dll: Permission denied
    [00:02:26] Warning message:
    [00:02:26] In i.p(...) :
    [00:02:26]   installation of package 'C:/Users/appveyor/AppData/Local/Temp/1/RtmpIx25hi/remotes5743d4a9b1/testthat' had non-zero exit status
    ```

    See https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/25818746

    Our SparkR testbed requires `testthat` 1.0.2 at most for its current status, and `devtools` was installed at SPARK-22817 to pin the `testthat` version to 1.0.2.

    Therefore, this PR works around the current issue by installing directly from the archive instead, without using `devtools`:

    ```R
    R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')"
    ```

    ## How was this patch tested?

    AppVeyor will test.

    Closes #25081 from HyukjinKwon/SPARK-28309.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 appveyor.yml | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/appveyor.yml b/appveyor.yml
index bdb948b..8fb090c 100644
--- a/appveyor.yml
+++ b/appveyor.yml
@@ -42,9 +42,12 @@ install:
   # Install maven and dependencies
   - ps: .\dev\appveyor-install-dependencies.ps1
   # Required package for R unit tests
-  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"
+  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"
   # Here, we use the fixed version of testthat. For more details, please see SPARK-22817.
-  - cmd: R -e "devtools::install_version('testthat', version = '1.0.2', repos='https://cloud.r-project.org/')"
+  # As of devtools 2.1.0, it requires testthat higher then 2.1.1 as a dependency. SparkR test requires testthat 1.0.2.
+  # Therefore, we don't use devtools but installs it directly from the archive including its dependencies.
+  - cmd: R -e "install.packages(c('crayon', 'praise', 'R6'), repos='https://cloud.r-project.org/')"
+  - cmd: R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')"
   - cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')"

 build_script:
---
[spark] branch master updated (d41bd7c -> 75ea02b)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d41bd7c  [SPARK-26713][CORE][FOLLOWUP] revert the partial fix in ShuffleBlockFetcherIterator
 add 75ea02b  [SPARK-28250][SQL] QueryPlan#references should exclude producedAttributes

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/catalyst/plans/QueryPlan.scala | 21 ++---
 .../plans/logical/basicLogicalOperators.scala       |  3 +--
 .../plans/logical/pythonLogicalOperators.scala      |  3 +--
 .../apache/spark/sql/execution/GenerateExec.scala   |  2 +-
 .../sql/execution/columnar/InMemoryRelation.scala   |  2 --
 .../org/apache/spark/sql/execution/objects.scala    |  2 --
 .../spark/sql/execution/python/EvalPythonExec.scala |  3 +-
 7 files changed, 14 insertions(+), 22 deletions(-)
[spark] branch branch-2.4 updated: [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 55f92a3  [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
55f92a3 is described below

commit 55f92a31d7c1a6f02a9b0fc2ace6c5a5e0871ec4
Author: wuyi
AuthorDate: Tue Jul 9 15:49:31 2019 +0900

    [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows

    ## What changes were proposed in this pull request?

    When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show "The process cannot access the file because it is being used by another process" and remain in LOST state at the end.

    The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala).

    After digging into the code, I found that Windows cmd `%RANDOM%` returns the same number if we call it instantly (e.g. < 500ms) after the last call. As a result, SparkLauncher would get the same output file (spark-class-launcher-output-%RANDOM%.txt) for different apps. The following app would then hit the issue when it tries to write the same file which has already been opened for writing by another app.

    We should make sure to generate a unique output file for SparkLauncher on Windows to avoid this issue.

    ## How was this patch tested?

    Tested manually on Windows.

    Closes #25076 from Ngone51/SPARK-28302.

    Authored-by: wuyi
    Signed-off-by: HyukjinKwon
    (cherry picked from commit 925f620570a022ff8229bfde076e7dde6bf242df)
    Signed-off-by: HyukjinKwon
---
 bin/spark-class2.cmd | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd
index 5da7d7a..34d04c9 100644
--- a/bin/spark-class2.cmd
+++ b/bin/spark-class2.cmd
@@ -63,7 +63,12 @@ if not "x%JAVA_HOME%"=="x" (
 rem The launcher library prints the command to be executed in a single line suitable for being
 rem executed by the batch interpreter. So read all the output of the launcher into a variable.
+:gen
 set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
+rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call,
+rem so we should make it sure to generate unique file to avoid process collision of writing into
+rem the same file concurrently.
+if exist %LAUNCHER_OUTPUT% goto :gen
 "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT%
 for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT%) do (
   set SPARK_CMD=%%i
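The batch-file fix retries name generation until an unused path is found. The same idea in a Python sketch (cmd's `%RANDOM%` only yields values in 0-32767, hence the narrow namespace and the collisions under concurrency):

```python
import os
import random

def unique_launcher_output(tmpdir):
    # Keep regenerating the pseudo-random suffix until the path is unused,
    # mirroring the `:gen` / `if exist ... goto :gen` loop in spark-class2.cmd.
    while True:
        path = os.path.join(
            tmpdir, "spark-class-launcher-output-%d.txt" % random.randrange(32768))
        if not os.path.exists(path):
            return path
```

Note that this check-then-create pattern still has a small race window between the existence check and the subsequent open; Python's `tempfile.mkstemp` closes that window by creating atomically, but the batch interpreter has no comparable primitive, so the retry loop is the practical fix there.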
[spark] branch master updated: [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 925f620 [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows 925f620 is described below commit 925f620570a022ff8229bfde076e7dde6bf242df Author: wuyi AuthorDate: Tue Jul 9 15:49:31 2019 +0900 [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows ## What changes were proposed in this pull request? When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show that "The process cannot access the file because it is being used by another process" and remains in LOST state at the end. The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala). After digging into the code, I find that, Windows cmd `%RANDOM%` would return the same number if we call it instantly(e.g. < 500ms) after last call. As a result, SparkLauncher would get same output file(spark-class-launcher-output-%RANDOM%.txt) for apps. Then, the following app would hit the issue when it tries to write the same file which has already been opened for writing by another app. We should make sure to generate unique output file for SparkLauncher on Windows to avoid this issue. ## How was this patch tested? Tested manually on Windows. Closes #25076 from Ngone51/SPARK-28302. 
Authored-by: wuyi Signed-off-by: HyukjinKwon --- bin/spark-class2.cmd | 5 + 1 file changed, 5 insertions(+) diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd index 5da7d7a..34d04c9 100644 --- a/bin/spark-class2.cmd +++ b/bin/spark-class2.cmd @@ -63,7 +63,12 @@ if not "x%JAVA_HOME%"=="x" ( rem The launcher library prints the command to be executed in a single line suitable for being rem executed by the batch interpreter. So read all the output of the launcher into a variable. +:gen set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt +rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call, +rem so we should make it sure to generate unique file to avoid process collision of writing into +rem the same file concurrently. +if exist %LAUNCHER_OUTPUT% goto :gen "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT% for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT%) do ( set SPARK_CMD=%%i - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated: [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new 74caacf [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows 74caacf is described below commit 74caacf5c002c9ccec9a8f8b6f20cb3887469176 Author: wuyi AuthorDate: Tue Jul 9 15:49:31 2019 +0900 [SPARK-28302][CORE] Make sure to generate unique output file for SparkLauncher on Windows ## What changes were proposed in this pull request? When using SparkLauncher to submit applications **concurrently** with multiple threads under **Windows**, some apps would show that "The process cannot access the file because it is being used by another process" and remains in LOST state at the end. The issue can be reproduced by this [demo](https://issues.apache.org/jira/secure/attachment/12973920/Main.scala). After digging into the code, I find that, Windows cmd `%RANDOM%` would return the same number if we call it instantly(e.g. < 500ms) after last call. As a result, SparkLauncher would get same output file(spark-class-launcher-output-%RANDOM%.txt) for apps. Then, the following app would hit the issue when it tries to write the same file which has already been opened for writing by another app. We should make sure to generate unique output file for SparkLauncher on Windows to avoid this issue. ## How was this patch tested? Tested manually on Windows. Closes #25076 from Ngone51/SPARK-28302. 
Authored-by: wuyi Signed-off-by: HyukjinKwon (cherry picked from commit 925f620570a022ff8229bfde076e7dde6bf242df) Signed-off-by: HyukjinKwon --- bin/spark-class2.cmd | 5 + 1 file changed, 5 insertions(+) diff --git a/bin/spark-class2.cmd b/bin/spark-class2.cmd index 5da7d7a..34d04c9 100644 --- a/bin/spark-class2.cmd +++ b/bin/spark-class2.cmd @@ -63,7 +63,12 @@ if not "x%JAVA_HOME%"=="x" ( rem The launcher library prints the command to be executed in a single line suitable for being rem executed by the batch interpreter. So read all the output of the launcher into a variable. +:gen set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt +rem SPARK-28302: %RANDOM% would return the same number if we call it instantly after last call, +rem so we should make it sure to generate unique file to avoid process collision of writing into +rem the same file concurrently. +if exist %LAUNCHER_OUTPUT% goto :gen "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main %* > %LAUNCHER_OUTPUT% for /f "tokens=*" %%i in (%LAUNCHER_OUTPUT%) do ( set SPARK_CMD=%%i - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28233][BUILD] Upgrade maven-jar-plugin and maven-source-plugin
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 90bd017 [SPARK-28233][BUILD] Upgrade maven-jar-plugin and maven-source-plugin 90bd017 is described below commit 90bd017c107382b7accdadfb52b02df98e8ea7bb Author: Yuming Wang AuthorDate: Wed Jul 3 21:46:11 2019 +0900 [SPARK-28233][BUILD] Upgrade maven-jar-plugin and maven-source-plugin ## What changes were proposed in this pull request? This pr upgrade `maven-jar-plugin` to 3.1.2 and `maven-source-plugin` to 3.1.0 to avoid: - [MJAR-259](https://issues.apache.org/jira/browse/MJAR-259) – Archiving to jar is very slow - [MSOURCES-119](https://issues.apache.org/jira/browse/MSOURCES-119) – Archiving to jar is very slow Release notes: https://blogs.apache.org/maven/entry/apache-maven-source-plugin-version https://blogs.apache.org/maven/entry/apache-maven-jar-plugin-version2 https://blogs.apache.org/maven/entry/apache-maven-jar-plugin-version1 ## How was this patch tested? N/A Closes #25031 from wangyum/SPARK-28233. Authored-by: Yuming Wang Signed-off-by: HyukjinKwon --- pom.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pom.xml b/pom.xml index 087c2ca..88a3954 100644 --- a/pom.xml +++ b/pom.xml @@ -2431,7 +2431,7 @@ org.apache.maven.plugins maven-jar-plugin - 3.1.0 + 3.1.2 org.apache.maven.plugins @@ -2441,7 +2441,7 @@ org.apache.maven.plugins maven-source-plugin - 3.0.1 + 3.1.0 true @@ -2600,7 +2600,7 @@ org.apache.maven.plugins maven-jar-plugin -3.1.0 +3.1.2 test-jar - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 591de42 [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30 591de42 is described below commit 591de423517de379b0900fbf5751b48492d15729 Author: Liang-Chi Hsieh AuthorDate: Mon Jul 15 12:29:58 2019 +0900 [SPARK-28381][PYSPARK] Upgraded version of Pyrolite to 4.30 ## What changes were proposed in this pull request? This upgrades Pyrolite to a newer version. Most updates [1] in the newer version are for dotnet. For Java, it includes a bug fix to Unpickler regarding cleaning up the Unpickler memo, and support for protocol 5. After upgrading, we can remove the fix from SPARK-27629 for the bug in Unpickler. [1] https://github.com/irmen/Pyrolite/compare/pyrolite-4.23...master ## How was this patch tested? Manually tested locally on Python 3.6 with existing tests. Closes #25143 from viirya/upgrade-pyrolite. 
Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- core/pom.xml | 2 +- core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala | 3 --- dev/deps/spark-deps-hadoop-2.7| 2 +- dev/deps/spark-deps-hadoop-3.2| 2 +- .../main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 4 python/pyspark/sql/tests/test_serde.py| 4 .../org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala | 4 7 files changed, 3 insertions(+), 18 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 8a872de..4446dbd 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -378,7 +378,7 @@ net.razorvine pyrolite - 4.23 + 4.30 net.razorvine diff --git a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala index 9462dfd..01e64b6 100644 --- a/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala +++ b/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala @@ -186,9 +186,6 @@ private[spark] object SerDeUtil extends Logging { val unpickle = new Unpickler iter.flatMap { row => val obj = unpickle.loads(row) -// `Opcodes.MEMOIZE` of Protocol 4 (Python 3.4+) will store objects in internal map -// of `Unpickler`. This map is cleared when calling `Unpickler.close()`. 
-unpickle.close() if (batched) { obj match { case array: Array[Any] => array.toSeq diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 2f660cc..79158bb 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -170,7 +170,7 @@ parquet-hadoop-bundle-1.6.0.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.23.jar +pyrolite-4.30.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar scala-parser-combinators_2.12-1.1.0.jar diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2 index e1e114f..5e03a59 100644 --- a/dev/deps/spark-deps-hadoop-3.2 +++ b/dev/deps/spark-deps-hadoop-3.2 @@ -189,7 +189,7 @@ parquet-hadoop-1.10.1.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.23.jar +pyrolite-4.30.jar re2j-1.1.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 4c478a5..4617073 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1357,10 +1357,6 @@ private[spark] abstract class SerDeBase { val unpickle = new Unpickler iter.flatMap { row => val obj = unpickle.loads(row) -// `Opcodes.MEMOIZE` of Protocol 4 (Python 3.4+) will store objects in internal map -// of `Unpickler`. This map is cleared when calling `Unpickler.close()`. Pyrolite -// doesn't clear it up, so we manually clear it. 
-unpickle.close() if (batched) { obj match { case list: JArrayList[_] => list.asScala diff --git a/python/pyspark/sql/tests/test_serde.py b/python/pyspark/sql/tests/test_serde.py index f9bed76..ea2a686 100644 --- a/python/pyspark/sql/tests/test_serde.py +++ b/python/pyspark/sql/tests/test_serde.py @@ -128,10 +128,6 @@ class SerdeTests(ReusedSQLTestCase): def test_int_array_serialization(self): # Note that this test seems dependent on parallelism. -
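The `unpickle.close()` calls removed above existed because protocol 4's `MEMOIZE` opcode makes the unpickler cache every object in an internal memo map, which Pyrolite 4.23 never cleared; 4.30 handles that itself. CPython's own `pickle` module shows the memo mechanism (this is an analogy to, not the code of, Pyrolite's Java `Unpickler`):

```python
import pickle
import pickletools

# Protocol 4 (Python 3.4+) emits a MEMOIZE opcode after each object so a
# later back-reference can reuse it; the unpickler keeps those objects in
# an internal memo map until that map is cleared.
payload = "x" * 100
data = pickle.dumps([payload, payload], protocol=4)

# The shared string is serialized once; the second list element is just a
# memo back-reference, so the stream is much smaller than two full copies.
assert len(data) < 2 * len(payload)

memoize_count = sum(
    1 for opcode, _, _ in pickletools.genops(data) if opcode.name == "MEMOIZE")
print(memoize_count)
```

Every memoized object stays referenced by the unpickler until the memo is cleared, which is exactly the leak SPARK-27629 worked around.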
[spark] branch master updated: [SPARK-28392][SQL][TESTS] Add traits for UDF and PostgreSQL tests to share initialization
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a7a02a8 [SPARK-28392][SQL][TESTS] Add traits for UDF and PostgreSQL tests to share initialization a7a02a8 is described below commit a7a02a86adafd3808051d843cf7e70176a7c4099 Author: HyukjinKwon AuthorDate: Mon Jul 15 16:20:09 2019 +0900 [SPARK-28392][SQL][TESTS] Add traits for UDF and PostgreSQL tests to share initialization ## What changes were proposed in this pull request? This PR adds some traits so that we can deduplicate initialization stuff for each type of test case. For instance, see [SPARK-28343](https://issues.apache.org/jira/browse/SPARK-28343). It's a little bit overkill but I think it will make adding test cases easier and cause less confusions. This PR adds both: ``` private trait PgSQLTest private trait UDFTest ``` To indicate and share the logics related to each combination of test types. ## How was this patch tested? Manually tested. Closes #25155 from HyukjinKwon/SPARK-28392. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../sql-tests/inputs/udf/pgSQL/udf-case.sql| 5 - .../sql-tests/results/udf/pgSQL/udf-case.sql.out | 190 ++--- .../org/apache/spark/sql/SQLQueryTestSuite.scala | 56 -- 3 files changed, 129 insertions(+), 122 deletions(-) diff --git a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql index b05c21d..a2aab79 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-case.sql @@ -6,14 +6,10 @@ -- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/case.sql -- Test the CASE statement -- --- This test suite contains two Cartesian products without using explicit CROSS JOIN syntax. 
--- Thus, we set spark.sql.crossJoin.enabled to true. - -- This test file was converted from pgSQL/case.sql. -- Note that currently registered UDF returns a string. So there are some differences, for instance -- in string cast within UDF in Scala and Python. -set spark.sql.crossJoin.enabled=true; CREATE TABLE CASE_TBL ( i integer, f double @@ -269,4 +265,3 @@ SELECT CASE DROP TABLE CASE_TBL; DROP TABLE CASE2_TBL; -set spark.sql.crossJoin.enabled=false; diff --git a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out index 55bef64..6bb7a78 100644 --- a/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-case.sql.out @@ -1,19 +1,22 @@ -- Automatically generated by SQLQueryTestSuite --- Number of queries: 37 +-- Number of queries: 35 -- !query 0 -set spark.sql.crossJoin.enabled=true +CREATE TABLE CASE_TBL ( + i integer, + f double +) USING parquet -- !query 0 schema -struct +struct<> -- !query 0 output -spark.sql.crossJoin.enabledtrue + -- !query 1 -CREATE TABLE CASE_TBL ( +CREATE TABLE CASE2_TBL ( i integer, - f double + j integer ) USING parquet -- !query 1 schema struct<> @@ -22,10 +25,7 @@ struct<> -- !query 2 -CREATE TABLE CASE2_TBL ( - i integer, - j integer -) USING parquet +INSERT INTO CASE_TBL VALUES (1, 10.1) -- !query 2 schema struct<> -- !query 2 output @@ -33,7 +33,7 @@ struct<> -- !query 3 -INSERT INTO CASE_TBL VALUES (1, 10.1) +INSERT INTO CASE_TBL VALUES (2, 20.2) -- !query 3 schema struct<> -- !query 3 output @@ -41,7 +41,7 @@ struct<> -- !query 4 -INSERT INTO CASE_TBL VALUES (2, 20.2) +INSERT INTO CASE_TBL VALUES (3, -30.3) -- !query 4 schema struct<> -- !query 4 output @@ -49,7 +49,7 @@ struct<> -- !query 5 -INSERT INTO CASE_TBL VALUES (3, -30.3) +INSERT INTO CASE_TBL VALUES (4, NULL) -- !query 5 schema struct<> -- !query 5 output @@ -57,7 +57,7 @@ struct<> -- !query 6 
-INSERT INTO CASE_TBL VALUES (4, NULL) +INSERT INTO CASE2_TBL VALUES (1, -1) -- !query 6 schema struct<> -- !query 6 output @@ -65,7 +65,7 @@ struct<> -- !query 7 -INSERT INTO CASE2_TBL VALUES (1, -1) +INSERT INTO CASE2_TBL VALUES (2, -2) -- !query 7 schema struct<> -- !query 7 output @@ -73,7 +73,7 @@ struct<> -- !query 8 -INSERT INTO CASE2_TBL VALUES (2, -2) +INSERT INTO CASE2_TBL VALUES (3, -3) -- !query 8 schema struct<> -- !query 8 output @@ -81,7 +81,7 @@ struct<> -- !query 9 -INSERT INTO CASE2_TBL VALUES (3, -3) +INSERT INTO CASE2_TBL VALUES (2, -4) -- !query 9 schema struct<> -- !query 9 outp
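The idea behind the `PgSQLTest` and `UDFTest` traits above has a close analogue in Python's `unittest` mixins: each mixin owns one flavor of per-test setup, and a concrete suite composes them so combinations share initialization instead of duplicating it. All class and attribute names below are illustrative, not Spark's:

```python
import unittest

class PgSQLTestMixin:
    # Owns PostgreSQL-dialect setup, e.g. enabling cross joins for test
    # files that contain Cartesian products.
    def setUp(self):
        super().setUp()
        self.conf = {"spark.sql.crossJoin.enabled": "true"}

class UDFTestMixin:
    # Owns UDF registration; a stand-in list instead of a real registry.
    def setUp(self):
        super().setUp()
        self.registered_udfs = ["test_udf"]

class UdfPgSQLSuite(UDFTestMixin, PgSQLTestMixin, unittest.TestCase):
    # Composes both setups via cooperative super().setUp() calls.
    def test_shared_initialization(self):
        self.assertEqual(self.conf["spark.sql.crossJoin.enabled"], "true")
        self.assertIn("test_udf", self.registered_udfs)

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(UdfPgSQLSuite))
print(result.wasSuccessful())
```

Each mixin calls `super().setUp()` first, so the MRO threads every flavor's initialization together, which is the same deduplication the Scala traits achieve in `SQLQueryTestSuite`.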
[spark] branch master updated: [SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4cb1cd6 [SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex" 4cb1cd6 is described below commit 4cb1cd6ab7b7bffd045a786b0ddb7c7783afdf46 Author: shivusondur AuthorDate: Sun Apr 21 16:56:12 2019 +0900 [SPARK-27532][DOC] Correct the default value in the Documentation for "spark.redaction.regex" ## What changes were proposed in this pull request? Corrected the default value in the Documentation for "spark.redaction.regex" ## How was this patch tested? NA Closes #24428 from shivusondur/doc2. Authored-by: shivusondur Signed-off-by: HyukjinKwon --- docs/configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 5325f8a..2cb1a5f 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -472,7 +472,7 @@ Apart from these, the following properties are also available, and may be useful spark.redaction.regex - (?i)secret|password + (?i)secret|password|token Regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches a property key or - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
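The documented default above is applied against configuration *keys*: any property whose key matches the regex has its value masked before it reaches the UI or logs. A minimal sketch of that behavior (the `redact` helper and the `*********(redacted)` replacement string here are illustrative; Spark's implementation lives in `org.apache.spark.util.Utils`):

```python
import re

# Default from the corrected docs; case-insensitive substring match.
REDACTION_REGEX = re.compile(r"(?i)secret|password|token")

def redact(conf):
    """Mask values whose key matches the redaction regex."""
    return {k: ("*********(redacted)" if REDACTION_REGEX.search(k) else v)
            for k, v in conf.items()}

conf = {
    "spark.hadoop.fs.s3a.access.token": "abc123",
    "spark.executor.memory": "4g",
    "MY_PASSWORD": "hunter2",
}
print(redact(conf))
```

Because the match is a search, not a full-string match, `token` anywhere in a key (as in the S3A example) triggers redaction, which is precisely why the default gained `|token`.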
[spark] branch master updated: [SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d61b3bc [SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types d61b3bc is described below commit d61b3bc875941d5a815e0e68fe7aa986e372b4e8 Author: Maxim Gekk AuthorDate: Sun Apr 21 16:53:11 2019 +0900 [SPARK-27527][SQL][DOCS] Improve descriptions of Timestamp and Date types ## What changes were proposed in this pull request? In the PR, I propose a more precise description of `TimestampType` and `DateType`, and how they store timestamps and dates internally. Closes #24424 from MaxGekk/timestamp-date-type-doc. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- .../scala/org/apache/spark/sql/types/DateType.scala | 20 .../org/apache/spark/sql/types/TimestampType.scala | 19 ++- 2 files changed, 26 insertions(+), 13 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala index 7491014..ba322fa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateType.scala @@ -23,19 +23,18 @@ import scala.reflect.runtime.universe.typeTag import org.apache.spark.annotation.Stable /** - * A date type, supporting "0001-01-01" through "9999-12-31". - * - * Please use the singleton `DataTypes.DateType`. - * - * Internally, this is represented as the number of days from 1970-01-01. + * The date type represents a valid date in the proleptic Gregorian calendar. + * Valid range is [0001-01-01, 9999-12-31]. * + * Please use the singleton `DataTypes.DateType` to refer the type. 
* @since 1.3.0 */ @Stable class DateType private() extends AtomicType { - // The companion object and this class is separated so the companion object also subclasses - // this type. Otherwise, the companion object would be of type "DateType$" in byte code. - // Defined with a private constructor so the companion object is the only possible instantiation. + /** + * Internally, a date is stored as a simple incrementing count of days + * where day 0 is 1970-01-01. Negative numbers represent earlier days. + */ private[sql] type InternalType = Int @transient private[sql] lazy val tag = typeTag[InternalType] @@ -51,6 +50,11 @@ class DateType private() extends AtomicType { } /** + * The companion case object and the DateType class is separated so the companion object + * also subclasses the class. Otherwise, the companion object would be of type "DateType$" + * in byte code. The DateType class is defined with a private constructor so its companion + * object is the only possible instantiation. + * * @since 1.3.0 */ @Stable diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala index a20f155..8dbe4dd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala @@ -23,16 +23,20 @@ import scala.reflect.runtime.universe.typeTag import org.apache.spark.annotation.Stable /** - * The data type representing `java.sql.Timestamp` values. - * Please use the singleton `DataTypes.TimestampType`. + * The timestamp type represents a time instant in microsecond precision. + * Valid range is [0001-01-01T00:00:00.000000Z, 9999-12-31T23:59:59.999999Z] where + * the left/right-bound is a date and time of the proleptic Gregorian + * calendar in UTC+00:00. + * + * Please use the singleton `DataTypes.TimestampType` to refer the type. 
* @since 1.3.0 */ @Stable class TimestampType private() extends AtomicType { - // The companion object and this class is separated so the companion object also subclasses - // this type. Otherwise, the companion object would be of type "TimestampType$" in byte code. - // Defined with a private constructor so the companion object is the only possible instantiation. + /** + * Internally, a timestamp is stored as the number of microseconds from + * the epoch of 1970-01-01T00:00:00.000000Z (UTC+00:00) + */ private[sql] type InternalType = Long @transient private[sql] lazy val tag = typeTag[InternalType] @@ -48,6 +52,11 @@ class TimestampType private() extends AtomicType { } /** + * The companion case object and its class is separated so the companion object also subclasses + * the TimestampType class. Otherwise, the companion object would be of type "TimestampType$" + * in byte code. Defined with a private constructor so the companion object i
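The internal representations documented in SPARK-27527 are easy to reproduce: a `DateType` value is a signed count of days since 1970-01-01, and a `TimestampType` value is a signed count of microseconds since 1970-01-01T00:00:00Z. A small sketch (helper names are illustrative, not Spark APIs):

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_days(d):
    # DateType internal Int: days since 1970-01-01, negative for earlier dates.
    return (d - EPOCH_DATE).days

def to_epoch_micros(ts):
    # TimestampType internal Long: microseconds since the epoch in UTC.
    delta = ts - EPOCH_TS
    return (delta.days * 86_400_000_000
            + delta.seconds * 1_000_000
            + delta.microseconds)

print(to_epoch_days(date(1970, 1, 2)))    # 1: the day after the epoch
print(to_epoch_days(date(1969, 12, 31)))  # -1: the day before the epoch
print(to_epoch_micros(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))
```

Note the Python `date`/`datetime` types follow the proleptic Gregorian calendar, matching the semantics the new scaladoc describes.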
[spark] branch master updated: [SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 777b797 [SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet 777b797 is described below commit 777b797867045c9717813e9dab2ab9012a1889fc Author: Maxim Gekk AuthorDate: Mon Apr 22 16:34:13 2019 +0900 [SPARK-27522][SQL][TEST] Test migration from INT96 to TIMESTAMP_MICROS for timestamps in parquet ## What changes were proposed in this pull request? Added tests to check migration from `INT96` to `TIMESTAMP_MICROS` (`INT64`) for timestamps in parquet files. In particular: - Append `TIMESTAMP_MICROS` timestamps to **existing parquet** files with `INT96` timestamps - Append `TIMESTAMP_MICROS` timestamps to a table with `INT96` timestamps - Append `INT96` to `TIMESTAMP_MICROS` timestamps in **parquet files** - Append `INT96` to `TIMESTAMP_MICROS` timestamps in a **table** Closes #24417 from MaxGekk/parquet-timestamp-int64-tests. 
Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- .../datasources/parquet/ParquetQuerySuite.scala| 43 ++ 1 file changed, 43 insertions(+) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala index 8cc3bee..4959275 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.execution.datasources.parquet import java.io.File +import java.util.concurrent.TimeUnit import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.parquet.hadoop.ParquetOutputFormat @@ -881,6 +882,48 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } + test("Migration from INT96 to TIMESTAMP_MICROS timestamp type") { +def testMigration(fromTsType: String, toTsType: String): Unit = { + def checkAppend(write: DataFrameWriter[_] => Unit, readback: => DataFrame): Unit = { +def data(start: Int, end: Int): Seq[Row] = (start to end).map { i => + val ts = new java.sql.Timestamp(TimeUnit.SECONDS.toMillis(i)) + ts.setNanos(123456000) + Row(ts) +} +val schema = new StructType().add("time", TimestampType) +val df1 = spark.createDataFrame(sparkContext.parallelize(data(0, 1)), schema) +withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> fromTsType) { + write(df1.write) +} +val df2 = spark.createDataFrame(sparkContext.parallelize(data(2, 10)), schema) +withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> toTsType) { + write(df2.write.mode(SaveMode.Append)) +} +Seq("true", "false").foreach { vectorized => + withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized) { +checkAnswer(readback, df1.unionAll(df2)) + } +} + } + + Seq(false, true).foreach { mergeSchema => +withTempPath { file 
=> + checkAppend(_.parquet(file.getCanonicalPath), +spark.read.option("mergeSchema", mergeSchema).parquet(file.getCanonicalPath)) +} + +withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> mergeSchema.toString) { + val tableName = "parquet_timestamp_migration" + withTable(tableName) { +checkAppend(_.saveAsTable(tableName), spark.table(tableName)) + } +} + } +} + +testMigration(fromTsType = "INT96", toTsType = "TIMESTAMP_MICROS") +testMigration(fromTsType = "TIMESTAMP_MICROS", toTsType = "INT96") + } } object TestingUDT { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
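The two parquet timestamp encodings being migrated between differ in layout: `TIMESTAMP_MICROS` is an `INT64` of microseconds since the epoch, while `INT96` packs nanoseconds-within-day together with a Julian day number. A simplified sketch of that conversion (the real `INT96` value is a single 12-byte little-endian field; the split into a tuple here is for illustration):

```python
JULIAN_DAY_OF_EPOCH = 2_440_588          # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000

def micros_to_int96(epoch_micros):
    """Split micros-since-epoch into INT96 parts:
    (Julian day, nanoseconds within that day)."""
    days, micros_of_day = divmod(epoch_micros, MICROS_PER_DAY)
    return JULIAN_DAY_OF_EPOCH + days, micros_of_day * 1_000

def int96_to_micros(julian_day, nanos_of_day):
    """Inverse conversion back to micros-since-epoch."""
    return ((julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY
            + nanos_of_day // 1_000)

ts = 86_400_000_000 + 123_456            # 1970-01-02T00:00:00.123456Z
jd, nanos = micros_to_int96(ts)
print(jd, nanos)                         # 2440589 123456000
```

Avoiding this Julian-day round trip on write is one reason the companion change (SPARK-27528) makes `TIMESTAMP_MICROS` the default output type.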
[spark] branch master updated: [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d36cce1 [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds d36cce1 is described below commit d36cce18e262dc9cbd687ef42f8b67a62f0a3e22 Author: Bryan Cutler AuthorDate: Mon Apr 22 19:30:31 2019 +0900 [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds ## What changes were proposed in this pull request? This increases the minimum supported version of pyarrow to 0.12.1 and removes workarounds in pyspark to remain compatible with prior versions. This means that users will need to have at least pyarrow 0.12.1 installed and available in the cluster or an `ImportError` will be raised to indicate an upgrade is needed. ## How was this patch tested? Existing tests using: Python 2.7.15, pyarrow 0.12.1, pandas 0.24.2 Python 3.6.7, pyarrow 0.12.1, pandas 0.24.0 Closes #24298 from BryanCutler/arrow-bump-min-pyarrow-SPARK-27276. 
Authored-by: Bryan Cutler Signed-off-by: HyukjinKwon --- python/pyspark/serializers.py | 53 ++ python/pyspark/sql/dataframe.py| 8 +-- python/pyspark/sql/session.py | 7 +-- python/pyspark/sql/tests/test_arrow.py | 39 +++--- python/pyspark/sql/tests/test_pandas_udf.py| 48 ++--- .../sql/tests/test_pandas_udf_grouped_map.py | 28 -- python/pyspark/sql/tests/test_pandas_udf_scalar.py | 30 +++ python/pyspark/sql/types.py| 63 -- python/pyspark/sql/utils.py| 2 +- python/setup.py| 7 +-- 10 files changed, 67 insertions(+), 218 deletions(-) diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py index ed419db..6058e94 100644 --- a/python/pyspark/serializers.py +++ b/python/pyspark/serializers.py @@ -260,10 +260,14 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer): self._safecheck = safecheck self._assign_cols_by_name = assign_cols_by_name -def arrow_to_pandas(self, arrow_column, data_type): -from pyspark.sql.types import _arrow_column_to_pandas, _check_series_localize_timestamps +def arrow_to_pandas(self, arrow_column): +from pyspark.sql.types import _check_series_localize_timestamps + +# If the given column is a date type column, creates a series of datetime.date directly +# instead of creating datetime64[ns] as intermediate data to avoid overflow caused by +# datetime64[ns] type handling. 
+s = arrow_column.to_pandas(date_as_object=True) -s = _arrow_column_to_pandas(arrow_column, data_type) s = _check_series_localize_timestamps(s, self._timezone) return s @@ -275,8 +279,6 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer): :param series: A single pandas.Series, list of Series, or list of (series, arrow_type) :return: Arrow RecordBatch """ -import decimal -from distutils.version import LooseVersion import pandas as pd import pyarrow as pa from pyspark.sql.types import _check_series_convert_timestamps_internal @@ -289,24 +291,10 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer): def create_array(s, t): mask = s.isnull() # Ensure timestamp series are in expected form for Spark internal representation -# TODO: maybe don't need None check anymore as of Arrow 0.9.1 if t is not None and pa.types.is_timestamp(t): s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone) # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2 return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False) -elif t is not None and pa.types.is_string(t) and sys.version < '3': -# TODO: need decode before converting to Arrow in Python 2 -# TODO: don't need as of Arrow 0.9.1 -return pa.Array.from_pandas(s.apply( -lambda v: v.decode("utf-8") if isinstance(v, str) else v), mask=mask, type=t) -elif t is not None and pa.types.is_decimal(t) and \ -LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"): -# TODO: see ARROW-2432. Remove when the minimum PyArrow version becomes 0.10.0. -return pa.Array.from_pandas(s.apply( -lambda v: decimal.D
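The `date_as_object=True` comment above refers to an overflow hazard worth making concrete: pandas' `datetime64[ns]` dtype stores int64 nanoseconds since the epoch, so it cannot represent dates past the year 2262, while `datetime.date` objects can. The bound can be computed directly:

```python
from datetime import datetime, timedelta

# int64 nanoseconds since the epoch run out in April 2262, which is why
# converting far-future dates through datetime64[ns] as an intermediate
# representation can overflow, and why the serializer asks Arrow for
# datetime.date objects directly instead.
max_ns = 2**63 - 1
limit = datetime(1970, 1, 1) + timedelta(microseconds=max_ns // 1000)
print(limit)  # 2262-04-11 23:47:16.854775
```

Any date beyond that limit survives as a `datetime.date` but would overflow an int64 nanosecond counter.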
[spark] branch master updated: [SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 93a264d [SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks 93a264d is described below commit 93a264d05a55c2617d34e977dbaf182987187a27 Author: Maxim Gekk AuthorDate: Tue Apr 23 11:09:14 2019 +0900 [SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks ## What changes were proposed in this pull request? Added new JSON benchmarks related to date and timestamps operations: - Write date/timestamp to JSON files - `to_json()` and `from_json()` for dates and timestamps - Read date/timestamps from JSON files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing JSON benchmarks are ported on `NoOp` datasource. Closes #24430 from MaxGekk/json-datetime-benchmark. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- sql/core/benchmarks/JSONBenchmark-results.txt | 79 + .../execution/datasources/json/JsonBenchmark.scala | 126 - 2 files changed, 179 insertions(+), 26 deletions(-) diff --git a/sql/core/benchmarks/JSONBenchmark-results.txt b/sql/core/benchmarks/JSONBenchmark-results.txt index 2b784c3..7846983 100644 --- a/sql/core/benchmarks/JSONBenchmark-results.txt +++ b/sql/core/benchmarks/JSONBenchmark-results.txt @@ -7,77 +7,106 @@ Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz JSON schema inferring:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 51280 51722 420 2.0 512.8 1.0X -UTF-8 is set 75009 77276 1963 1.3 750.1 0.7X +No encoding 50949 51086 150 2.0 509.5 1.0X +UTF-8 is set 72012 72147 120 1.4 720.1 0.7X Preparing data for benchmarking ... 
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz count a short column: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 39675 39738 83 2.5 396.7 1.0X -UTF-8 is set 62755 64399 1436 1.6 627.5 0.6X +No encoding 36799 36891 80 2.7 368.0 1.0X +UTF-8 is set 59796 59880 74 1.7 598.0 0.6X Preparing data for benchmarking ... Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 56429 56468 65 0.25642.9 1.0X -UTF-8 is set 81078 81454 374 0.18107.8 0.7X +No encoding 55803 55967 152 0.25580.3 1.0X +UTF-8 is set 80623 80825 178 0.18062.3 0.7X Preparing data for benchmarking ... Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz select wide row: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative -No encoding 95329 95557 265 0.0 190658.2 1.0X -UTF-8 is set 102827 102967 166 0.0 205654.2 0.9X +No encoding 84263 85750 1476 0.0 168526.2 1.0X +UTF-8 is set
[spark] branch master updated: [SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 55f26d8 [SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks 55f26d8 is described below commit 55f26d809008d26e9727874128aee0a61dcfea00 Author: Maxim Gekk AuthorDate: Tue Apr 23 11:08:02 2019 +0900 [SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks ## What changes were proposed in this pull request? Added new CSV benchmarks related to date and timestamps operations: - Write date/timestamp to CSV files - `to_csv()` and `from_csv()` for dates and timestamps - Read date/timestamps from CSV files, and infer schemas - Parse and infer schemas from `Dataset[String]` Also existing CSV benchmarks are ported on `NoOp` datasource. Closes #24429 from MaxGekk/csv-timestamp-benchmark. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- sql/core/benchmarks/CSVBenchmark-results.txt | 73 +--- .../execution/datasources/csv/CSVBenchmark.scala | 201 ++--- 2 files changed, 229 insertions(+), 45 deletions(-) diff --git a/sql/core/benchmarks/CSVBenchmark-results.txt b/sql/core/benchmarks/CSVBenchmark-results.txt index 4fef15b..888c2ce 100644 --- a/sql/core/benchmarks/CSVBenchmark-results.txt +++ b/sql/core/benchmarks/CSVBenchmark-results.txt @@ -2,29 +2,58 @@ Benchmark to measure CSV read/write performance -Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic -Intel(R) Xeon(R) CPU @ 2.50GHz -Parsing quoted values: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - -One quoted string 49754 / 50158 0.0 995072.2 1.0X +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +Parsing quoted values:Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative + +One quoted string 36998 37134 120 0.0 739953.1 1.0X -Java HotSpot(TM) 64-Bit Server VM 
1.8.0_191-b12 on Linux 3.16.0-31-generic -Intel(R) Xeon(R) CPU @ 2.50GHz -Wide rows with 1000 columns: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - -Select 1000 columns 149402 / 151785 0.0 149401.9 1.0X -Select 100 columns 42986 / 43985 0.0 42986.1 3.5X -Select one column 33764 / 34057 0.0 33763.6 4.4X -count() 9332 / 9508 0.1 9332.2 16.0X -Select 100 columns, one bad input field 50963 / 51512 0.0 50962.5 2.9X -Select 100 columns, corrupt record field69627 / 71029 0.0 69627.5 2.1X +Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4 +Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz +Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative + +Select 1000 columns 140620 141162 737 0.0 140620.5 1.0X +Select 100 columns35170 35287 183 0.0 35170.0 4.0X +Select one column 27711 27927 187 0.0 27710.9 5.1X +count()7707 7804 84 0.17707.4 18.2X +Select 100 columns, one bad input field 41762 41851 117 0.0 41761.8 3.4X +Select 100 columns, corrupt record field 48717 48761 44 0.0 48717.4 2.9X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Linux 3.16.0-31-generic -Intel(R) Xeon(R) CPU @ 2.50GHz -Count a dataset with 10 columns: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - -Select 10 columns + count() 22588 / 22623 0.4 2258.8 1.0X -Select 1 column + count
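The "Best Time(ms) / Avg Time(ms) / Stdev(ms)" columns in the two benchmark result files above follow the usual run-several-times scheme, which can be sketched in a few lines (this mirrors the reporting format only, not Spark's `Benchmark` utility itself):

```python
import statistics
import time

def benchmark(f, iters=3):
    """Run f several times; return (best, avg, stdev) in milliseconds,
    the three columns reported in the JSON/CSV benchmark results."""
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        f()
        times.append((time.perf_counter() - start) * 1000.0)
    return min(times), statistics.mean(times), statistics.stdev(times)

best, avg, stdev = benchmark(lambda: sum(range(100_000)))
print(f"Best {best:.1f} ms, Avg {avg:.1f} ms, Stdev {stdev:.1f} ms")
```

Reporting the best rather than only the average reduces the influence of warm-up and background noise, which is why the result files lead with it.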
[spark] branch master updated: [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 43a73e3 [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default 43a73e3 is described below commit 43a73e387cb843486adcf5b8bbd8b99010ce6e02 Author: Maxim Gekk AuthorDate: Tue Apr 23 11:06:39 2019 +0900 [SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default ## What changes were proposed in this pull request? In the PR, I propose to use the `TIMESTAMP_MICROS` logical type for timestamps written to parquet files. The type matches semantically to Catalyst's `TimestampType`, and stores microseconds since epoch in UTC time zone. This will allow to avoid conversions of microseconds to nanoseconds and to Julian calendar. Also this will reduce sizes of written parquet files. ## How was this patch tested? By existing test suites. Closes #24425 from MaxGekk/parquet-timestamp_micros. Authored-by: Maxim Gekk Signed-off-by: HyukjinKwon --- docs/sql-migration-guide-upgrade.md | 2 ++ .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala| 2 +- .../datasources/parquet/ParquetInteroperabilitySuite.scala| 8 ++-- .../test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala | 2 +- 4 files changed, 10 insertions(+), 4 deletions(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index 90a7d8d..54512ae 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -124,6 +124,8 @@ license: | - In Spark version 2.4, when a spark session is created via `cloneSession()`, the newly created spark session inherits its configuration from its parent `SparkContext` even though the same configuration may exist with a different value in its parent spark session. 
Since Spark 3.0, the configurations of a parent `SparkSession` have a higher precedence over the parent `SparkContext`. + - Since Spark 3.0, parquet logical type `TIMESTAMP_MICROS` is used by default while saving `TIMESTAMP` columns. In Spark version 2.4 and earlier, `TIMESTAMP` columns are saved as `INT96` in parquet files. To set `INT96` to `spark.sql.parquet.outputTimestampType` restores the previous behavior. + ## Upgrading from Spark SQL 2.4 to 2.4.1 - The value of `spark.executor.heartbeatInterval`, when specified without units like "30" rather than "30s", was diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index b223a48..9ebd2c0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -405,7 +405,7 @@ object SQLConf { .stringConf .transform(_.toUpperCase(Locale.ROOT)) .checkValues(ParquetOutputTimestampType.values.map(_.toString)) -.createWithDefault(ParquetOutputTimestampType.INT96.toString) +.createWithDefault(ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) val PARQUET_INT64_AS_TIMESTAMP_MILLIS = buildConf("spark.sql.parquet.int64AsTimestampMillis") .doc(s"(Deprecated since Spark 2.3, please set ${PARQUET_OUTPUT_TIMESTAMP_TYPE.key}.) 
" + diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala index f06e186..09793bd 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala @@ -120,8 +120,12 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS ).map { s => java.sql.Timestamp.valueOf(s) } import testImplicits._ // match the column names of the file from impala - val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts") - df.write.parquet(tableDir.getAbsolutePath) + withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> +SQLConf.ParquetOutputTimestampType.INT96.toString) { +val df = spark.createDataset(ts).toDF().repartition(1) + .withColumnRenamed("value", "ts") +df.write.parquet(tableDir.getAbsolutePath) + } FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq")) Seq(false, true).foreach { int96TimestampConversion => diff --git a/sql/core/src/test/scala/org/ap
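The layout difference the commit message describes can be sketched in plain Python. This is an illustration only, not Spark's writer code: `to_timestamp_micros` and `to_int96` are hypothetical helper names, and the INT96 layout shown (nanoseconds within the day plus a Julian day number) is the legacy Impala/Hive convention the change moves away from.

```python
from datetime import datetime, timezone

EPOCH_JULIAN_DAY = 2440588  # Julian day number of 1970-01-01

def to_timestamp_micros(ts):
    # TIMESTAMP_MICROS: a single int64, microseconds since the Unix epoch in UTC.
    # This matches Catalyst's TimestampType directly, so no conversion is needed.
    return int(ts.timestamp()) * 1_000_000 + ts.microsecond

def to_int96(ts):
    # INT96: 8 bytes of nanoseconds within the day + 4 bytes of Julian day number,
    # i.e. the extra nanosecond/Julian-calendar conversions the change avoids.
    micros = to_timestamp_micros(ts)
    days, micros_in_day = divmod(micros, 86_400 * 1_000_000)
    return (micros_in_day * 1_000, EPOCH_JULIAN_DAY + days)

ts = datetime(2019, 4, 23, 11, 6, 39, tzinfo=timezone.utc)
print(to_timestamp_micros(ts))  # 1556017599000000
print(to_int96(ts))             # (39999000000000, 2458597)
```

The single-int64 form is also what makes the written files smaller than the 12-byte INT96 encoding.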
[spark] branch master updated: [SPARK-27470][PYSPARK] Update pyrolite to 4.23
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8718367 [SPARK-27470][PYSPARK] Update pyrolite to 4.23 8718367 is described below commit 8718367e2e739f1ed82997b9f4a1298b7a1c4e49 Author: Sean Owen AuthorDate: Tue Apr 16 19:41:40 2019 +0900 [SPARK-27470][PYSPARK] Update pyrolite to 4.23 ## What changes were proposed in this pull request? Update pyrolite to 4.23 to pick up bug and security fixes. ## How was this patch tested? Existing tests. Closes #24381 from srowen/SPARK-27470. Authored-by: Sean Owen Signed-off-by: HyukjinKwon --- core/pom.xml | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- dev/deps/spark-deps-hadoop-3.2 | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 45bda44..9d57028 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -347,7 +347,7 @@ net.razorvine pyrolite - 4.13 + 4.23 net.razorvine diff --git a/dev/deps/spark-deps-hadoop-2.7 b/dev/deps/spark-deps-hadoop-2.7 index 00dc2ce..8ae59cc 100644 --- a/dev/deps/spark-deps-hadoop-2.7 +++ b/dev/deps/spark-deps-hadoop-2.7 @@ -170,7 +170,7 @@ parquet-hadoop-bundle-1.6.0.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.13.jar +pyrolite-4.23.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar scala-parser-combinators_2.12-1.1.0.jar diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2 index 97085d6..bbb0d73 100644 --- a/dev/deps/spark-deps-hadoop-3.2 +++ b/dev/deps/spark-deps-hadoop-3.2 @@ -191,7 +191,7 @@ parquet-hadoop-bundle-1.6.0.jar parquet-jackson-1.10.1.jar protobuf-java-2.5.0.jar py4j-0.10.8.1.jar -pyrolite-4.13.jar +pyrolite-4.23.jar re2j-1.1.jar scala-compiler-2.12.8.jar scala-library-2.12.8.jar - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
[spark] branch master updated: [MINOR][BUILD] Update genjavadoc to 0.13
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 596a5ff [MINOR][BUILD] Update genjavadoc to 0.13 596a5ff is described below commit 596a5ff2737e531fbca2f31db1eb9aadd8f08882 Author: Sean Owen AuthorDate: Wed Apr 24 13:44:48 2019 +0900 [MINOR][BUILD] Update genjavadoc to 0.13 ## What changes were proposed in this pull request? Kind of related to https://github.com/gatorsmile/spark/pull/5 - let's update genjavadoc to see if it generates fewer spurious javadoc errors to begin with. ## How was this patch tested? Existing docs build Closes #24443 from srowen/genjavadoc013. Authored-by: Sean Owen Signed-off-by: HyukjinKwon --- .../org/apache/spark/rpc/RpcCallContext.scala | 2 +- .../spark/status/api/v1/ApiRootResource.scala | 6 +++--- .../org/apache/spark/util/SizeEstimator.scala | 2 +- .../streaming/kinesis/SparkAWSCredentials.scala| 4 ++-- .../main/scala/org/apache/spark/ml/ann/Layer.scala | 6 +++--- .../org/apache/spark/ml/attribute/attributes.scala | 2 +- .../org/apache/spark/ml/stat/Correlation.scala | 2 +- .../org/apache/spark/ml/tree/treeParams.scala | 25 +++--- .../mllib/stat/test/StreamingTestMethod.scala | 2 +- project/SparkBuild.scala | 2 +- .../apache/spark/sql/hive/client/HiveClient.scala | 6 +++--- 11 files changed, 30 insertions(+), 29 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala b/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala index 117f51c..f6b2059 100644 --- a/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala +++ b/core/src/main/scala/org/apache/spark/rpc/RpcCallContext.scala @@ -24,7 +24,7 @@ package org.apache.spark.rpc private[spark] trait RpcCallContext { /** - * Reply a message to the sender. If the sender is [[RpcEndpoint]], its [[RpcEndpoint.receive]] + * Reply a message to the sender. 
If the sender is [[RpcEndpoint]], its `RpcEndpoint.receive` * will be called. */ def reply(response: Any): Unit diff --git a/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala b/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala index 84c2ad4..83f76db 100644 --- a/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala +++ b/core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala @@ -77,7 +77,7 @@ private[spark] trait UIRoot { /** * Runs some code with the current SparkUI instance for the app / attempt. * - * @throws NoSuchElementException If the app / attempt pair does not exist. + * @throws java.util.NoSuchElementException If the app / attempt pair does not exist. */ def withSparkUI[T](appId: String, attemptId: Option[String])(fn: SparkUI => T): T @@ -85,8 +85,8 @@ private[spark] trait UIRoot { def getApplicationInfo(appId: String): Option[ApplicationInfo] /** - * Write the event logs for the given app to the [[ZipOutputStream]] instance. If attemptId is - * [[None]], event logs for all attempts of this application will be written out. + * Write the event logs for the given app to the `ZipOutputStream` instance. If attemptId is + * `None`, event logs for all attempts of this application will be written out. */ def writeEventLogs(appId: String, attemptId: Option[String], zipStream: ZipOutputStream): Unit = { Response.serverError() diff --git a/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala b/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala index e09f1fc..09c69f5 100644 --- a/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala +++ b/core/src/main/scala/org/apache/spark/util/SizeEstimator.scala @@ -34,7 +34,7 @@ import org.apache.spark.util.collection.OpenHashSet /** * A trait that allows a class to give [[SizeEstimator]] more accurate size estimation. * When a class extends it, [[SizeEstimator]] will query the `estimatedSize` first. 
- * If `estimatedSize` does not return [[None]], [[SizeEstimator]] will use the returned size + * If `estimatedSize` does not return `None`, [[SizeEstimator]] will use the returned size * as the size of the object. Otherwise, [[SizeEstimator]] will do the estimation work. * The difference between a [[KnownSizeEstimation]] and * [[org.apache.spark.util.collection.SizeTracker]] is that, a diff --git a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SparkAWSCredentials.scala b/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SparkAWSCredentials.scala index dcb60b2..7488971 100644 --- a/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SparkAWSCredentials.sc
[spark] branch master updated: [SPARK-27512][SQL] Avoid to replace ', ' in CSV's decimal type inference for backward compatibility
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a30983d [SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility a30983d is described below commit a30983db575de5c87b3a4698b223229327fd65cf Author: HyukjinKwon AuthorDate: Wed Apr 24 16:22:07 2019 +0900 [SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility ## What changes were proposed in this pull request? The code below currently infers as decimal but previously it was inferred as string. **In branch-2.4**, type inference path for decimal and parsing data are different. https://github.com/apache/spark/blob/2a8343121e62aabe5c69d1e20fbb2c01e2e520e7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L153 https://github.com/apache/spark/blob/c284c4e1f6f684ca8db1cc446fdcc43b46e3413c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L125 So the code below: ```scala scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema() ``` produced string as its type. ``` root |-- _c0: string (nullable = true) ``` **In the current master**, it now infers decimal as below: ``` root |-- _c0: decimal(2,0) (nullable = true) ``` It happened after https://github.com/apache/spark/pull/22979 because, now after this PR, we only have one way to parse decimal: https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L92 **After the fix:** ``` root |-- _c0: string (nullable = true) ``` This PR proposes to restore the previous behaviour back in `CSVInferSchema`. ## How was this patch tested? Manually tested and unit tests were added. 
Closes #24437 from HyukjinKwon/SPARK-27512. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala | 7 ++- .../org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala| 4 +++- .../org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala | 7 +++ 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala index ae9f057..03cc3cb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.catalyst.csv +import java.util.Locale + import scala.util.control.Exception.allCatch import org.apache.spark.rdd.RDD @@ -32,7 +34,10 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable { options.zoneId, options.locale) - private val decimalParser = { + private val decimalParser = if (options.locale == Locale.US) { +// Special handling the default locale for backward compatibility +s: String => new java.math.BigDecimal(s) + } else { ExprUtils.getDecimalParser(options.locale) } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala index c2b525a..24d909e 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchemaSuite.scala @@ -185,6 +185,8 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper { assert(inferSchema.inferField(NullType, input) == expectedType) } -Seq("en-US", "ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0))) +// input like '1,0' is inferred as strings 
for backward compatibility. +Seq("en-US").foreach(checkDecimalInfer(_, StringType)) +Seq("ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0))) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index 6584a30..90deade 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala +++ b/sql/c
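The restored two-path behavior can be modeled in Python (a simplified sketch, not Spark's code: `parse_decimal_default` stands in for `new java.math.BigDecimal(s)` on the default en-US path, and `parse_decimal_locale` is a hypothetical simplification of the `DecimalFormat`-based parser from `ExprUtils.getDecimalParser`).

```python
from decimal import Decimal, InvalidOperation

def parse_decimal_default(s):
    # Default en-US path: strict parse, like new java.math.BigDecimal(s),
    # which rejects grouping separators such as ','.
    return Decimal(s)

def parse_decimal_locale(s, grouping=".", decimal_point=","):
    # Locale-aware path (simplified model for e.g. de-DE, where '.' groups
    # digits and ',' is the decimal point).
    return Decimal(s.replace(grouping, "").replace(decimal_point, "."))

def infer_field(s):
    # Sketch of the inference step: decimal if the default parser accepts the
    # token, otherwise fall back to string.
    try:
        parse_decimal_default(s)
        return "decimal"
    except InvalidOperation:
        return "string"

print(infer_field("1,2"))               # string -- backward-compatible again
print(infer_field("1.2"))               # decimal
print(parse_decimal_locale("1.234,5"))  # 1234.5 under the de-DE-style parser
```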
[spark] branch master updated: [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2234667 [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite 2234667 is described below commit 2234667b159bf19a68758da3ff20cfae3c058c25 Author: Wenchen Fan AuthorDate: Fri Apr 26 16:37:43 2019 +0900 [SPARK-27563][SQL][TEST] automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/ ## How was this patch tested? manually. Closes #24454 from cloud-fan/test. Authored-by: Wenchen Fan Signed-off-by: HyukjinKwon --- .../sql/hive/HiveExternalCatalogVersionsSuite.scala | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala index 0a05ec5..ec10295 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala @@ -22,6 +22,7 @@ import java.nio.charset.StandardCharsets import java.nio.file.{Files, Paths} import scala.sys.process._ +import scala.util.control.NonFatal import org.apache.hadoop.conf.Configuration @@ -169,6 +170,10 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { """.stripMargin.getBytes("utf8")) // scalastyle:on line.size.limit +if (PROCESS_TABLES.testingVersions.isEmpty) { + fail("Fail to get the latest Spark versions to test.") +} + PROCESS_TABLES.testingVersions.zipWithIndex.foreach { case (version, index) => val sparkHome = new File(sparkTestingDir,
s"spark-$version") if (!sparkHome.exists()) { @@ -206,7 +211,19 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils { object PROCESS_TABLES extends QueryTest with SQLTestUtils { // Tests the latest version of every release line. - val testingVersions = Seq("2.3.3", "2.4.2") + val testingVersions: Seq[String] = { +import scala.io.Source +try { + Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString +.split("\n") +.filter(_.contains("""<li><a href="spark-""")) +.map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1)) +.filter(_ < org.apache.spark.SPARK_VERSION) +} catch { + // do not throw exception during object initialization. + case NonFatal(_) => Nil +} + } protected var spark: SparkSession = _ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
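The list-then-filter logic of the suite can be sketched in Python against a canned directory page. The HTML below is an illustrative snapshot of the dist.apache.org listing, and `SPARK_VERSION` is an assumed value, not read from a real build.

```python
import re

# Illustrative snapshot of https://dist.apache.org/repos/dist/release/spark/
listing = """
<li><a href="spark-2.3.3/">spark-2.3.3/</a></li>
<li><a href="spark-2.4.2/">spark-2.4.2/</a></li>
<li><a href="KEYS">KEYS</a></li>
"""

SPARK_VERSION = "3.0.0-SNAPSHOT"  # assumed version of the build under test

def downloadable_versions(html):
    # Pull the spark-X.Y.Z directory links, then keep only versions older than
    # the one being built (mirroring the suite's final filter).
    versions = re.findall(r'<a href="spark-(\d\.\d\.\d)/">', html)
    return [v for v in versions if v < SPARK_VERSION]

print(downloadable_versions(listing))  # ['2.3.3', '2.4.2']
```

Returning an empty list on any fetch/parse failure (rather than raising) matches the suite's choice to avoid exceptions during object initialization and fail the test explicitly instead.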
[spark] branch master updated: [SPARK-28129][SQL][TEST] Port float8.sql
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f74ad3d [SPARK-28129][SQL][TEST] Port float8.sql f74ad3d is described below commit f74ad3d7004e833b6dbc07d6281407ab89ef2d32 Author: Yuming Wang AuthorDate: Tue Jul 16 19:31:20 2019 +0900 [SPARK-28129][SQL][TEST] Port float8.sql ## What changes were proposed in this pull request? This PR is to port float8.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/float8.sql The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/float8.out When porting the test cases, found six PostgreSQL specific features that do not exist in Spark SQL: [SPARK-28060](https://issues.apache.org/jira/browse/SPARK-28060): Double type can not accept some special inputs [SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Spark SQL does not support prefix operators `@` and `|/` [SPARK-28061](https://issues.apache.org/jira/browse/SPARK-28061): Support for converting float to binary format [SPARK-23906](https://issues.apache.org/jira/browse/SPARK-23906): Support Truncate number [SPARK-28134](https://issues.apache.org/jira/browse/SPARK-28134): Missing Trigonometric Functions Also, found two bugs: [SPARK-28024](https://issues.apache.org/jira/browse/SPARK-28024): Incorrect value when out of range [SPARK-28135](https://issues.apache.org/jira/browse/SPARK-28135): ceil/ceiling/floor/power returns incorrect values Also, found four inconsistent behaviors: [SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert bad inputs to NULL [SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL returns NULL when dividing by zero [SPARK-28007](https://issues.apache.org/jira/browse/SPARK-28007): Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres ## How was this patch tested? N/A Closes #24931 from wangyum/SPARK-28129. Authored-by: Yuming Wang Signed-off-by: HyukjinKwon --- .../resources/sql-tests/inputs/pgSQL/float8.sql| 500 .../sql-tests/results/pgSQL/float8.sql.out | 839 + 2 files changed, 1339 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float8.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float8.sql new file mode 100644 index 000..6f8e3b5 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/float8.sql @@ -0,0 +1,500 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- FLOAT8 +-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/float8.sql + +CREATE TABLE FLOAT8_TBL(f1 double) USING parquet; + +INSERT INTO FLOAT8_TBL VALUES ('0.0 '); +INSERT INTO FLOAT8_TBL VALUES ('1004.30 '); +INSERT INTO FLOAT8_TBL VALUES (' -34.84'); +INSERT INTO FLOAT8_TBL VALUES ('1.2345678901234e+200'); +INSERT INTO FLOAT8_TBL VALUES ('1.2345678901234e-200'); + +-- [SPARK-28024] Incorrect numeric values when out of range +-- test for underflow and overflow handling +SELECT double('10e400'); +SELECT double('-10e400'); +SELECT double('10e-400'); +SELECT double('-10e-400'); + +-- [SPARK-28061] Support for converting float to binary format +-- test smallest normalized input +-- SELECT float8send('2.2250738585072014E-308'::float8); + +-- [SPARK-27923] Spark SQL insert there bad inputs to NULL +-- bad input +-- INSERT INTO FLOAT8_TBL VALUES (''); +-- INSERT INTO FLOAT8_TBL VALUES (' '); +-- INSERT INTO FLOAT8_TBL VALUES ('xyz'); +-- INSERT INTO FLOAT8_TBL VALUES ('5.0.0'); +-- INSERT INTO FLOAT8_TBL VALUES ('5 . 0'); +-- INSERT INTO FLOAT8_TBL VALUES ('5. 
0'); +-- INSERT INTO FLOAT8_TBL VALUES ('- 3'); +-- INSERT INTO FLOAT8_TBL VALUES ('123 5'); + +-- special inputs +SELECT double('NaN'); +-- [SPARK-28060] Double type can not accept some special inputs +SELECT double('nan'); +SELECT double(' NAN '); +SELECT double('infinity'); +SELECT double(' -INFINiTY '); +-- [SPARK-27923] Spark SQL insert there bad special inputs to NULL +-- bad special inputs +SELECT double('N A N'); +SELECT double('NaN x'); +SELECT double(' INFINITYx'); + +SELECT double('Infinity') + 100.0; +-- [SPARK-27768] Infinity, -Infinity, NaN should be recognized in a case insensitive manner +SELECT double('Infinity') / double('Infinity'); +SELECT double('NaN') / double('NaN'); +-- [SPARK-28315] Decimal can not accept NaN as input +SELECT double(decimal('nan
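Several of the linked tickets concern lenient parsing of special floating-point inputs. A rough Python model of the cast behavior under discussion (hedged: Spark's actual rules differ per ticket, e.g. case sensitivity before SPARK-27768, and out-of-range handling per SPARK-28024):

```python
import math

def cast_to_double(s):
    # Sketch: trim, try to parse, and return None (Spark's NULL) for bad input
    # instead of raising -- the SPARK-27923 behavior the test comments describe.
    try:
        return float(s.strip())
    except ValueError:
        return None

print(cast_to_double(" -INFINiTY "))  # -inf (Python's float() ignores case)
print(cast_to_double("N A N"))        # None -> NULL, not an error
print(cast_to_double("10e400"))       # inf: overflow saturates here, which is
                                      # the out-of-range case SPARK-28024 covers
print(float("Infinity") / float("Infinity"))  # nan
```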
[spark] branch master updated: [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d83f84a [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles d83f84a is described below commit d83f84a1229138b7935b4b18da80a96d3c5c3dde Author: Josh Rosen AuthorDate: Wed Jun 26 09:11:28 2019 +0900 [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles ## What changes were proposed in this pull request? Spark's `InMemoryFileIndex` contains two places where `FileNotFound` exceptions are caught and logged as warnings (during [directory listing](https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274) and [block location lookup](https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources [...] I think that this is a dangerous default behavior because it can mask bugs caused by race conditions (e.g. overwriting a table while it's being read) or S3 consistency issues (there's more discussion on this in the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-27676)). Failing fast when we detect missing files is not sufficient to make concurrent table reads/writes or S3 listing safe (there are other classes of eventual consistency issues to worry about), but I think it's [...] 
There may be some cases where users _do_ want to ignore missing files, but I think that should be an opt-in behavior via the existing `spark.sql.files.ignoreMissingFiles` flag (the current behavior is itself race-prone because a file might be deleted between catalog listing and query execution time, triggering FileNotFoundExceptions on executors (which are handled in a way that _does_ respect `ignoreMissingFiles`)). This PR updates `InMemoryFileIndex` to guard the log-and-ignore-FileNotFoundException behind the existing `spark.sql.files.ignoreMissingFiles` flag. **Note**: this is a change of default behavior, so I think it needs to be mentioned in release notes. ## How was this patch tested? New unit tests to simulate file-deletion race conditions, tested with both values of the `ignoreMissingFiles` flag. Closes #24668 from JoshRosen/SPARK-27676. Lead-authored-by: Josh Rosen Co-authored-by: Josh Rosen Signed-off-by: HyukjinKwon --- docs/sql-migration-guide-upgrade.md| 2 + .../spark/sql/execution/command/CommandUtils.scala | 2 +- .../execution/datasources/InMemoryFileIndex.scala | 71 ++-- .../sql/execution/datasources/FileIndexSuite.scala | 126 - 4 files changed, 188 insertions(+), 13 deletions(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index b062a04..c920f06 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -145,6 +145,8 @@ license: | - Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is obtained, then `exists` will return `null` instead of `false`. For example, `exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous behaviour can be restored by setting `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic` to `false`.
they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous versions, these missing files or subdirectories would be i [...] + ## Upgrading from Spark SQL 2.4 to 2.4.1 - The value of `spark.executor.heartbeatInterval`, when specified without units like "30" rather than "30s", was diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala index cac2519..b644e6d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala @@ -67,7 +67,7 @@ object CommandUtils extends Logging { override def accept(path: Path): Boolean = isDataPath(path, stagingDir) } val fileStatusSeq = InMemoryFileIndex.bulkListLeaf
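The race and the flag's effect can be reproduced with a small Python sketch. This is illustrative only: `list_leaf_files` and the `on_visit` deletion hook are hypothetical stand-ins for `InMemoryFileIndex`'s recursive listing and the new unit tests' file-deletion trick.

```python
import os, shutil, tempfile

def list_leaf_files(path, ignore_missing_files=False, on_visit=lambda p: None):
    on_visit(path)  # test hook: lets us delete a directory mid-listing
    try:
        entries = sorted(os.listdir(path))
    except FileNotFoundError:
        if ignore_missing_files:
            return []  # old default: log a warning and skip
        raise          # new default: fail fast on the inconsistency
    files = []
    for entry in entries:
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            files += list_leaf_files(full, ignore_missing_files, on_visit)
        else:
            files.append(full)
    return files

root = tempfile.mkdtemp()
sub = os.path.join(root, "sub")

def racy_delete(p):
    # Simulate a concurrent writer deleting `sub` after it was seen in the
    # parent listing but before being listed itself.
    if p == sub:
        shutil.rmtree(sub, ignore_errors=True)

os.makedirs(sub)
print(list_leaf_files(root, ignore_missing_files=True, on_visit=racy_delete))  # []

os.makedirs(sub)
try:
    list_leaf_files(root, ignore_missing_files=False, on_visit=racy_delete)
except FileNotFoundError:
    print("listing failed fast")
```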
[spark] branch master updated: [SPARK-28054][SQL][FOLLOW-UP] Fix error when insert Hive partitioned table dynamically where partition name is upper case
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f148674 [SPARK-28054][SQL][FOLLOW-UP] Fix error when insert Hive partitioned table dynamically where partition name is upper case f148674 is described below commit f1486742fa032d838c0730fcb968e42ac145acc8 Author: Liang-Chi Hsieh AuthorDate: Tue Jul 2 14:57:24 2019 +0900 [SPARK-28054][SQL][FOLLOW-UP] Fix error when insert Hive partitioned table dynamically where partition name is upper case ## What changes were proposed in this pull request? This is a small follow-up for SPARK-28054 to fix wrong indent and use `withSQLConf` as suggested by gatorsmile. ## How was this patch tested? Existing tests. Closes #24971 from viirya/SPARK-28054-followup. Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- .../spark/sql/hive/execution/SaveAsHiveFile.scala | 2 +- .../spark/sql/hive/execution/HiveQuerySuite.scala | 21 +++-- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala index 234acb7..62d3bad 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala @@ -89,7 +89,7 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand { // we also need to lowercase the column names in written partition paths. 
// scalastyle:off caselocale val hiveCompatiblePartitionColumns = partitionAttributes.map { attr => - attr.withName(attr.name.toLowerCase) + attr.withName(attr.name.toLowerCase) } // scalastyle:on caselocale diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala index 13a533c..6986963 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala @@ -1190,19 +1190,20 @@ class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAnd } test("SPARK-28054: Unable to insert partitioned table when partition name is upper case") { -withTable("spark_28054_test") { - sql("set hive.exec.dynamic.partition.mode=nonstrict") - sql("CREATE TABLE spark_28054_test (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING)") +withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") { + withTable("spark_28054_test") { +sql("CREATE TABLE spark_28054_test (KEY STRING, VALUE STRING) PARTITIONED BY (DS STRING)") - sql("INSERT INTO TABLE spark_28054_test PARTITION(DS) SELECT 'k' KEY, 'v' VALUE, '1' DS") +sql("INSERT INTO TABLE spark_28054_test PARTITION(DS) SELECT 'k' KEY, 'v' VALUE, '1' DS") - assertResult(Array(Row("k", "v", "1"))) { -sql("SELECT * from spark_28054_test").collect() - } +assertResult(Array(Row("k", "v", "1"))) { + sql("SELECT * from spark_28054_test").collect() +} - sql("INSERT INTO TABLE spark_28054_test PARTITION(ds) SELECT 'k' key, 'v' value, '2' ds") - assertResult(Array(Row("k", "v", "1"), Row("k", "v", "2"))) { -sql("SELECT * from spark_28054_test").collect() +sql("INSERT INTO TABLE spark_28054_test PARTITION(ds) SELECT 'k' key, 'v' value, '2' ds") +assertResult(Array(Row("k", "v", "1"), Row("k", "v", "2"))) { + sql("SELECT * from spark_28054_test").collect() +} } } } - To unsubscribe, e-mail: 
commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (0a4f985c -> 02f4763)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 0a4f985c [SPARK-23098][SQL] Migrate Kafka Batch source to v2. add 02f4763 [SPARK-28198][PYTHON] Add mapPartitionsInPandas to allow an iterator of DataFrames No new revisions were added by this update. Summary of changes: .../org/apache/spark/api/python/PythonRunner.scala | 2 + python/pyspark/rdd.py | 1 + python/pyspark/sql/dataframe.py| 48 +++- python/pyspark/sql/tests/test_pandas_udf_iter.py | 135 + python/pyspark/worker.py | 62 ++ .../plans/logical/pythonLogicalOperators.scala | 12 ++ .../main/scala/org/apache/spark/sql/Dataset.scala | 21 +++- .../spark/sql/execution/SparkStrategies.scala | 2 + .../python/MapPartitionsInPandasExec.scala | 95 +++ 9 files changed, 353 insertions(+), 25 deletions(-) create mode 100644 python/pyspark/sql/tests/test_pandas_udf_iter.py create mode 100644 sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapPartitionsInPandasExec.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
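The new API's contract — a function that consumes an iterator of pandas DataFrames per partition and yields an iterator of DataFrames back — can be modeled with plain Python generators. In this dependency-free sketch, lists of dicts stand in for pandas DataFrames, and the function names are illustrative, not the PySpark API.

```python
def map_partitions_in_batches(partitions, func):
    # Per partition, hand `func` an iterator of batches and stream its output
    # batches back -- the shape of the mapPartitionsInPandas contract.
    for partition in partitions:
        yield list(func(iter(partition)))

def drop_nonpositive(batches):
    for batch in batches:  # one "DataFrame" at a time, never the whole partition
        yield [row for row in batch if row["id"] > 0]

partitions = [
    [[{"id": -1}, {"id": 1}]],   # partition 0: one batch
    [[{"id": 2}], [{"id": 0}]],  # partition 1: two batches
]
print(list(map_partitions_in_batches(partitions, drop_nonpositive)))
# [[[{'id': 1}]], [[{'id': 2}], []]]
```

Streaming batch-by-batch is the point of the iterator form: the UDF never has to materialize an entire partition in memory at once.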
[spark] branch master updated: [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 048224c [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation 048224c is described below commit 048224ce9a3bdb304ba24852ecc66c7f14c25c11 Author: Marco Gaido AuthorDate: Mon Jul 1 11:40:12 2019 +0900 [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation ## What changes were proposed in this pull request? The documentation in `linalg.py` is not consistent. This PR uniforms the documentation. ## How was this patch tested? NA Closes #25011 from mgaido91/SPARK-28170. Authored-by: Marco Gaido Signed-off-by: HyukjinKwon --- python/pyspark/ml/linalg/__init__.py | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/python/pyspark/ml/linalg/__init__.py b/python/pyspark/ml/linalg/__init__.py index f99161c..f6ddc09 100644 --- a/python/pyspark/ml/linalg/__init__.py +++ b/python/pyspark/ml/linalg/__init__.py @@ -386,14 +386,14 @@ class DenseVector(Vector): def toArray(self): """ -Returns an numpy.ndarray +Returns the underlying numpy.ndarray """ return self.array @property def values(self): """ -Returns a list of values +Returns the underlying numpy.ndarray """ return self.array @@ -681,7 +681,7 @@ class SparseVector(Vector): def toArray(self): """ -Returns a copy of this SparseVector as a 1-dimensional NumPy array. +Returns a copy of this SparseVector as a 1-dimensional numpy.ndarray. """ arr = np.zeros((self.size,), dtype=np.float64) arr[self.indices] = self.values @@ -862,7 +862,7 @@ class Matrix(object): def toArray(self): """ -Returns its elements in a NumPy ndarray. +Returns its elements in a numpy.ndarray. 
""" raise NotImplementedError @@ -937,7 +937,7 @@ class DenseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray >>> m = DenseMatrix(2, 2, range(4)) >>> m.toArray() @@ -1121,7 +1121,7 @@ class SparseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray """ A = np.zeros((self.numRows, self.numCols), dtype=np.float64, order='F') for k in xrange(self.colPtrs.size - 1): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new d57b392 [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation d57b392 is described below commit d57b392961167bb8dfd888d79efa947277d62ec9 Author: Marco Gaido AuthorDate: Mon Jul 1 11:40:12 2019 +0900 [SPARK-28170][ML][PYTHON] Uniform Vectors and Matrix documentation ## What changes were proposed in this pull request? The documentation in `linalg.py` is not consistent. This PR uniforms the documentation. ## How was this patch tested? NA Closes #25011 from mgaido91/SPARK-28170. Authored-by: Marco Gaido Signed-off-by: HyukjinKwon (cherry picked from commit 048224ce9a3bdb304ba24852ecc66c7f14c25c11) Signed-off-by: HyukjinKwon --- python/pyspark/ml/linalg/__init__.py | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/python/pyspark/ml/linalg/__init__.py b/python/pyspark/ml/linalg/__init__.py index 9da9836..9261d65 100644 --- a/python/pyspark/ml/linalg/__init__.py +++ b/python/pyspark/ml/linalg/__init__.py @@ -386,14 +386,14 @@ class DenseVector(Vector): def toArray(self): """ -Returns an numpy.ndarray +Returns the underlying numpy.ndarray """ return self.array @property def values(self): """ -Returns a list of values +Returns the underlying numpy.ndarray """ return self.array @@ -681,7 +681,7 @@ class SparseVector(Vector): def toArray(self): """ -Returns a copy of this SparseVector as a 1-dimensional NumPy array. +Returns a copy of this SparseVector as a 1-dimensional numpy.ndarray. """ arr = np.zeros((self.size,), dtype=np.float64) arr[self.indices] = self.values @@ -862,7 +862,7 @@ class Matrix(object): def toArray(self): """ -Returns its elements in a NumPy ndarray. +Returns its elements in a numpy.ndarray. 
""" raise NotImplementedError @@ -937,7 +937,7 @@ class DenseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray >>> m = DenseMatrix(2, 2, range(4)) >>> m.toArray() @@ -1121,7 +1121,7 @@ class SparseMatrix(Matrix): def toArray(self): """ -Return an numpy.ndarray +Return a numpy.ndarray """ A = np.zeros((self.numRows, self.numCols), dtype=np.float64, order='F') for k in xrange(self.colPtrs.size - 1): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7083ec0 [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API 7083ec0 is described below commit 7083ec051ed47b9c6500f2107650814f0ff9206f Author: Liang-Chi Hsieh AuthorDate: Mon Jul 1 13:21:06 2019 +0900 [SPARK-28215][SQL][R] as_tibble was removed from Arrow R API ## What changes were proposed in this pull request? New R api of Arrow has removed `as_tibble` as of https://github.com/apache/arrow/commit/2ef96c8623cbad1770f82e97df733bd881ab967b. Arrow optimization for DataFrame in R doesn't work due to the change. This can be tested as below, after installing latest Arrow: ``` ./bin/sparkR --conf spark.sql.execution.arrow.sparkr.enabled=true ``` ``` > collect(createDataFrame(mtcars)) ``` Before this PR: ``` > collect(createDataFrame(mtcars)) Error in get("as_tibble", envir = asNamespace("arrow")) : object 'as_tibble' not found ``` After: ``` > collect(createDataFrame(mtcars)) mpg cyl disp hp dratwt qsec vs am gear carb 1 21.0 6 160.0 110 3.90 2.620 16.46 0 144 2 21.0 6 160.0 110 3.90 2.875 17.02 0 144 3 22.8 4 108.0 93 3.85 2.320 18.61 1 141 ... ``` ## How was this patch tested? Manual test. Closes #25012 from viirya/SPARK-28215. 
Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- R/pkg/R/DataFrame.R | 10 -- R/pkg/R/deserialize.R | 13 ++--- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 439cad0..6f3c7c1 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -1203,7 +1203,8 @@ setMethod("collect", requireNamespace1 <- requireNamespace if (requireNamespace1("arrow", quietly = TRUE)) { read_arrow <- get("read_arrow", envir = asNamespace("arrow"), inherits = FALSE) -as_tibble <- get("as_tibble", envir = asNamespace("arrow")) +# Arrow drops `as_tibble` since 0.14.0, see ARROW-5190. +useAsTibble <- exists("as_tibble", envir = asNamespace("arrow")) portAuth <- callJMethod(x@sdf, "collectAsArrowToR") port <- portAuth[[1]] @@ -1213,7 +1214,12 @@ setMethod("collect", output <- tryCatch({ doServerAuth(conn, authSecret) arrowTable <- read_arrow(readRaw(conn)) - as.data.frame(as_tibble(arrowTable), stringsAsFactors = stringsAsFactors) + if (useAsTibble) { +as_tibble <- get("as_tibble", envir = asNamespace("arrow")) +as.data.frame(as_tibble(arrowTable), stringsAsFactors = stringsAsFactors) + } else { +as.data.frame(arrowTable, stringsAsFactors = stringsAsFactors) + } }, finally = { close(conn) }) diff --git a/R/pkg/R/deserialize.R b/R/pkg/R/deserialize.R index 191c51e..b38d245 100644 --- a/R/pkg/R/deserialize.R +++ b/R/pkg/R/deserialize.R @@ -237,7 +237,9 @@ readDeserializeInArrow <- function(inputCon) { if (requireNamespace1("arrow", quietly = TRUE)) { RecordBatchStreamReader <- get( "RecordBatchStreamReader", envir = asNamespace("arrow"), inherits = FALSE) -as_tibble <- get("as_tibble", envir = asNamespace("arrow")) +# Arrow drops `as_tibble` since 0.14.0, see ARROW-5190. +useAsTibble <- exists("as_tibble", envir = asNamespace("arrow")) + # Currently, there looks no way to read batch by batch by socket connection in R side, # See ARROW-4512. 
Therefore, it reads the whole Arrow streaming-formatted binary at once @@ -246,8 +248,13 @@ readDeserializeInArrow <- function(inputCon) { arrowData <- readBin(inputCon, raw(), as.integer(dataLen), endian = "big") batches <- RecordBatchStreamReader(arrowData)$batches() -# Read all groupped batches. Tibble -> data.frame is cheap. -lapply(batches, function(batch) as.data.frame(as_tibble(batch))) +if (useAsTibble) { + as_tibble <- get("as_tibble", envir = asNamespace("arrow")) + # Read all groupped batches. Tibble -> data.frame is cheap. + lapply(batches, function(batch) as.data.frame(as_tibble(batch))) +} else { + lapply(batches, function(batch) as.data.frame(batch)) +} } else { stop("'arrow' package should be installed.") }
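The fix above uses a feature-detection pattern: probe the `arrow` namespace for `as_tibble` with `exists()` and fall back to direct conversion when a newer release has removed it. A minimal sketch of the same pattern in Python (the names `collect_table`, `old_arrow`, and `new_arrow` are hypothetical stand-ins, not the real Arrow API):

```python
# Probe for an optional symbol before using it, and fall back to the
# direct conversion path when the library release no longer ships it.
from types import SimpleNamespace

def collect_table(arrow_ns, table):
    if hasattr(arrow_ns, "as_tibble"):
        # Old releases: go through the as_tibble helper first.
        return arrow_ns.as_tibble(table)
    # New releases (as_tibble removed, cf. ARROW-5190): convert directly.
    return list(table)

# Two fake namespaces standing in for old and new library versions.
old_arrow = SimpleNamespace(as_tibble=lambda t: list(reversed(t)))
new_arrow = SimpleNamespace()

print(collect_table(old_arrow, [1, 2, 3]))  # [3, 2, 1]
print(collect_table(new_arrow, [1, 2, 3]))  # [1, 2, 3]
```

Either way the caller gets a converted result, so the code keeps working across library versions without pinning one.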
[spark] branch master updated: [SPARK-28204][SQL][TESTS] Make separate two test cases for column pruning in binary files
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new facf9c3 [SPARK-28204][SQL][TESTS] Make separate two test cases for column pruning in binary files facf9c3 is described below commit facf9c30a283ec682b5adb2e7afdbf5d011e3808 Author: HyukjinKwon AuthorDate: Sat Jun 29 14:05:23 2019 +0900 [SPARK-28204][SQL][TESTS] Make separate two test cases for column pruning in binary files ## What changes were proposed in this pull request? SPARK-27534 missed to address my own comments at https://github.com/WeichenXu123/spark/pull/8 It's better to push this in since the codes are already cleaned up. ## How was this patch tested? Unittests fixed Closes #25003 from HyukjinKwon/SPARK-27534. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../binaryfile/BinaryFileFormatSuite.scala | 88 +++--- 1 file changed, 43 insertions(+), 45 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala index 9e2969b..a66b34f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormatSuite.scala @@ -290,56 +290,54 @@ class BinaryFileFormatSuite extends QueryTest with SharedSQLContext with SQLTest ), true) } + private def readBinaryFile(file: File, requiredSchema: StructType): Row = { +val format = new BinaryFileFormat +val reader = format.buildReaderWithPartitionValues( + sparkSession = spark, + dataSchema = schema, + partitionSchema = StructType(Nil), + requiredSchema = requiredSchema, + filters = Seq.empty, + options = Map.empty, + hadoopConf = 
spark.sessionState.newHadoopConf() +) +val partitionedFile = mock(classOf[PartitionedFile]) +when(partitionedFile.filePath).thenReturn(file.getPath) +val encoder = RowEncoder(requiredSchema).resolveAndBind() +encoder.fromRow(reader(partitionedFile).next()) + } + test("column pruning") { -def getRequiredSchema(fieldNames: String*): StructType = { - StructType(fieldNames.map { -case f if schema.fieldNames.contains(f) => schema(f) -case other => StructField(other, NullType) - }) -} -def read(file: File, requiredSchema: StructType): Row = { - val format = new BinaryFileFormat - val reader = format.buildReaderWithPartitionValues( -sparkSession = spark, -dataSchema = schema, -partitionSchema = StructType(Nil), -requiredSchema = requiredSchema, -filters = Seq.empty, -options = Map.empty, -hadoopConf = spark.sessionState.newHadoopConf() - ) - val partitionedFile = mock(classOf[PartitionedFile]) - when(partitionedFile.filePath).thenReturn(file.getPath) - val encoder = RowEncoder(requiredSchema).resolveAndBind() - encoder.fromRow(reader(partitionedFile).next()) -} -val file = new File(Utils.createTempDir(), "data") -val content = "123".getBytes -Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE) - -read(file, getRequiredSchema(MODIFICATION_TIME, CONTENT, LENGTH, PATH)) match { - case Row(t, c, len, p) => -assert(t === new Timestamp(file.lastModified())) -assert(c === content) -assert(len === content.length) -assert(p.asInstanceOf[String].endsWith(file.getAbsolutePath)) +withTempPath { file => + val content = "123".getBytes + Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE) + + val actual = readBinaryFile(file, StructType(schema.takeRight(3))) + val expected = Row(new Timestamp(file.lastModified()), content.length, content) + + assert(actual === expected) } -file.setReadable(false) -withClue("cannot read content") { + } + + test("column pruning - non-readable file") { +withTempPath { file => + val 
content = "abc".getBytes + Files.write(file.toPath, content, StandardOpenOption.CREATE, StandardOpenOption.WRITE) + file.setReadable(false) + + // If content is selected, it throws an exception because it's not readable. intercept[IOException] { -read(file, getRequiredSchema(CONTENT)) +readBinaryFile(file, StructType(schema(CONTENT) :: Nil)) } -} -assert(read(file, getRequiredSchema(LENGTH)) === Row(content.length), - "
[spark] branch master updated: [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e6a0385 [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause e6a0385 is described below commit e6a0385289f2d2fec05d3fb5f798903de292c381 Author: Liang-Chi Hsieh AuthorDate: Wed Aug 14 00:32:33 2019 +0900 [SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause ## What changes were proposed in this pull request? A GROUPED_AGG pandas python udf can't work, if without group by clause, like `select udf(id) from table`. This doesn't match with aggregate function like sum, count..., and also dataset API like `df.agg(udf(df['id']))`. When we parse a udf (or an aggregate function) like that from SQL syntax, it is known as a function in a project. `GlobalAggregates` rule in analysis makes such project as aggregate, by looking for aggregate expressions. At the moment, we should also look for GROUPED_AGG pandas python udf. ## How was this patch tested? Added tests. Closes #25352 from viirya/SPARK-28422. 
Authored-by: Liang-Chi Hsieh Signed-off-by: HyukjinKwon --- .../sql/tests/test_pandas_udf_grouped_agg.py | 15 ++ .../spark/sql/catalyst/analysis/Analyzer.scala | 5 - .../apache/spark/sql/catalyst/plans/PlanTest.scala | 2 ++ .../python/BatchEvalPythonExecSuite.scala | 24 +- 4 files changed, 44 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py b/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py index 041b2b5..6d460df 100644 --- a/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py +++ b/python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py @@ -474,6 +474,21 @@ class GroupedAggPandasUDFTests(ReusedSQLTestCase): result = df.groupBy('id').agg(f(df['x']).alias('sum')).collect() self.assertEqual(result, expected) +def test_grouped_without_group_by_clause(self): +@pandas_udf('double', PandasUDFType.GROUPED_AGG) +def max_udf(v): +return v.max() + +df = self.spark.range(0, 100) +self.spark.udf.register('max_udf', max_udf) + +with self.tempView("table"): +df.createTempView('table') + +agg1 = df.agg(max_udf(df['id'])) +agg2 = self.spark.sql("select max_udf(id) from table") +assert_frame_equal(agg1.toPandas(), agg2.toPandas()) + if __name__ == "__main__": from pyspark.sql.tests.test_pandas_udf_grouped_agg import * diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index f8eef0c..5a04d57 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1799,15 +1799,18 @@ class Analyzer( def containsAggregates(exprs: Seq[Expression]): Boolean = { // Collect all Windowed Aggregate Expressions. 
- val windowedAggExprs = exprs.flatMap { expr => + val windowedAggExprs: Set[Expression] = exprs.flatMap { expr => expr.collect { case WindowExpression(ae: AggregateExpression, _) => ae + case WindowExpression(e: PythonUDF, _) if PythonUDF.isGroupedAggPandasUDF(e) => e } }.toSet // Find the first Aggregate Expression that is not Windowed. exprs.exists(_.collectFirst { case ae: AggregateExpression if !windowedAggExprs.contains(ae) => ae +case e: PythonUDF if PythonUDF.isGroupedAggPandasUDF(e) && + !windowedAggExprs.contains(e) => e }.isDefined) } } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala index 6e2a842..08f1f87 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala @@ -81,6 +81,8 @@ trait PlanTestBase extends PredicateHelper with SQLHelper { self: Suite => ae.copy(resultId = ExprId(0)) case lv: NamedLambdaVariable => lv.copy(exprId = ExprId(0), value = null) + case udf: PythonUDF => +udf.copy(resultId = ExprId(0)) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExecSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExecSuite.scala index 289cc66..ac5752b 100644 ---
[spark] branch master updated: [SPARK-28556][SQL] QueryExecutionListener should also notify Error
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 196a4d7 [SPARK-28556][SQL] QueryExecutionListener should also notify Error 196a4d7 is described below commit 196a4d7117b122fdb1fc95241613d012e8292734 Author: Shixiong Zhu AuthorDate: Tue Jul 30 11:47:36 2019 +0900 [SPARK-28556][SQL] QueryExecutionListener should also notify Error ## What changes were proposed in this pull request? Right now `Error` is not sent to `QueryExecutionListener.onFailure`. If there is any `Error` (such as `AssertionError`) when running a query, `QueryExecutionListener.onFailure` cannot be triggered. This PR changes `onFailure` to accept a `Throwable` instead. ## How was this patch tested? Jenkins Closes #25292 from zsxwing/fix-QueryExecutionListener. Authored-by: Shixiong Zhu Signed-off-by: HyukjinKwon --- project/MimaExcludes.scala | 6 +- .../apache/spark/sql/execution/SQLExecution.scala | 4 ++-- .../spark/sql/execution/ui/SQLListener.scala | 2 +- .../spark/sql/util/QueryExecutionListener.scala| 4 ++-- .../org/apache/spark/sql/SessionStateSuite.scala | 2 +- .../spark/sql/TestQueryExecutionListener.scala | 2 +- .../test/scala/org/apache/spark/sql/UDFSuite.scala | 2 +- .../sources/v2/FileDataSourceV2FallBackSuite.scala | 6 +++--- .../sql/test/DataFrameReaderWriterSuite.scala | 2 +- .../spark/sql/util/DataFrameCallbackSuite.scala| 24 +++--- .../sql/util/ExecutionListenerManagerSuite.scala | 2 +- 11 files changed, 30 insertions(+), 26 deletions(-) diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 5978f88..51d5861 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -376,7 +376,11 @@ object MimaExcludes { // [SPARK-28199][SS] Remove deprecated ProcessingTime ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.ProcessingTime"), - 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.ProcessingTime$") + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.streaming.ProcessingTime$"), + +// [SPARK-28556][SQL] QueryExecutionListener should also notify Error + ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.util.QueryExecutionListener.onFailure"), + ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.util.QueryExecutionListener.onFailure") ) // Exclude rules for 2.4.x diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala index ca66337..6046805 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala @@ -85,7 +85,7 @@ object SQLExecution { }.getOrElse(callSite.shortForm) withSQLConfPropagated(sparkSession) { -var ex: Option[Exception] = None +var ex: Option[Throwable] = None val startTime = System.nanoTime() try { sc.listenerBus.post(SparkListenerSQLExecutionStart( @@ -99,7 +99,7 @@ object SQLExecution { time = System.currentTimeMillis())) body } catch { - case e: Exception => + case e: Throwable => ex = Some(e) throw e } finally { diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala index 67d1f27..81cbc7f 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala @@ -60,7 +60,7 @@ case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long) @JsonIgnore private[sql] var qe: QueryExecution = null // The exception object that caused this execution to fail. None if the execution doesn't fail. 
- @JsonIgnore private[sql] var executionFailure: Option[Exception] = None + @JsonIgnore private[sql] var executionFailure: Option[Throwable] = None } /** diff --git a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala index 77ae047..2da8469 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala @@ -58,12 +58,12 @@ trait QueryExecutionListener { * @param funcName the name
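The failure mode this commit fixes can be modelled in Python terms (a sketch with hypothetical names, not Spark code): a listener wired to catch `Exception` is never notified for failures outside that hierarchy, just as Scala's `case e: Exception` misses `Error`. Here `SystemExit`, which subclasses only `BaseException`, plays the role of a JVM `Error`.

```python
notified = []

def run_with_listener(body, catch):
    # `catch` models the type the listener hooks; the fix widens it
    # from the Exception analogue to the root of the hierarchy.
    try:
        body()
    except catch as e:
        notified.append(type(e).__name__)

def failing_query():
    raise SystemExit("assertion-like error")

try:
    run_with_listener(failing_query, Exception)    # listener misses it
except SystemExit:
    pass                                           # error escapes unobserved
run_with_listener(failing_query, BaseException)    # listener sees it
print(notified)  # ['SystemExit']
```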
[spark] branch branch-2.3 updated: [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.3 by this push: new e686178 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 e686178 is described below commit e686178ffe7ceb33ee42558ee0b4fe28c417124d Author: WeichenXu AuthorDate: Sat Aug 3 10:31:15 2019 +0900 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit b3394db1930b3c9f55438cb27bb2c584bf041f8e) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index 5ba121f..e1c780f 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1862,6 +1862,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) time.sleep(1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b3394db [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 b3394db is described below commit b3394db1930b3c9f55438cb27bb2c584bf041f8e Author: WeichenXu AuthorDate: Sat Aug 3 10:31:15 2019 +0900 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 ## What changes were proposed in this pull request? This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. ## How was this patch tested? Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu Signed-off-by: HyukjinKwon --- python/pyspark/tests/test_daemon.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/pyspark/tests/test_daemon.py b/python/pyspark/tests/test_daemon.py index 2cdc16c..898fb39 100644 --- a/python/pyspark/tests/test_daemon.py +++ b/python/pyspark/tests/test_daemon.py @@ -47,6 +47,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) time.sleep(1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
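The race the added `time.sleep(1)` papers over can be sketched with threads standing in for the daemon's worker processes (a hypothetical setup, not the pyspark daemon itself): asserting clean shutdown before the spawned worker has exited makes the check flaky.

```python
import threading
import time

workers_alive = []

def daemon_accept():
    # The daemon spawns a worker per accepted connection.
    def worker():
        workers_alive.append(1)
        time.sleep(0.05)       # worker does its job...
        workers_alive.pop()    # ...then exits
    t = threading.Thread(target=worker)
    t.start()
    return t

t = daemon_accept()
# Checking workers_alive immediately here could still see the worker
# (the flaky case); the fix is to pause before the shutdown check.
time.sleep(0.2)
t.join()
print(workers_alive)  # []
```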
[spark] branch branch-2.4 updated: [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 6c61321 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 6c61321 is described below commit 6c613210cd1075f771376e7ecc6dfb08bd620716 Author: WeichenXu AuthorDate: Sat Aug 3 10:31:15 2019 +0900 [SPARK-28582][PYTHON] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 This PR picks up https://github.com/apache/spark/pull/25315 back after removing `Popen.wait` usage which exists in Python 3 only. I saw the last test results wrongly and thought it was passed. Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25343 from HyukjinKwon/SPARK-28582. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit b3394db1930b3c9f55438cb27bb2c584bf041f8e) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index a2d825b..26c9126 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1939,6 +1939,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) time.sleep(1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new dc09a02 [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 dc09a02 is described below commit dc09a02c142d3787e728c8b25eb8417649d98e9f Author: WeichenXu AuthorDate: Fri Aug 2 22:07:06 2019 +0900 [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25315 from WeichenXu123/fix_py37_daemon. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index a2d825b..fef2959 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1916,7 +1916,7 @@ class OutputFormatTests(ReusedPySparkTestCase): class DaemonTests(unittest.TestCase): def connect(self, port): -from socket import socket, AF_INET, SOCK_STREAM +from socket import socket, AF_INET, SOCK_STREAM# request shutdown sock = socket(AF_INET, SOCK_STREAM) sock.connect(('127.0.0.1', port)) # send a split index of -1 to shutdown the worker @@ -1939,9 +1939,12 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. 
+time.sleep(1) + # request shutdown terminator(daemon) -time.sleep(1) +daemon.wait(5) # daemon should no longer accept connections try:
[spark] branch branch-2.4 updated: Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7"
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new a065a50 Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7" a065a50 is described below commit a065a503bcf1ec5b7d49c575af7bc6867c734d90 Author: HyukjinKwon AuthorDate: Fri Aug 2 22:12:04 2019 +0900 Revert "[SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7" This reverts commit dc09a02c142d3787e728c8b25eb8417649d98e9f. --- python/pyspark/tests.py | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index fef2959..a2d825b 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1916,7 +1916,7 @@ class OutputFormatTests(ReusedPySparkTestCase): class DaemonTests(unittest.TestCase): def connect(self, port): -from socket import socket, AF_INET, SOCK_STREAM# request shutdown +from socket import socket, AF_INET, SOCK_STREAM sock = socket(AF_INET, SOCK_STREAM) sock.connect(('127.0.0.1', port)) # send a split index of -1 to shutdown the worker @@ -1939,12 +1939,9 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) -# wait worker process spawned from daemon exit. -time.sleep(1) - # request shutdown terminator(daemon) -daemon.wait(5) +time.sleep(1) # daemon should no longer accept connections try: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 20e46ef [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 20e46ef is described below commit 20e46ef6e3e49e754062717c2cb249c6eb99e86a Author: WeichenXu AuthorDate: Fri Aug 2 22:07:06 2019 +0900 [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7 Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7. I add a sleep after the test connection to daemon. Run test ``` python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests" ``` **Before** Fail on test "test_termination_sigterm". And we can see daemon process do not exit. **After** Test passed Closes #25315 from WeichenXu123/fix_py37_daemon. Authored-by: WeichenXu Signed-off-by: HyukjinKwon (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d) Signed-off-by: HyukjinKwon --- python/pyspark/tests.py | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py index a2d825b..fc0ed41 100644 --- a/python/pyspark/tests.py +++ b/python/pyspark/tests.py @@ -1939,9 +1939,12 @@ class DaemonTests(unittest.TestCase): # daemon should accept connections self.assertTrue(self.connect(port)) +# wait worker process spawned from daemon exit. +time.sleep(1) + # request shutdown terminator(daemon) -time.sleep(1) +daemon.wait(5) # daemon should no longer accept connections try: - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated: [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 21ef0fd  [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7

21ef0fd is described below

commit 21ef0fdf7244a169ac0c3e701cb21c35f1038a5d
Author: WeichenXu
AuthorDate: Fri Aug 2 22:07:06 2019 +0900

    [SPARK-28582][PYSPARK] Fix flaky test DaemonTests.do_termination_test which fail on Python 3.7

    Fix the flaky test DaemonTests.do_termination_test, which fails on Python 3.7, by adding a sleep after the test connects to the daemon.

    Run the test:
    ```
    python/run-tests --python-executables=python3.7 --testname "pyspark.tests.test_daemon DaemonTests"
    ```

    **Before**
    Fails on test "test_termination_sigterm", and we can see the daemon process does not exit.

    **After**
    Test passed.

    Closes #25315 from WeichenXu123/fix_py37_daemon.

    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
    (cherry picked from commit fbeee0c5bcea32346b2279c5b67044f12e5faf7d)
    Signed-off-by: HyukjinKwon
---
 python/pyspark/tests.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 5ba121f..5fc6887 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1862,9 +1862,12 @@ class DaemonTests(unittest.TestCase):
         # daemon should accept connections
         self.assertTrue(self.connect(port))

+        # wait worker process spawned from daemon exit.
+        time.sleep(1)
+
         # request shutdown
         terminator(daemon)
-        time.sleep(1)
+        daemon.wait(5)

         # daemon should no longer accept connections
         try:
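The fix in the commits above replaces a fixed `time.sleep(1)` after termination with a bounded `daemon.wait(5)`, so the assertion only runs once the daemon has really exited. A minimal stand-alone sketch of that shutdown pattern (using `subprocess` and a dummy child process, not the actual PySpark daemon) looks like this:

```python
import subprocess
import sys
import time

def shutdown_daemon(proc, wait_timeout=5):
    """Ask a daemon-style process to stop, then block until it exits.

    Mirrors the fixed test flow: instead of sleeping a fixed second after
    sending the termination signal, wait on the process with a timeout
    (like daemon.wait(5)) so later assertions see a fully stopped daemon.
    """
    proc.terminate()  # request shutdown (SIGTERM)
    try:
        proc.wait(timeout=wait_timeout)  # bounded wait instead of sleep(1)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort if the daemon ignores SIGTERM
        proc.wait()
    return proc.returncode

# Usage: a stand-in "daemon" that would otherwise run for a minute.
daemon = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.2)  # give the child a moment to start (the patch's sleep(1) idea)
rc = shutdown_daemon(daemon)
print(rc is not None)  # True: the process has a recorded exit status
```

The `sleep` before termination corresponds to the patch's "wait worker process spawned from daemon exit" step; the timed `wait` afterwards is what removes the race on Python 3.7.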
[spark] branch master updated (babdba0 -> ef14237)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from babdba0  [SPARK-28728][BUILD] Bump Jackson Databind to 2.9.9.3
  add ef14237  [SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11

No new revisions were added by this update.

Summary of changes:
 python/pyspark/ml/tests/test_algorithms.py | 2 +-
 python/pyspark/mllib/clustering.py         | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)
[spark] branch master updated (31ef268 -> 37eedf6)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 31ef268  [SPARK-28639][CORE][DOC] Configuration doc for Barrier Execution Mode
  add 37eedf6  [SPARK-28652][TESTS][K8S] Add python version check for executor

No new revisions were added by this update.

Summary of changes:
 .../spark/deploy/k8s/integrationtest/PythonTestsSuite.scala     | 6 ++++--
 resource-managers/kubernetes/integration-tests/tests/pyfiles.py | 8 ++++++++
 2 files changed, 12 insertions(+), 2 deletions(-)
[spark] branch master updated (3a4afce -> 3ec24fd)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 3a4afce  [SPARK-28687][SQL] Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`
  add 3ec24fd  [SPARK-28203][CORE][PYTHON] PythonRDD should respect SparkContext's hadoop configuration

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/api/python/PythonHadoopUtil.scala |  2 +-
 .../org/apache/spark/api/python/PythonRDD.scala    |  6 +-
 .../apache/spark/api/python/PythonRDDSuite.scala   | 87 +-
 3 files changed, 88 insertions(+), 7 deletions(-)
[spark] branch master updated (f0834d3 -> 4ddad79)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from f0834d3  Revert "[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server"
  add 4ddad79  [SPARK-28598][SQL] Few date time manipulation functions does not provide versions supporting Column as input through the Dataframe API

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala    | 46 --
 .../org/apache/spark/sql/DateFunctionsSuite.scala | 11 ++
 2 files changed, 53 insertions(+), 4 deletions(-)
[spark] branch master updated (4ddad79 -> c96b615)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 4ddad79  [SPARK-28598][SQL] Few date time manipulation functions does not provide versions supporting Column as input through the Dataframe API
  add c96b615  [SPARK-28390][SQL][PYTHON][TESTS][FOLLOW-UP] Update the TODO with actual blocking JIRA IDs

No new revisions were added by this update.

Summary of changes:
 .../src/test/resources/sql-tests/inputs/udf/pgSQL/udf-select_having.sql | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-28578][INFRA] Improve Github pull request template
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0ea8db9  [SPARK-28578][INFRA] Improve Github pull request template

0ea8db9 is described below

commit 0ea8db9fd3d882140d8fa305dd69fc94db62cf8f
Author: HyukjinKwon
AuthorDate: Fri Aug 16 09:45:15 2019 +0900

    [SPARK-28578][INFRA] Improve Github pull request template

    ### What changes were proposed in this pull request?

    This PR proposes to improve the Github template for better and faster review iterations and better interactions between PR authors and reviewers.

    As suggested in the [dev mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-New-sections-in-Github-Pull-Request-description-template-td27527.html), this PR referred to [Kubernetes' PR template](https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md). Therefore, these fields are newly added:

    ```
    ### Why are the changes needed?

    ### Does this PR introduce any user-facing change?
    ```

    and some comments were added.

    ### Why are the changes needed?

    Currently, many PR descriptions are poorly formatted, which causes overhead between PR authors and reviewers. Poorly formatted PR descriptions cause multiple problems:

    - Some PRs still have a single-line description for 500+ changed lines of code in a critical path.
    - Some PRs do not describe behaviour changes, and reviewers need to find and document them.
    - Some PRs are hard to review without an outline, but an outline is sometimes not given.
    - Spark is getting old, and sometimes we need to track the history deeply. Due to a poorly formatted PR description, finding the root cause of a bug can require reading whole diffs across whole commit histories.
    - Reviews take a while, but the number of PRs still grows.

    This PR targets to alleviate the problems and the situation.

    ### Does this PR introduce any user-facing change?

    Yes, it changes the PR template shown when PRs are opened. This PR uses the template it proposes.

    ### How was this patch tested?

    Manually tested via the Github preview feature.

    Closes #25310 from HyukjinKwon/SPARK-28578.

    Lead-authored-by: HyukjinKwon
    Co-authored-by: Hyukjin Kwon
    Signed-off-by: HyukjinKwon
---
 .github/PULL_REQUEST_TEMPLATE | 44 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/.github/PULL_REQUEST_TEMPLATE b/.github/PULL_REQUEST_TEMPLATE
index e7ed23d..be57f00 100644
--- a/.github/PULL_REQUEST_TEMPLATE
+++ b/.github/PULL_REQUEST_TEMPLATE
@@ -1,10 +1,42 @@
-## What changes were proposed in this pull request?
+
-(Please fill in changes proposed in this fix)
+### What changes were proposed in this pull request?
+
-## How was this patch tested?
-(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
-(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
+### Why are the changes needed?
+
-Please review https://spark.apache.org/contributing.html before opening a pull request.
+
+### Does this PR introduce any user-facing change?
+
+
+### How was this patch tested?
+
[spark] branch master updated (25857c6 -> ec84415)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 25857c6  [SPARK-28647][WEBUI] Recover additional metric feature and remove additional-metrics.js
  add ec84415  [SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql'

No new revisions were added by this update.

Summary of changes:
 .../sql-tests/inputs/udf/udf-group-by.sql      |  28 +--
 .../sql-tests/results/udf/udf-group-by.sql.out | 273 +++--
 2 files changed, 154 insertions(+), 147 deletions(-)
[spark] branch master updated: Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1de4a22  Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"

1de4a22 is described below

commit 1de4a22c52779bbdf68e40167a91e8606225f6b7
Author: HyukjinKwon
AuthorDate: Mon Aug 19 20:31:39 2019 +0900

    Revert "[SPARK-28759][BUILD] Upgrade scala-maven-plugin to 4.1.1"

    This reverts commit 1819a6f22eee5314197aab4c169c74bd6ff6c17c.
---
 pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index 3b03833..140d19e 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2280,7 +2280,7 @@
         <groupId>net.alchim31.maven</groupId>
         <artifactId>scala-maven-plugin</artifactId>
-        <version>4.1.1</version>
+        <version>3.4.4</version>
         <executions>
           <execution>
             <id>eclipse-add-source</id>
[spark] branch master updated: [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2fd83c2  [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions

2fd83c2 is described below

commit 2fd83c28203ef9c300a3feaaecc8edb5546814cf
Author: HyukjinKwon
AuthorDate: Mon Aug 19 20:15:17 2019 +0900

    [SPARK-28756][R][FOLLOW-UP] Specify minimum and maximum Java versions

    ### What changes were proposed in this pull request?

    This PR proposes to set both a minimum and a maximum Java version specification (see https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-portable-packages). There seems to be no standard way to specify both, given the documentation and other packages (see https://gist.github.com/glin/bd36cf1eb0c7f8b1f511e70e2fb20f8d). I found two ways in existing packages on CRAN:

    ```
    Package (<= 1 & > 2)
    Package (<= 1, > 2)
    ```

    The latter seems closer to other standard notations such as `R (>= 2.14.0), R (>= r56550)`, so I have chosen the latter way.

    ### Why are the changes needed?

    The package might otherwise be rejected by CRAN. See https://github.com/apache/spark/pull/25472#issuecomment-522405742

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    JDK 8

    ```bash
    ./build/mvn -DskipTests -Psparkr clean package
    ./R/run-tests.sh
    ...
    basic tests for CRAN: .
    ...
    ```

    JDK 11

    ```bash
    ./build/mvn -DskipTests -Psparkr -Phadoop-3.2 clean package
    ./R/run-tests.sh
    ...
    basic tests for CRAN: .
    ...
    ```

    Closes #25490 from HyukjinKwon/SPARK-28756.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 R/pkg/DESCRIPTION |  2 +-
 R/pkg/R/client.R  | 13 ++++++++-----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 6a83e00..f478086 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -13,7 +13,7 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
 License: Apache License (== 2.0)
 URL: https://www.apache.org/ https://spark.apache.org/
 BugReports: https://spark.apache.org/contributing.html
-SystemRequirements: Java (>= 8)
+SystemRequirements: Java (>= 8, < 12)
 Depends:
     R (>= 3.1),
     methods
diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R
index 3299346..2ff68ab 100644
--- a/R/pkg/R/client.R
+++ b/R/pkg/R/client.R
@@ -64,7 +64,9 @@ checkJavaVersion <- function() {
   javaBin <- "java"
   javaHome <- Sys.getenv("JAVA_HOME")
   javaReqs <- utils::packageDescription(utils::packageName(), fields = c("SystemRequirements"))
-  sparkJavaVersion <- as.numeric(tail(strsplit(javaReqs, "[(=)]")[[1]], n = 1L))
+  sparkJavaVersions <- strsplit(javaReqs, "[(,)]")[[1]]
+  minJavaVersion <- as.numeric(strsplit(sparkJavaVersions[[2]], ">= ")[[1]][[2]])
+  maxJavaVersion <- as.numeric(strsplit(sparkJavaVersions[[3]], "< ")[[1]][[2]])
   if (javaHome != "") {
     javaBin <- file.path(javaHome, "bin", javaBin)
   }
@@ -99,10 +101,11 @@ checkJavaVersion <- function() {
   } else {
     javaVersionNum <- as.integer(versions[1])
   }
-  if (javaVersionNum < sparkJavaVersion) {
-    stop(paste("Java version", sparkJavaVersion,
-               ", or greater, is required for this package; found version:",
-               javaVersionStr))
+  if (javaVersionNum < minJavaVersion || javaVersionNum >= maxJavaVersion) {
+    stop(paste0("Java version, greater than or equal to ", minJavaVersion,
+                " and less than ", maxJavaVersion,
+                ", is required for this package; found version: ",
+                javaVersionStr))
   }
   return(javaVersionNum)
 }
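The patched `client.R` splits the `SystemRequirements` field `Java (>= 8, < 12)` on `[(,)]` and peels out the two bounds, then checks the detected Java version against the half-open range. The same parsing logic can be sketched in Python (this is an illustrative port, not the R code that ships in the package):

```python
import re

def parse_java_requirement(req):
    """Parse a DESCRIPTION-style requirement such as 'Java (>= 8, < 12)'.

    Returns (min_version, max_version) as floats, mirroring what the
    patched checkJavaVersion() extracts from the SystemRequirements field.
    """
    m = re.search(r"\(>=\s*([\d.]+)\s*,\s*<\s*([\d.]+)\)", req)
    if not m:
        raise ValueError("unsupported requirement string: %r" % req)
    return float(m.group(1)), float(m.group(2))

def java_version_ok(version, req="Java (>= 8, < 12)"):
    """Half-open range check: min <= version < max, as in client.R."""
    lo, hi = parse_java_requirement(req)
    return lo <= version < hi

print(parse_java_requirement("Java (>= 8, < 12)"))  # (8.0, 12.0)
print(java_version_ok(11), java_version_ok(12))     # True False
```

Note the asymmetry the commit introduces: the minimum is inclusive (`>=`) while the maximum is exclusive (`<`), so Java 12 is rejected while Java 8 and 11 pass.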
[spark] branch master updated (97dc4c0 -> ec14b6e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 97dc4c0  [SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession
  add ec14b6e  [SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base

No new revisions were added by this update.

Summary of changes:
 .../{pgSQL/join.sql => udf/pgSQL/udf-join.sql}         | 450
 .../{pgSQL/join.sql.out => udf/pgSQL/udf-join.sql.out} | 578 ++---
 2 files changed, 515 insertions(+), 513 deletions(-)
 copy sql/core/src/test/resources/sql-tests/inputs/{pgSQL/join.sql => udf/pgSQL/udf-join.sql} (80%)
 copy sql/core/src/test/resources/sql-tests/results/{pgSQL/join.sql.out => udf/pgSQL/udf-join.sql.out} (71%)
[spark] branch master updated (7701d29 -> ab1819d)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 7701d29  [SPARK-28877][PYSPARK][test-hadoop3.2][test-java11] Make jaxb-runtime compile-time dependency
  add ab1819d  [SPARK-28527][SQL][TEST][FOLLOW-UP] Ignores Thrift server ThriftServerQueryTestSuite

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated (02a0cde -> 4b16cf1)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 02a0cde  [SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
  add 4b16cf1  [SPARK-27988][SQL][TEST] Port AGGREGATES.sql [Part 3]

No new revisions were added by this update.

Summary of changes:
 .../sql-tests/inputs/pgSQL/aggregates_part3.sql  | 270 +
 .../results/pgSQL/aggregates_part3.sql.out       |  22 ++
 2 files changed, 292 insertions(+)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/aggregates_part3.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
[spark] branch master updated (00cb2f9 -> c02c86e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 00cb2f9  [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize
  add c02c86e  [SPARK-28691][EXAMPLES] Add Java/Scala DirectKerberizedKafkaWordCount examples

No new revisions were added by this update.

Summary of changes:
 ...ava => JavaDirectKerberizedKafkaWordCount.java} | 66 ++
 ...scala => DirectKerberizedKafkaWordCount.scala}  | 54 +++---
 2 files changed, 101 insertions(+), 19 deletions(-)
 copy examples/src/main/java/org/apache/spark/examples/streaming/{JavaDirectKafkaWordCount.java => JavaDirectKerberizedKafkaWordCount.java} (54%)
 copy examples/src/main/scala/org/apache/spark/examples/streaming/{DirectKafkaWordCount.scala => DirectKerberizedKafkaWordCount.scala} (54%)
[spark] branch master updated (90b10b4 -> 8848af2)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 90b10b4  [HOT-FIX] fix compilation
  add 8848af2  [SPARK-28881][PYTHON][TESTS][FOLLOW-UP] Use SparkSession(SparkContext(...)) to prevent for Spark conf to affect other tests

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_arrow.py | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)
svn commit: r35482 - /dev/spark/v2.4.4-rc3-bin/ /release/spark/spark-2.4.4/
Author: gurwls223
Date: Sat Aug 31 05:15:37 2019
New Revision: 35482

Log: (empty)

Added:
    release/spark/spark-2.4.4/
      - copied from r35481, dev/spark/v2.4.4-rc3-bin/
Removed:
    dev/spark/v2.4.4-rc3-bin/
[spark] branch master updated (d5688dc -> 5cf2602)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from d5688dc  [SPARK-28573][SQL] Convert InsertIntoTable(HiveTableRelation) to DataSource inserting for partitioned table
  add 5cf2602  [SPARK-28946][R][DOCS] Add some more information about building SparkR on Windows

No new revisions were added by this update.

Summary of changes:
 R/WINDOWS.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)
[spark] branch master updated: [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new df39855  [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases

df39855 is described below

commit df39855db826fd4bead85a2ca01eda15c101bbbe
Author: Sean Owen
AuthorDate: Wed Sep 4 13:11:09 2019 +0900

    [SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases

    ### What changes were proposed in this pull request?

    Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.

    ### Why are the changes needed?

    If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Manually tested different paths and failures by commenting parts of the script in and out and modifying it directly.

    Closes #25667 from srowen/SPARK-28963.

    Authored-by: Sean Owen
    Signed-off-by: HyukjinKwon
---
 build/mvn | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/build/mvn b/build/mvn
index 75feb2f..f68377b 100755
--- a/build/mvn
+++ b/build/mvn
@@ -80,6 +80,15 @@ install_mvn() {
   fi
   if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
     local APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download='}
+
+    if [ $(command -v curl) ]; then
+      local TEST_MIRROR_URL="${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz"
+      if ! curl -L --output /dev/null --silent --head --fail "$TEST_MIRROR_URL" ; then
+        # Fall back to archive.apache.org for older Maven
+        echo "Falling back to archive.apache.org to download Maven"
+        APACHE_MIRROR="https://archive.apache.org/dist"
+      fi
+    fi

     install_app \
       "${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries" \
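The `build/mvn` change probes the primary mirror with a `curl --head --fail` request and switches the base URL to archive.apache.org if the probe fails. The selection logic, abstracted away from `curl`, can be sketched like this (the mirror URLs are taken from the commit; the `reachable` callback is a stand-in for the HTTP HEAD probe):

```python
def choose_mirror(url_path, mirrors, reachable):
    """Return the download URL from the first mirror that serves the file.

    A sketch of the build/mvn fallback: try the mirrors in order and, if
    none answers the probe, still fall back to the last entry (the
    archive), matching the script's behaviour of using curl only as a
    best-effort check.
    """
    for base in mirrors:
        url = base.rstrip("/") + "/" + url_path.lstrip("/")
        if reachable(url):
            return url
    # No mirror answered the probe: use the archive anyway.
    return mirrors[-1].rstrip("/") + "/" + url_path.lstrip("/")

mirrors = [
    "https://www.apache.org/dyn/closer.lua?action=download=",  # primary, as in the script
    "https://archive.apache.org/dist",                         # archive fallback
]
path = "maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz"

# Simulate the primary mirror having dropped the old release:
picked = choose_mirror(path, mirrors, lambda u: u.startswith("https://archive"))
print(picked.startswith("https://archive.apache.org/dist"))  # True
```

In the real script the probe only runs when `curl` is on the `PATH`; without `curl`, the primary mirror is used unprobed, which is why the fallback is best-effort rather than guaranteed.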
[spark] branch master updated (4c8f114 -> a838699)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 4c8f114  [SPARK-27489][WEBUI] UI updates to show executor resource information
  add a838699  [SPARK-28694][EXAMPLES] Add Java/Scala StructuredKerberizedKafkaWordCount examples

No new revisions were added by this update.

Summary of changes:
 ...=> JavaStructuredKerberizedKafkaWordCount.java} | 57 ++
 ...la => StructuredKerberizedKafkaWordCount.scala} | 56 +
 2 files changed, 94 insertions(+), 19 deletions(-)
 copy examples/src/main/java/org/apache/spark/examples/sql/streaming/{JavaStructuredKafkaWordCount.java => JavaStructuredKerberizedKafkaWordCount.java} (53%)
 copy examples/src/main/scala/org/apache/spark/examples/sql/streaming/{StructuredKafkaWordCount.scala => StructuredKerberizedKafkaWordCount.scala} (56%)
[spark] branch master updated (f8f7c52 -> 31b59bd)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from f8f7c52  [SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core
  add 31b59bd  [SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python if not set

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/api/python/PythonRunner.scala | 7 +++++++
 core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala | 9 +++++++++
 2 files changed, 16 insertions(+)
[spark] branch master updated (98e1a4c -> 1fd7f29)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 98e1a4c  [SPARK-28319][SQL] Implement SHOW TABLES for Data Source V2 Tables
  add 1fd7f29  [SPARK-28857][INFRA] Clean up the comments of PR template during merging

No new revisions were added by this update.

Summary of changes:
 dev/merge_spark_pr.py | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
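SPARK-28857 makes the merge script strip the PR template's comment placeholders out of the description before it becomes a commit message. The digest does not include the script's diff, so the following is only an illustrative sketch of that kind of cleanup, with a hypothetical `clean_pr_body` helper and made-up sample body:

```python
import re

def clean_pr_body(body):
    """Drop the HTML comment placeholders a PR template leaves behind.

    Illustrative only: the exact regex and helper name used by
    dev/merge_spark_pr.py are not shown in this digest.
    """
    without_comments = re.sub(r"<!--.*?-->", "", body, flags=re.DOTALL)
    # Collapse runs of blank lines left where comments were removed.
    collapsed = re.sub(r"\n{3,}", "\n\n", without_comments)
    return collapsed.strip()

body = """### What changes were proposed in this pull request?
<!-- Please clarify what changes you are proposing. -->
Adds a new config.

### How was this patch tested?
<!-- If tests were added, say they were added here. -->
Unit tests."""

print(clean_pr_body(body))
```

The non-greedy `.*?` with `re.DOTALL` matters here: template comments span multiple lines, and a greedy match would swallow the real answers between two comments.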
[spark] branch master updated (9f8c7a2 -> 13fd32c)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 9f8c7a2  [SPARK-28709][DSTREAMS] Fix StreamingContext leak through Streaming
  add 13fd32c  [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11 support for pull request builds

No new revisions were added by this update.

Summary of changes:
 dev/run-tests-jenkins.py                                          | 2 +-
 dev/run-tests.py                                                  | 7 +++++++
 .../kubernetes/integration-tests/dev/dev-run-integration-tests.sh | 6 ++++++
 3 files changed, 14 insertions(+), 1 deletion(-)
[spark] branch master updated (c61270f -> 6e12b58)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from c61270f  [SPARK-27395][SQL] Improve EXPLAIN command
  add 6e12b58  [SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server

No new revisions were added by this update.

Summary of changes:
 project/SparkBuild.scala                           |   3 +-
 .../org/apache/spark/sql/SQLQueryTestSuite.scala   |  86 ++---
 sql/hive-thriftserver/pom.xml                      |   7 +
 .../sql/hive/thriftserver/HiveThriftServer2.scala  |   3 +-
 .../thriftserver/ThriftServerQueryTestSuite.scala  | 358 +
 5 files changed, 416 insertions(+), 41 deletions(-)
 create mode 100644 sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala
[spark] branch master updated (19f882c -> bd3915e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 19f882c  [SPARK-28933][ML] Reduce unnecessary shuffle in ALS when initializing factors
  add bd3915e  Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/PartitionTransforms.scala |  77
 .../spark/sql/catalyst/analysis/Analyzer.scala     |   6 +-
 .../plans/logical/basicLogicalOperators.scala      |  47 +-
 .../datasources/v2/DataSourceV2Implicits.scala     |   9 -
 .../apache/spark/sql/connector/InMemoryTable.scala |   5 +-
 .../org/apache/spark/sql/DataFrameWriter.scala     |  11 +-
 .../org/apache/spark/sql/DataFrameWriterV2.scala   | 365
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  28 --
 .../datasources/v2/DataSourceV2Strategy.scala      |  20 +-
 .../datasources/v2/V2WriteSupportCheck.scala       |   6 +-
 .../scala/org/apache/spark/sql/functions.scala     |  64 ---
 .../sql/sources/v2/DataFrameWriterV2Suite.scala    | 508
 12 files changed, 36 insertions(+), 1110 deletions(-)
 delete mode 100644 sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/PartitionTransforms.scala
 delete mode 100644 sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala
 delete mode 100644 sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataFrameWriterV2Suite.scala
[spark] branch master updated (bd3915e -> e1946a5)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

 from bd3915e  Revert "[SPARK-28612][SQL] Add DataFrameWriterV2 API"
  add e1946a5  [SPARK-28705][SQL][TEST] Drop tables after being used in AnalysisExternalCatalogSuite

No new revisions were added by this update.

Summary of changes:
 .../analysis/AnalysisExternalCatalogSuite.scala | 48 +++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 21 deletions(-)
[spark] branch master updated: [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 00cb2f9  [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize

00cb2f9 is described below

commit 00cb2f99ccbd7c0fdba19ba63c4ec73ca97dea66
Author: HyukjinKwon
AuthorDate: Tue Aug 27 17:30:06 2019 +0900

    [SPARK-28881][PYTHON][TESTS] Add a test to make sure toPandas with Arrow optimization throws an exception per maxResultSize

    ### What changes were proposed in this pull request?

    This PR proposes to add a test case for:

    ```bash
    ./bin/pyspark --conf spark.driver.maxResultSize=1m
    ```

    ```python
    spark.conf.set("spark.sql.execution.arrow.enabled", True)
    spark.range(1000).toPandas()
    ```

    ```
    Empty DataFrame
    Columns: [id]
    Index: []
    ```

    which can result in partial results (see https://github.com/apache/spark/pull/25593#issuecomment-525153808). This regression was found between Spark 2.3 and Spark 2.4, and accidentally fixed.

    ### Why are the changes needed?

    To prevent the same regression in the future.

    ### Does this PR introduce any user-facing change?

    No.

    ### How was this patch tested?

    Test was added.

    Closes #25594 from HyukjinKwon/SPARK-28881.

    Authored-by: HyukjinKwon
    Signed-off-by: HyukjinKwon
---
 python/pyspark/sql/tests/test_arrow.py | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/test_arrow.py b/python/pyspark/sql/tests/test_arrow.py
index f533083..50c82b0 100644
--- a/python/pyspark/sql/tests/test_arrow.py
+++ b/python/pyspark/sql/tests/test_arrow.py
@@ -22,7 +22,7 @@ import time
 import unittest
 import warnings

-from pyspark.sql import Row
+from pyspark.sql import Row, SparkSession
 from pyspark.sql.functions import udf
 from pyspark.sql.types import *
 from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, have_pyarrow, \
@@ -421,6 +421,35 @@ class ArrowTests(ReusedSQLTestCase):
                 run_test(*case)


+@unittest.skipIf(
+    not have_pandas or not have_pyarrow,
+    pandas_requirement_message or pyarrow_requirement_message)
+class MaxResultArrowTests(unittest.TestCase):
+    # These tests are separate as 'spark.driver.maxResultSize' configuration
+    # is a static configuration to Spark context.
+
+    @classmethod
+    def setUpClass(cls):
+        cls.spark = SparkSession.builder \
+            .master("local[4]") \
+            .appName(cls.__name__) \
+            .config("spark.driver.maxResultSize", "10k") \
+            .getOrCreate()
+
+        # Explicitly enable Arrow and disable fallback.
+        cls.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
+        cls.spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
+
+    @classmethod
+    def tearDownClass(cls):
+        if hasattr(cls, "spark"):
+            cls.spark.stop()
+
+    def test_exception_by_max_results(self):
+        with self.assertRaisesRegexp(Exception, "is bigger than"):
+            self.spark.range(0, 1, 1, 100).toPandas()
+
+
 class EncryptionArrowTests(ArrowTests):

     @classmethod
[spark] branch master updated (7452786 -> 137b20b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7452786 [SPARK-28789][DOCS][SQL] Document ALTER DATABASE command add 137b20b [SPARK-28818][SQL] Respect source column nullability in the arrays created by `freqItems()` No new revisions were added by this update. Summary of changes: .../spark/sql/execution/stat/FrequentItems.scala | 19 .../org/apache/spark/sql/DataFrameStatSuite.scala | 26 +- 2 files changed, 35 insertions(+), 10 deletions(-)
[spark] branch master updated (c18f849 -> 7ce0f2b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c18f849 [SPARK-24663][STREAMING][TESTS] StreamingContextSuite: Wait until slow receiver has been initialized, but with hard timeout add 7ce0f2b [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_serde.py | 4 python/pyspark/sql/types.py| 2 +- 2 files changed, 5 insertions(+), 1 deletion(-)
[spark] branch master updated (8f632d7 -> 850833f)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8f632d7 [MINOR][DOCS] Fix few typos in the java docs add 850833f [SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/internal/SQLConf.scala | 3 ++- .../org/apache/spark/sql/internal/SQLConfSuite.scala | 19 +++ 2 files changed, 21 insertions(+), 1 deletion(-)
[spark] branch master updated: [MINOR][DOCS] Fix few typos in the java docs
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8f632d7 [MINOR][DOCS] Fix few typos in the java docs 8f632d7 is described below commit 8f632d70455156010f0e87288541304ad2164a52 Author: dengziming AuthorDate: Thu Sep 12 09:30:03 2019 +0900 [MINOR][DOCS] Fix few typos in the java docs JIRA :https://issues.apache.org/jira/browse/SPARK-29050 'a hdfs' change into 'an hdfs' 'an unique' change into 'a unique' 'an url' change into 'a url' 'a error' change into 'an error' Closes #25756 from dengziming/feature_fix_typos. Authored-by: dengziming Signed-off-by: HyukjinKwon --- R/pkg/R/context.R | 4 ++-- core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala | 2 +- core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala | 2 +- core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala| 2 +- docs/spark-standalone.md | 2 +- .../org/apache/spark/streaming/kinesis/KinesisCheckpointer.scala | 2 +- python/pyspark/context.py | 2 +- .../sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala | 4 ++-- .../scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala | 4 ++-- sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala | 2 +- .../src/test/resources/ql/src/test/queries/clientpositive/load_fs2.q | 2 +- .../main/scala/org/apache/spark/streaming/dstream/InputDStream.scala | 4 ++-- 13 files changed, 17 insertions(+), 17 deletions(-) diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index 51ae2d2..93ba130 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -301,7 +301,7 @@ broadcastRDD <- function(sc, object) { #' Set the checkpoint directory #' #' Set the directory under which RDDs are going to be checkpointed. 
The -#' directory must be a HDFS path if running on a cluster. +#' directory must be an HDFS path if running on a cluster. #' #' @param sc Spark Context to use #' @param dirName Directory path @@ -446,7 +446,7 @@ setLogLevel <- function(level) { #' Set checkpoint directory #' #' Set the directory under which SparkDataFrame are going to be checkpointed. The directory must be -#' a HDFS path if running on a cluster. +#' an HDFS path if running on a cluster. #' #' @rdname setCheckpointDir #' @param directory Directory path to checkpoint to diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala index 330c2f6..3485128 100644 --- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala +++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala @@ -609,7 +609,7 @@ class JavaSparkContext(val sc: SparkContext) extends Closeable { /** * Set the directory under which RDDs are going to be checkpointed. The directory must - * be a HDFS path if running on a cluster. + * be an HDFS path if running on a cluster. */ def setCheckpointDir(dir: String) { sc.setCheckpointDir(dir) diff --git a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala index c96640a..b552444 100644 --- a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala +++ b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala @@ -124,7 +124,7 @@ private[spark] class MetricsSystem private ( * If either ID is not available, this defaults to just using . * * @param source Metric source to be named by this method. - * @return An unique metric name for each combination of + * @return A unique metric name for each combination of * application, executor/driver and metric source. 
*/ private[spark] def buildRegistryName(source: Source): String = { diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala index d188bdd..49e32d0 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala @@ -27,7 +27,7 @@ import org.apache.spark.util.Utils /** * :: DeveloperApi :: - * This class represent an unique identifier for a BlockManager. + * This class represent a unique identifier for a BlockManager. * * The first 2 constructors of this class are made private to ensure that BlockManagerId objects * can be created only using the apply method in the companion object. This allows de-duplication diff --git a/core/src/test/sca
[spark] branch branch-2.4 updated: [MINOR][DOCS] Fix few typos in the java docs
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new ecb2052 [MINOR][DOCS] Fix few typos in the java docs ecb2052 is described below commit ecb2052bf0cf7dea749cca10d864f7383eeb1224 Author: dengziming AuthorDate: Thu Sep 12 09:30:03 2019 +0900 [MINOR][DOCS] Fix few typos in the java docs JIRA :https://issues.apache.org/jira/browse/SPARK-29050 'a hdfs' change into 'an hdfs' 'an unique' change into 'a unique' 'an url' change into 'a url' 'a error' change into 'an error' Closes #25756 from dengziming/feature_fix_typos. Authored-by: dengziming Signed-off-by: HyukjinKwon (cherry picked from commit 8f632d70455156010f0e87288541304ad2164a52) Signed-off-by: HyukjinKwon --- R/pkg/R/context.R | 4 ++-- core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala | 2 +- core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala | 2 +- core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala | 2 +- core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala| 2 +- docs/spark-standalone.md | 2 +- .../org/apache/spark/streaming/kinesis/KinesisCheckpointer.scala | 2 +- python/pyspark/context.py | 2 +- .../sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala | 4 ++-- .../scala/org/apache/spark/sql/streaming/StreamingQueryListener.scala | 4 ++-- sql/core/src/main/scala/org/apache/spark/sql/streaming/progress.scala | 2 +- .../src/test/resources/ql/src/test/queries/clientpositive/load_fs2.q | 2 +- .../main/scala/org/apache/spark/streaming/dstream/InputDStream.scala | 4 ++-- 13 files changed, 17 insertions(+), 17 deletions(-) diff --git a/R/pkg/R/context.R b/R/pkg/R/context.R index b49f7c3..f1a6b84 100644 --- a/R/pkg/R/context.R +++ b/R/pkg/R/context.R @@ -297,7 +297,7 @@ broadcastRDD <- function(sc, object) { #' Set the checkpoint directory #' #' Set 
the directory under which RDDs are going to be checkpointed. The -#' directory must be a HDFS path if running on a cluster. +#' directory must be an HDFS path if running on a cluster. #' #' @param sc Spark Context to use #' @param dirName Directory path @@ -442,7 +442,7 @@ setLogLevel <- function(level) { #' Set checkpoint directory #' #' Set the directory under which SparkDataFrame are going to be checkpointed. The directory must be -#' a HDFS path if running on a cluster. +#' an HDFS path if running on a cluster. #' #' @rdname setCheckpointDir #' @param directory Directory path to checkpoint to diff --git a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala index 09c8384..09e9910 100644 --- a/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala +++ b/core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala @@ -713,7 +713,7 @@ class JavaSparkContext(val sc: SparkContext) /** * Set the directory under which RDDs are going to be checkpointed. The directory must - * be a HDFS path if running on a cluster. + * be an HDFS path if running on a cluster. */ def setCheckpointDir(dir: String) { sc.setCheckpointDir(dir) diff --git a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala index 3457a26..657d75c 100644 --- a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala +++ b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala @@ -122,7 +122,7 @@ private[spark] class MetricsSystem private ( * If either ID is not available, this defaults to just using . * * @param source Metric source to be named by this method. - * @return An unique metric name for each combination of + * @return A unique metric name for each combination of * application, executor/driver and metric source. 
*/ private[spark] def buildRegistryName(source: Source): String = { diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala index d4a59c3..83cd4f0 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala @@ -27,7 +27,7 @@ import org.apache.spark.util.Utils /** * :: DeveloperApi :: - * This class represent an unique identifier for a BlockManager. + * This class represent a unique identifier for a BlockManager. * * The first 2 constructors of this class are made private to ensure that BlockManagerId objects * can be created only using the apply
[spark] branch master updated (7ce0f2b -> eec728a)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7ce0f2b [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type add eec728a [SPARK-29057][SQL] remove InsertIntoTable No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 4 +-- .../sql/catalyst/analysis/CheckAnalysis.scala | 9 +++--- .../plans/logical/basicLogicalOperators.scala | 36 -- .../org/apache/spark/sql/DataFrameWriter.scala | 9 +++--- .../execution/datasources/DataSourceStrategy.scala | 11 --- .../datasources/FallBackFileSourceV2.scala | 7 +++-- .../spark/sql/execution/datasources/rules.scala| 25 +++ .../spark/sql/util/DataFrameCallbackSuite.scala| 7 +++-- .../org/apache/spark/sql/hive/HiveStrategies.scala | 15 - .../org/apache/spark/sql/hive/InsertSuite.scala| 1 - 10 files changed, 46 insertions(+), 78 deletions(-)
[spark] branch branch-2.4 updated (0a4b356 -> 483dcf5)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git. from 0a4b356 Revert "[SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles()" add 483dcf5 [SPARK-28912][BRANCH-2.4] Fixed MatchError in getCheckpointFiles() No new revisions were added by this update. Summary of changes: .../org/apache/spark/streaming/Checkpoint.scala | 4 ++-- .../org/apache/spark/streaming/CheckpointSuite.scala | 20 2 files changed, 22 insertions(+), 2 deletions(-)
[spark] branch master updated (54d3f6e -> fa75db2)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 54d3f6e [SPARK-28982][SQL] Implementation Spark's own GetTypeInfoOperation add fa75db2 [SPARK-29026][SQL] Improve error message in `schemaFor` in trait without companion object constructor No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/ScalaReflection.scala | 13 - .../spark/sql/catalyst/ScalaReflectionSuite.scala | 22 ++ 2 files changed, 34 insertions(+), 1 deletion(-)
[spark] branch master updated (c862835 -> db996cc)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c862835 [SPARK-28996][SQL][TESTS] Add tests regarding username of HiveClient add db996cc [SPARK-29074][SQL] Optimize `date_format` for foldable `fmt` No new revisions were added by this update. Summary of changes: .../catalyst/expressions/datetimeExpressions.scala | 32 -- sql/core/benchmarks/DateTimeBenchmark-results.txt | 4 +-- 2 files changed, 26 insertions(+), 10 deletions(-)
[spark] branch master updated (104b9b6 -> 34915b2)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 104b9b6 [SPARK-28483][FOLLOW-UP] Fix flaky test in BarrierTaskContextSuite add 34915b2 [SPARK-29104][CORE][TESTS] Fix PipedRDDSuite to use `eventually` to check thread termination No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/rdd/PipedRDDSuite.scala| 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-)
[spark] branch master updated (723faad -> a754674)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 723faad [SPARK-28912][STREAMING] Fixed MatchError in getCheckpointFiles() add a754674 [SPARK-28000][SQL][TEST] Port comments.sql No new revisions were added by this update. Summary of changes: .../resources/sql-tests/inputs/pgSQL/comments.sql | 48 + .../sql-tests/results/pgSQL/comments.sql.out | 196 + 2 files changed, 244 insertions(+) create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/comments.sql create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/comments.sql.out
[spark] branch master updated (aa805ec -> 86fc890)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from aa805ec [SPARK-23265][ML] Update multi-column error handling logic in QuantileDiscretizer add 86fc890 [SPARK-28988][SQL][TESTS] Fix invalid tests in CliSuite No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/hive/thriftserver/CliSuite.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-)
[spark] branch master updated (4559a82 -> eef5e6d)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 4559a82 [SPARK-28930][SQL] Last Access Time value shall display 'UNKNOWN' in all clients add eef5e6d [SPARK-29113][DOC] Fix some annotation errors and remove meaningless annotations in project No new revisions were added by this update. Summary of changes: .../main/java/org/apache/spark/io/NioBufferedFileInputStream.java | 1 - core/src/main/java/org/apache/spark/memory/MemoryConsumer.java | 1 - .../java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java | 1 - .../spark/util/collection/unsafe/sort/UnsafeExternalSorter.java| 2 -- .../scala/org/apache/spark/deploy/history/ApplicationCache.scala | 7 +++ .../main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala | 1 - core/src/main/scala/org/apache/spark/storage/BlockManager.scala| 1 - .../apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala | 1 - .../main/scala/org/apache/spark/sql/execution/ExplainUtils.scala | 4 ++-- 9 files changed, 5 insertions(+), 14 deletions(-)
[spark] branch master updated (05988b2 -> 3ece8ee)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 05988b2 [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs add 3ece8ee [SPARK-29124][CORE] Use MurmurHash3 `bytesHash(data, seed)` instead of `bytesHash(data)` No new revisions were added by this update. Summary of changes: core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (3ece8ee -> 4559a82)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3ece8ee [SPARK-29124][CORE] Use MurmurHash3 `bytesHash(data, seed)` instead of `bytesHash(data)` add 4559a82 [SPARK-28930][SQL] Last Access Time value shall display 'UNKNOWN' in all clients No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/sql/catalyst/catalog/interface.scala| 7 +-- .../scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 2 files changed, 6 insertions(+), 3 deletions(-)
[spark] branch master updated (12e1583 -> c2734ab)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 12e1583 [SPARK-28927][ML] Rethrow block mismatch exception in ALS when input data is nondeterministic add c2734ab [SPARK-29012][SQL] Support special timestamp values No new revisions were added by this update. Summary of changes: .../catalyst/expressions/datetimeExpressions.scala | 4 +- .../spark/sql/catalyst/util/DateTimeUtils.scala| 67 .../sql/catalyst/util/TimestampFormatter.scala | 21 ++- .../sql/catalyst/util/DateTimeUtilsSuite.scala | 36 +++- .../spark/sql/util/TimestampFormatterSuite.scala | 28 ++- .../resources/sql-tests/inputs/pgSQL/timestamp.sql | 29 ++-- .../sql-tests/results/pgSQL/timestamp.sql.out | 190 - .../org/apache/spark/sql/CsvFunctionsSuite.scala | 10 ++ .../org/apache/spark/sql/JsonFunctionsSuite.scala | 10 ++ 9 files changed, 317 insertions(+), 78 deletions(-)
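SPARK-29012 above adds PostgreSQL-style special timestamp keywords. As a rough idea of what "special timestamp values" means, here is a hedged Python sketch — not Spark's Scala `DateTimeUtils` code; `resolve_special_timestamp` and the exact keyword set are illustrative assumptions based on PostgreSQL's special datetime inputs.

```python
# Hypothetical sketch: map special timestamp keywords ('epoch', 'now',
# 'today', 'yesterday', 'tomorrow') to concrete datetimes; return None
# for anything that should fall through to the normal parser.
from datetime import datetime, timedelta, timezone


def resolve_special_timestamp(value, now=None):
    """Resolve a special timestamp keyword, case- and whitespace-insensitively."""
    now = now or datetime.now(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    specials = {
        "epoch": datetime(1970, 1, 1, tzinfo=timezone.utc),
        "now": now,
        "today": midnight,
        "yesterday": midnight - timedelta(days=1),
        "tomorrow": midnight + timedelta(days=1),
    }
    return specials.get(value.strip().lower())
```

Note that 'today'/'yesterday'/'tomorrow' resolve relative to midnight of the current day, while 'now' keeps the full time-of-day.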
[spark] branch master updated (c2734ab -> 203bf9e)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c2734ab [SPARK-29012][SQL] Support special timestamp values add 203bf9e [SPARK-19926][PYSPARK] make captured exception from JVM side user friendly No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_utils.py | 7 +++ python/pyspark/sql/utils.py| 8 +++- 2 files changed, 14 insertions(+), 1 deletion(-)
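The idea behind SPARK-19926 above — surface the short JVM-side error description to Python users while keeping the long Java stack trace out of the default message — can be sketched as follows. This is a simplified, hypothetical sketch, not the actual `pyspark.sql.utils` code; the class and function names here are illustrative.

```python
# Hypothetical sketch: split a raw JVM error string at the first Java
# stack frame ("\n\tat ...") so str(exception) shows only the short
# description, with the full trace kept on an attribute for debugging.
class AnalysisException(Exception):
    def __init__(self, desc, java_stacktrace=None):
        self.desc = desc
        self.java_stacktrace = java_stacktrace
        super().__init__(desc)

    def __str__(self):
        # User-friendly by default; the JVM trace is still reachable.
        return self.desc


def convert_jvm_exception(message):
    """Wrap a raw JVM error message in a concise Python exception."""
    desc, sep, trace = message.partition("\n\tat ")
    return AnalysisException(desc, (sep + trace) if sep else None)
```

A wrapper like this is typically installed around Py4J calls so that analysis errors raise the concise Python exception instead of dumping the full Java traceback.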
[spark] branch master updated (7d4eb38 -> 1b99d0c)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7d4eb38 [SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tap in Spark documentation add 1b99d0c [SPARK-29069][SQL] ResolveInsertInto should not do table lookup No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/analysis/Analyzer.scala | 58 ++ 1 file changed, 27 insertions(+), 31 deletions(-)
[spark] branch master updated: [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new be04c97 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base be04c97 is described below commit be04c972623df0c44d92ba55a9efadf59a27089e Author: HyukjinKwon AuthorDate: Thu Sep 5 18:34:44 2019 +0900 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base ### What changes were proposed in this pull request? This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base. Diff comparing to 'pgSQL/aggregates_part3.sql' ```diff ``` ### Why are the changes needed? To improve test coverage in UDFs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested via: ```bash build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql" ``` as guided in https://issues.apache.org/jira/browse/SPARK-27921 Closes #25677 from HyukjinKwon/SPARK-28971. Authored-by: HyukjinKwon Signed-off-by: HyukjinKwon --- .../inputs/udf/pgSQL/udf-aggregates_part4.sql | 421 + .../results/udf/pgSQL/udf-aggregates_part4.sql.out | 5 + 2 files changed, 426 insertions(+) diff --git a/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part4.sql b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part4.sql new file mode 100644 index 000..7c3 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/udf/pgSQL/udf-aggregates_part4.sql @@ -0,0 +1,421 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- AGGREGATES [Part 4] +-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L607-L997 + +-- This test file was converted from pgSQL/aggregates_part4.sql. 
+ +-- [SPARK-27980] Ordered-Set Aggregate Functions +-- ordered-set aggregates + +-- select p, percentile_cont(p) within group (order by x::float8) +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select p, percentile_cont(p order by p) within group (order by x) -- error +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select p, sum() within group (order by x::float8) -- error +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select p, percentile_cont(p,p) -- error +-- from generate_series(1,5) x, +-- (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) +-- group by p order by p; + +-- select percentile_cont(0.5) within group (order by b) from aggtest; +-- select percentile_cont(0.5) within group (order by b), sum(b) from aggtest; +-- select percentile_cont(0.5) within group (order by thousand) from tenk1; +-- select percentile_disc(0.5) within group (order by thousand) from tenk1; +-- [SPARK-28661] Hypothetical-Set Aggregate Functions +-- select rank(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4)) v(x); +-- select cume_dist(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4)) v(x); +-- select percent_rank(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4),(5)) v(x); +-- select dense_rank(3) within group (order by x) +-- from (values (1),(1),(2),(2),(3),(3),(4)) v(x); + +-- [SPARK-27980] Ordered-Set Aggregate Functions +-- select percentile_disc(array[0,0.1,0.25,0.5,0.75,0.9,1]) within group (order by thousand) +-- from tenk1; +-- select percentile_cont(array[0,0.25,0.5,0.75,1]) within group (order by thousand) +-- from tenk1; +-- select percentile_disc(array[[null,1,0.5],[0.75,0.25,null]]) 
within group (order by thousand) +-- from tenk1; +-- select percentile_cont(array[0,1,0.25,0.75,0.5,1,0.3,0.32,0.35,0.38,0.4]) within group (order by x) +-- from generate_series(1,6) x; + +-- [SPARK-27980] Ordered-Set Aggregate Functions +-- [SPARK-28382] Array Functions: unnest +-- select ten, mode() within group (order by string4) from tenk1 group by ten; + +-- select percentile_disc(array[0.25,0.5,0.75]) within group (order by x) +-- from unnest('{fred,jim,fred,jack,jill,fred,jill,jim,jim,sheila,jim,sheila}'::text[]) u(x); + +-- [SPARK-28669] System Information Functions +-- check collation propagates up in suitable cases: +-- select pg_collation_for(percentile_disc(1) within group (order by x collate "POSIX")) +-- from (values (
[spark] branch master updated (be04c97 -> 103d50b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from be04c97 [SPARK-28971][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part4.sql' into UDF test base add 103d50b [SPARK-28272][SQL][PYTHON][TESTS] Convert and port 'pgSQL/aggregates_part3.sql' into UDF test base No new revisions were added by this update. Summary of changes: .../aggregates_part3.sql => udf/pgSQL/udf-aggregates_part3.sql} | 8 +--- .../pgSQL/udf-aggregates_part3.sql.out} | 8 2 files changed, 9 insertions(+), 7 deletions(-) copy sql/core/src/test/resources/sql-tests/inputs/{pgSQL/aggregates_part3.sql => udf/pgSQL/udf-aggregates_part3.sql} (98%) copy sql/core/src/test/resources/sql-tests/results/{pgSQL/aggregates_part3.sql.out => udf/pgSQL/udf-aggregates_part3.sql.out} (74%)
[spark] branch master updated (f8bc91f -> 36559b6)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from f8bc91f [SPARK-28782][SQL] Generator support in aggregate expressions add 36559b6 [SPARK-28977][DOCS][SQL] Fix DataFrameReader.json docs to doc that partition column can be numeric, date or timestamp type No new revisions were added by this update. Summary of changes: R/pkg/R/SQLContext.R | 3 ++- python/pyspark/sql/readwriter.py | 3 ++- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 3 ++- 3 files changed, 6 insertions(+), 3 deletions(-)
[spark] branch branch-2.4 updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 992b1bb [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress 992b1bb is described below commit 992b1bb3697f6ec9fb5fc2853c33b591715dfd76 Author: gengjiaan AuthorDate: Wed Jul 31 12:17:44 2019 +0900 [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress ## What changes were proposed in this pull request? The latest docs http://spark.apache.org/docs/latest/configuration.html contains some description as below: spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. -- | -- | -- But the class `org.apache.spark.internal.config.UI` define the config `spark.ui.showConsoleProgress` as below: ``` val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress") .doc("When true, show the progress bar in the console.") .booleanConf .createWithDefault(false) ``` So I think there are exists some little mistake and lead to confuse reader. ## How was this patch tested? No need UT. Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress. 
Authored-by: gengjiaan Signed-off-by: HyukjinKwon (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02) Signed-off-by: HyukjinKwon --- docs/configuration.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 73bae77..d481986 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -891,11 +891,13 @@ Apart from these, the following properties are also available, and may be useful spark.ui.showConsoleProgress - true + false Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line. + +Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.
[spark] branch master updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new dba4375  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
dba4375 is described below

commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So the documented default ("true") does not match the actual default ("false"), which may confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 10886241..93a4fcc 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1062,11 +1062,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple progress
   bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.3 updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 78d1bb1  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
78d1bb1 is described below

commit 78d1bb188efa55038f63ece70aaa0a5ebaa75f5f
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So the documented default ("true") does not match the actual default ("false"), which may confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
    (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02)
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 42e6d87..822f903 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -867,11 +867,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple progress
   bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
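As an aside for readers of the doc text above: the "progress bar in the console" shows, per running stage, how many tasks are complete and how many are running. The following is a rough, self-contained Python sketch of such a bar; it is an illustration only, not Spark's actual `ConsoleProgressBar` implementation, and the exact format string is an approximation.

```python
def render_progress_bar(stage_id, completed, running, total, width=60):
    """Render one stage's progress line, loosely modeled on the style
    [Stage 0:=============>              (50 + 1) / 100]."""
    header = "[Stage %d:" % stage_id
    tailer = " (%d + %d) / %d]" % (completed, running, total)
    bar_width = width - len(header) - len(tailer)
    # Fill the bar proportionally to completed tasks.
    filled = completed * bar_width // max(total, 1)
    bar = "=" * filled + (">" if 0 < filled < bar_width else "")
    return header + bar.ljust(bar_width) + tailer

print(render_progress_bar(0, 50, 1, 100))
```

In the real implementation the line is periodically redrawn in place (with a carriage return) on stderr while the stage runs; this sketch only produces one snapshot.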
[spark] branch master updated: [SPARK-28038][SQL][TEST] Port text.sql
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 261e113  [SPARK-28038][SQL][TEST] Port text.sql
261e113 is described below

commit 261e113449cf1ae84feb073caff01cc27cb5d10f
Author: Yuming Wang
AuthorDate: Wed Jul 31 11:36:26 2019 +0900

    [SPARK-28038][SQL][TEST] Port text.sql

    ## What changes were proposed in this pull request?

    This PR ports text.sql from the PostgreSQL regression tests:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/text.sql

    The expected results can be found here:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/text.out

    When porting the test cases, I found one PostgreSQL-specific feature that does not exist in Spark SQL:
    [SPARK-28037](https://issues.apache.org/jira/browse/SPARK-28037): Add built-in String Functions: quote_literal

    I also found three inconsistent behaviors:
    [SPARK-27930](https://issues.apache.org/jira/browse/SPARK-27930): Spark SQL's format_string cannot fully support PostgreSQL's format
    [SPARK-28036](https://issues.apache.org/jira/browse/SPARK-28036): Built-in udf left/right has inconsistent behavior
    [SPARK-28033](https://issues.apache.org/jira/browse/SPARK-28033): String concatenation should have lower priority than other operators

    ## How was this patch tested?

    N/A

    Closes #24862 from wangyum/SPARK-28038.
    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../test/resources/sql-tests/inputs/pgSQL/text.sql | 137
 .../resources/sql-tests/results/pgSQL/text.sql.out | 375 +
 2 files changed, 512 insertions(+)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql
new file mode 100644
index 000..04d3acc
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql
@@ -0,0 +1,137 @@
+--
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+--
+-- TEXT
+-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/text.sql
+
+SELECT string('this is a text string') = string('this is a text string') AS true;
+
+SELECT string('this is a text string') = string('this is a text strin') AS `false`;
+
+CREATE TABLE TEXT_TBL (f1 string) USING parquet;
+
+INSERT INTO TEXT_TBL VALUES ('doh!');
+INSERT INTO TEXT_TBL VALUES ('hi de ho neighbor');
+
+SELECT '' AS two, * FROM TEXT_TBL;
+
+-- As of 8.3 we have removed most implicit casts to text, so that for example
+-- this no longer works:
+-- Spark SQL implicit cast integer to string
+select length(42);
+
+-- But as a special exception for usability's sake, we still allow implicit
+-- casting to text in concatenations, so long as the other input is text or
+-- an unknown literal. So these work:
+-- [SPARK-28033] String concatenation low priority than other arithmeticBinary
+select string('four: ') || 2+2;
+select 'four: ' || 2+2;
+
+-- but not this:
+-- Spark SQL implicit cast both side to string
+select 3 || 4.0;
+
+/*
+ * various string functions
+ */
+select concat('one');
+-- Spark SQL does not support YYYYMMDD, we replace it to yyyyMMdd
+select concat(1,2,3,'hello',true, false, to_date('20100309','yyyyMMdd'));
+select concat_ws('#','one');
+select concat_ws('#',1,2,3,'hello',true, false, to_date('20100309','yyyyMMdd'));
+select concat_ws(',',10,20,null,30);
+select concat_ws('',10,20,null,30);
+select concat_ws(NULL,10,20,null,30) is null;
+select reverse('abcde');
+-- [SPARK-28036] Built-in udf left/right has inconsistent behavior
+-- [SPARK-28479] Parser error when enabling ANSI mode
+set spark.sql.parser.ansi.enabled=false;
+select i, left('ahoj', i), right('ahoj', i) from range(-5, 6) t(i) order by i;
+set spark.sql.parser.ansi.enabled=true;
+-- [SPARK-28037] Add built-in String Functions: quote_literal
+-- select quote_literal('');
+-- select quote_literal('abc''');
+-- select quote_literal(e'\\');
+
+-- Skip these tests because Spark does not support variadic labeled argument
+-- check variadic labeled argument
+-- select concat(variadic array[1,2,3]);
+-- select concat_ws(',', variadic array[1,2,3]);
+-- select concat_ws(',', variadic NULL::int[]);
+-- select concat(variadic NULL::int[]) is NULL;
+-- select concat(variadic '{}'::int[]) = '';
+--should fail
+-- select concat_ws(',', variadic 10);
+
+-- [SPARK-27930] Replace format to format_string
+/*
+ * format
+ */
+select format_string(NULL);
+select format_string('Hello');
+select format_string('Hello %s', 'World');
+select format_string('Hello %%');
+select format_string('Hello ');
+-- should fail
+select format_string('Hello %s %s', 'World');
+select format_string('Hello %s');
+select format_string('Hello %x', 20
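Several of the ported queries above exercise `concat_ws` NULL semantics: a NULL separator yields a NULL result, while NULL arguments are skipped rather than rendered. A small Python model of that contract, for illustration only (SQL renders booleans and dates differently from Python's `str`, so only the NULL handling carries over):

```python
def concat_ws(sep, *args):
    """SQL-style concat_ws: a None separator yields None;
    None arguments are skipped, not turned into empty strings."""
    if sep is None:
        return None
    return sep.join(str(a) for a in args if a is not None)

# Mirrors 'select concat_ws(',', 10, 20, null, 30);' from the ported file.
print(concat_ws(',', 10, 20, None, 30))
```

This matches why `concat_ws(NULL, 10, 20, null, 30) is null` is expected to be true in the test file above.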
[spark] branch master updated: [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a745381  [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0
a745381 is described below

commit a745381b9d3dd290057ef3089de7fdb9264f1f8b
Author: WeichenXu
AuthorDate: Wed Jul 31 14:26:18 2019 +0900

    [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0

    ## What changes were proposed in this pull request?

    I removed the deprecated `ImageSchema.readImages` and moved some useful methods from class `ImageSchema` into class `ImageFileFormat`. In PySpark, I renamed the `ImageSchema` class to `ImageUtils` and kept some useful Python methods in it.

    ## How was this patch tested?

    UT.

    Please review https://spark.apache.org/contributing.html before opening a pull request.

    Closes #25245 from WeichenXu123/remove_image_schema.

    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
---
 .../org/apache/spark/ml/image/ImageSchema.scala    |  72 -
 .../apache/spark/ml/image/ImageSchemaSuite.scala   | 171 -
 .../ml/source/image/ImageFileFormatSuite.scala     |  18 +++
 project/MimaExcludes.scala                         |   6 +-
 python/pyspark/ml/image.py                         |  38 -
 python/pyspark/ml/tests/test_image.py              |  29 ++--
 6 files changed, 42 insertions(+), 292 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index a7ddf2f..0313626 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -191,76 +191,4 @@ object ImageSchema {
       Some(Row(Row(origin, height, width, nChannels, mode, decoded)))
     }
   }
-
-  /**
-   * Read the directory of images from the local or remote source
-   *
-   * @note If multiple jobs are run in parallel with different sampleRatio or recursive flag,
-   *       there may be a race condition where one job overwrites the hadoop configs of another.
-   * @note If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
-   *       potentially non-deterministic.
-   *
-   * @param path Path to the image directory
-   * @return DataFrame with a single column "image" of images;
-   *         see ImageSchema for the details
-   */
-  @deprecated("use `spark.read.format(\"image\").load(path)` and this `readImages` will be " +
-    "removed in 3.0.0.", "2.4.0")
-  def readImages(path: String): DataFrame = readImages(path, null, false, -1, false, 1.0, 0)
-
-  /**
-   * Read the directory of images from the local or remote source
-   *
-   * @note If multiple jobs are run in parallel with different sampleRatio or recursive flag,
-   *       there may be a race condition where one job overwrites the hadoop configs of another.
-   * @note If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
-   *       potentially non-deterministic.
-   *
-   * @param path Path to the image directory
-   * @param sparkSession Spark Session, if omitted gets or creates the session
-   * @param recursive Recursive path search flag
-   * @param numPartitions Number of the DataFrame partitions,
-   *                      if omitted uses defaultParallelism instead
-   * @param dropImageFailures Drop the files that are not valid images from the result
-   * @param sampleRatio Fraction of the files loaded
-   * @return DataFrame with a single column "image" of images;
-   *         see ImageSchema for the details
-   */
-  @deprecated("use `spark.read.format(\"image\").load(path)` and this `readImages` will be " +
-    "removed in 3.0.0.", "2.4.0")
-  def readImages(
-      path: String,
-      sparkSession: SparkSession,
-      recursive: Boolean,
-      numPartitions: Int,
-      dropImageFailures: Boolean,
-      sampleRatio: Double,
-      seed: Long): DataFrame = {
-    require(sampleRatio <= 1.0 && sampleRatio >= 0, "sampleRatio should be between 0 and 1")
-
-    val session = if (sparkSession != null) sparkSession else SparkSession.builder().getOrCreate
-    val partitions =
-      if (numPartitions > 0) {
-        numPartitions
-      } else {
-        session.sparkContext.defaultParallelism
-      }
-
-    RecursiveFlag.withRecursiveFlag(recursive, session) {
-      SamplePathFilter.withPathFilter(sampleRatio, session, seed) {
-        val binResult = session.sparkContext.binaryFiles(path, partitions)
-        val streams = if (numPartitions == -1) binResult else binResult.repartition(partitions)
-        val convert = (origi
[spark] branch master updated: [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3b14088  [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon
3b14088 is described below

commit 3b140885410362fced9d98fca61d6a357de604af
Author: WeichenXu
AuthorDate: Wed Jul 31 09:10:24 2019 +0900

    [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon

    ## What changes were proposed in this pull request?

    The PySpark worker daemon reads the worker PIDs to kill from stdin:
    https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127

    However, the worker process is forked from the worker daemon process, and we didn't close stdin in the child after the fork. This means the child and the user program can read stdin as well, which blocks the daemon from receiving the PID to kill. This can cause issues because the task reaper might detect that the task was not terminated and eventually kill the JVM.

    This PR fixes the problem by redirecting the standard input of the forked child to devnull.

    ## How was this patch tested?

    Manually tested. In `pyspark`, run:

    ```
    import subprocess
    def task(_):
        subprocess.check_output(["cat"])

    sc.parallelize(range(1), 1).mapPartitions(task).count()
    ```

    Before: the job gets stuck; pressing Ctrl+C exits the job, but the Python worker process does not exit.
    After: the job finishes correctly, "cat" prints nothing (because the dummy stdin is "/dev/null"), and the Python worker process exits normally.

    Please review https://spark.apache.org/contributing.html before opening a pull request.

    Closes #25138 from WeichenXu123/SPARK-26175.
    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
---
 python/pyspark/daemon.py             | 15 +++
 python/pyspark/sql/tests/test_udf.py | 12
 2 files changed, 27 insertions(+)

diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 6f42ad3..97b6b25 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -160,6 +160,21 @@ def manager():
             if pid == 0:
                 # in child process
                 listen_sock.close()
+
+                # It should close the standard input in the child process so that
+                # Python native function executions stay intact.
+                #
+                # Note that if we just close the standard input (file descriptor 0),
+                # the lowest file descriptor (file descriptor 0) will be allocated,
+                # later when other file descriptors should happen to open.
+                #
+                # Therefore, here we redirects it to '/dev/null' by duplicating
+                # another file descriptor for '/dev/null' to the standard input (0).
+                # See SPARK-26175.
+                devnull = open(os.devnull, 'r')
+                os.dup2(devnull.fileno(), 0)
+                devnull.close()
+
                 try:
                     # Acknowledge that the fork was successful
                     outfile = sock.makefile(mode="wb")
diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py
index 803d471..1999311 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -616,6 +616,18 @@ class UDFTests(ReusedSQLTestCase):
         self.spark.range(1).select(f()).collect()

+    def test_worker_original_stdin_closed(self):
+        # Test if it closes the original standard input of worker inherited from the daemon,
+        # and replaces it with '/dev/null'. See SPARK-26175.
+        def task(iterator):
+            import sys
+            res = sys.stdin.read()
+            # Because the standard input is '/dev/null', it reaches to EOF.
+            assert res == '', "Expect read EOF from stdin."
+            return iterator
+
+        self.sc.parallelize(range(1), 1).mapPartitions(task).count()
+

 class UDFInitializationTests(unittest.TestCase):
     def tearDown(self):

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
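The core of the patch — replacing file descriptor 0 rather than closing it — can be reproduced with nothing but the `os` module. Closing fd 0 outright would let the next `open()` in the process silently receive descriptor 0 and become "stdin"; `dup2` to `/dev/null` avoids that:

```python
import os

# Redirect fd 0 to /dev/null, as the patched daemon does after fork.
# Simply closing fd 0 would be dangerous: the next file the process
# opens would be assigned descriptor 0 and silently become "stdin".
devnull = open(os.devnull, "r")
os.dup2(devnull.fileno(), 0)
devnull.close()

# Any read from fd 0 now hits EOF immediately.
assert os.read(0, 1024) == b""
```

Note this permanently redirects the current process's stdin, which is exactly what the daemon intends for its forked child.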
[spark] branch master updated: [MINOR][DOC] Fix a typo 'lister' -> 'listener'
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e58dd4a [MINOR][DOC] Fix a typo 'lister' -> 'listener' e58dd4a is described below commit e58dd4af60dbb0e3378184c56ae9ded3db288852 Author: Yishuang Lu AuthorDate: Thu Aug 8 11:12:18 2019 +0900 [MINOR][DOC] Fix a typo 'lister' -> 'listener' ## What changes were proposed in this pull request? Fix the typo in java doc. ## How was this patch tested? N/A Signed-off-by: Yishuang Lu Closes #25377 from lys0716/dev. Authored-by: Yishuang Lu Signed-off-by: HyukjinKwon --- .../spark/sql/execution/datasources/parquet/ParquetFileFormat.scala | 4 ++-- .../datasources/v2/parquet/ParquetPartitionReaderFactory.scala| 4 ++-- .../apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index 9caa34b..815b62d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -315,7 +315,7 @@ class ParquetFileFormat val vectorizedReader = new VectorizedParquetRecordReader( convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) val iter = new RecordReaderIterator(vectorizedReader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. 
taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) vectorizedReader.initialize(split, hadoopAttemptContext) logDebug(s"Appending $partitionSchema ${file.partitionValues}") @@ -337,7 +337,7 @@ class ParquetFileFormat new ParquetRecordReader[UnsafeRow](readSupport) } val iter = new RecordReaderIterator(reader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) reader.initialize(split, hadoopAttemptContext) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala index 4a281ba..a0f19c3 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala @@ -197,7 +197,7 @@ case class ParquetPartitionReaderFactory( new ParquetRecordReader[UnsafeRow](readSupport) } val iter = new RecordReaderIterator(reader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) reader } @@ -219,7 +219,7 @@ case class ParquetPartitionReaderFactory( val vectorizedReader = new VectorizedParquetRecordReader( convertTz.orNull, enableOffHeapColumnVector && taskContext.isDefined, capacity) val iter = new RecordReaderIterator(vectorizedReader) -// SPARK-23457 Register a task completion lister before `initialization`. +// SPARK-23457 Register a task completion listener before `initialization`. 
taskContext.foreach(_.addTaskCompletionListener[Unit](_ => iter.close())) logDebug(s"Appending $partitionSchema $partitionValues") vectorizedReader diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala index da2f221..7801d96 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryListenersConfSuite.scala @@ -33,7 +33,7 @@ class StreamingQueryListenersConfSuite extends StreamTest with BeforeAndAfter { super.sparkConf.set(STREAMING_QUERY_LISTENERS.key, "org.apache.spark.sql.streaming.TestListener") - test("test if the configured query lister i
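The comment being fixed above ("Register a task completion listener before `initialization`", SPARK-23457) encodes an ordering rule: registration must precede initialization because initialization itself can throw, and the iterator should still be closed. A minimal Python model of that ordering — toy names, not Spark's API:

```python
class RecordIterator:
    """Toy stand-in for Spark's RecordReaderIterator."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def run_task(fail_init):
    """Register the close callback BEFORE 'initialization' (the
    SPARK-23457 pattern), so the iterator is closed even when
    initialization fails."""
    it = RecordIterator()
    completion_listeners = [it.close]  # registered first
    try:
        if fail_init:
            raise IOError("initialize() failed")  # simulated init failure
    except IOError:
        pass  # a real task would propagate the failure after cleanup
    finally:
        for callback in completion_listeners:  # task-completion callbacks
            callback()
    return it
```

Registering after initialization would leak the reader whenever `initialize` throws; here the iterator ends up closed in both the success and the failure path.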
[spark] branch master updated: [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bda5b51 [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` bda5b51 is described below commit bda5b51576e525724315d4892e34c8fa7e27f0c7 Author: Anton Yanchenko AuthorDate: Thu Aug 8 11:47:25 2019 +0900 [SPARK-28454][PYTHON] Validate LongType in `createDataFrame(verifySchema=True)` ## What changes were proposed in this pull request? Add missing validation for `LongType` in `pyspark.sql.types._make_type_verifier`. ## How was this patch tested? Doctests / unittests / manual tests. Unpatched version: ``` In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() Out[23]: [Row(x=None)] ``` Patched: ``` In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() --- ValueErrorTraceback (most recent call last) in > 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect() /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema) 689 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio) 690 else: --> 691 rdd, schema = self._createFromLocal(map(prepare, data), schema) 692 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd()) 693 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema) 405 # make sure data could consumed multiple times 406 if not isinstance(data, list): --> 407 data = list(data) 408 409 if schema is None or isinstance(schema, (list, tuple)): /usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj) 671 672 def prepare(obj): --> 673 
verify_func(obj) 674 return obj 675 elif isinstance(schema, DataType): /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj) 1397 if isinstance(obj, dict): 1398 for f, verifier in verifiers: -> 1399 verifier(obj.get(f)) 1400 elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): 1401 # the order in obj could be different than dataType.fields /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj) 1427 def verify(obj): 1428 if not verify_nullability(obj): -> 1429 verify_value(obj) 1430 1431 return verify /usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj) 1356 if obj < -9223372036854775808 or obj > 9223372036854775807: 1357 raise ValueError( -> 1358 new_msg("object of LongType out of range, got: %s" % obj)) 1359 1360 verify_value = verify_long ValueError: field x: object of LongType out of range, got: 18446744073709551616 ``` Closes #25117 from simplylizz/master. Authored-by: Anton Yanchenko Signed-off-by: HyukjinKwon --- docs/sql-migration-guide-upgrade.md| 2 ++ python/pyspark/sql/tests/test_types.py | 3 ++- python/pyspark/sql/types.py| 14 ++ 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/docs/sql-migration-guide-upgrade.md b/docs/sql-migration-guide-upgrade.md index efd88d0..b2bd8ce 100644 --- a/docs/sql-migration-guide-upgrade.md +++ b/docs/sql-migration-guide-upgrade.md @@ -149,6 +149,8 @@ license: | - Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ig
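The check the patch adds is a plain signed-64-bit range test. Reproduced standalone, mirroring the bounds and the error message visible in the traceback above (the real code lives in `pyspark.sql.types._make_type_verifier`):

```python
LONG_MIN = -9223372036854775808   # -(2**63)
LONG_MAX = 9223372036854775807    # 2**63 - 1

def verify_long(obj):
    """Reject values outside the signed 64-bit LongType range instead of
    silently producing NULL, as the unpatched version did."""
    if obj < LONG_MIN or obj > LONG_MAX:
        raise ValueError("object of LongType out of range, got: %s" % obj)
```

With this in place, `1 << 64` fails fast with a `ValueError` at `createDataFrame` time rather than surfacing later as `Row(x=None)`.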
[spark] branch master updated: [SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3586cdd  [SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal
3586cdd is described below

commit 3586cdd24d9f5cb7d3f642a3da6a26ced1f88cea
Author: Yuming Wang
AuthorDate: Thu Aug 8 10:42:24 2019 +0900

    [SPARK-28395][FOLLOW-UP][SQL] Make spark.sql.function.preferIntegralDivision internal

    ## What changes were proposed in this pull request?

    This PR makes `spark.sql.function.preferIntegralDivision` an internal configuration because it is only used for PostgreSQL test cases. More details: https://github.com/apache/spark/pull/25158#discussion_r309764541

    ## How was this patch tested?

    N/A

    Closes #25376 from wangyum/SPARK-28395-2.

    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 80a7d4e..f779bc8 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1532,8 +1532,9 @@ object SQLConf {
     .createWithDefault(false)

   val PREFER_INTEGRAL_DIVISION = buildConf("spark.sql.function.preferIntegralDivision")
+    .internal()
     .doc("When true, will perform integral division with the / operator " +
-      "if both sides are integral types.")
+      "if both sides are integral types. This is for PostgreSQL test cases only.")
     .booleanConf
     .createWithDefault(false)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
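For readers unfamiliar with the flag: it switches `/` on two integral operands from true division to PostgreSQL-style integral division, which truncates toward zero. A standalone Python illustration of the two behaviors — note that Python's own `//` floors (truncates toward negative infinity) rather than truncating toward zero, so negative operands need explicit handling:

```python
def divide(a, b, prefer_integral_division=False):
    """Sketch of what the config toggles: true division by default,
    integral (truncate-toward-zero) division for int operands when
    the flag is on. Names here are illustrative, not Spark internals."""
    if prefer_integral_division and isinstance(a, int) and isinstance(b, int):
        q = abs(a) // abs(b)
        return -q if (a < 0) != (b < 0) else q
    return a / b
```

So `7 / 2` stays `3.5` by default but becomes `3` with the flag on, and `-7 / 2` becomes `-3` (toward zero), not `-4` (Python's floor).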
[spark] branch master updated: [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c4acfe7  [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type
c4acfe7 is described below

commit c4acfe7761bde41e9d26c5ab3a670ab86165cf9b
Author: Yuming Wang
AuthorDate: Thu Aug 8 17:01:25 2019 +0900

    [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type

    ## What changes were proposed in this pull request?

    This PR fixes the issue that the Hive 0.12 JDBC client cannot handle the binary type:

    ```sql
    Connected to: Hive (version 3.0.0-SNAPSHOT)
    Driver: Hive (version 0.12.0)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 0.12.0 by Apache Hive
    0: jdbc:hive2://localhost:1> SELECT cast('ABC' as binary);
    Error: java.lang.ClassCastException: [B incompatible with java.lang.String (state=,code=0)
    ```

    Server log:
    ```
    19/08/07 10:10:04 WARN ThriftCLIService: Error fetching results:
    java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible with java.lang.String
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(AccessController.java:770)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy26.fetchResults(Unknown Source)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:819)
    Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String
        at org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198)
        at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
        at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$getNextRowSet$1(SparkExecuteStatementOperation.scala:151)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$Lambda$1923.9113BFE0.apply(Unknown Source)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withSchedulerPool(SparkExecuteStatementOperation.scala:299)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:113)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        ... 18 more
    ```

    ## How was this patch tested?

    unit tests

    Closes #25379 from wangyum/SPARK-28474.

    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../SparkThriftServerProtocolVersionsSuite.scala           | 14 --
 .../main/java/org/apache/hive/service/cli/ColumnValue.java |  5 -
 .../main/java/org/apache/hive/service/cli/ColumnValue.java |  5 -
 3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/sql/hive-thriftserver/
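The root cause visible in the `Caused by` frame above is a value dispatcher (`ColumnValue.toTColumnValue`) that assumed every column value was a `String`, while BINARY columns arrive as a raw byte array (`[B`). A hypothetical Python analogue of the missing branch — names and the chosen rendering are illustrative only, not Hive's actual fix:

```python
def to_column_value(value):
    """Sketch of a per-value dispatcher that handles binary explicitly.
    Blindly applying the string path to bytes is the Python analogue of
    the '[B incompatible with java.lang.String' cast failure above."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        # Binary needs its own branch; decoding is one possible rendering.
        return bytes(value).decode("utf-8", errors="replace")
    return str(value)
```

The point is the dispatch, not the rendering: once bytes take a dedicated branch, `SELECT cast('ABC' as binary)` can be marshalled to the old client instead of throwing.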
[spark] branch master updated (c4acfe7 -> 1941d35)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from c4acfe7 [SPARK-28474][SQL] Hive 0.12 JDBC client can not handle binary type add 1941d35 [SPARK-28644][SQL] Port HIVE-10646: ColumnValue does not handle NULL_TYPE No new revisions were added by this update. Summary of changes: .../sql/hive/thriftserver/SparkThriftServerProtocolVersionsSuite.scala | 3 +-- .../v1.2.1/src/main/java/org/apache/hive/service/cli/ColumnValue.java | 2 ++ 2 files changed, 3 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org