[spark] branch branch-2.3 updated: [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

2019-04-19 Thread shaneknapp
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.3 by this push:
 new a85ab12  [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x
a85ab12 is described below

commit a85ab120e3d29a323e6d28aa307d4c20ee5f2c6c
Author: shane knapp 
AuthorDate: Fri Apr 19 09:45:40 2019 -0700

[SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

## What changes were proposed in this pull request?

have jenkins test against python3.6 (instead of 3.4).

## How was this patch tested?

extensive testing on both the centos and ubuntu jenkins workers revealed that 2.3 probably doesn't like python 3.6... :(

NOTE: this is just for branch-2.3

PLEASE DO NOT MERGE

Author: shane knapp 

Closes #24380 from shaneknapp/update-python-executable-2.3.
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 3539c76..d571855 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -114,7 +114,7 @@ def run_individual_python_test(test_name, pyspark_python):
 
 
 def get_default_python_executables():
-python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
 if "python2.7" not in python_execs:
 LOGGER.warning("Not testing against `python2.7` because it could not 
be found; falling"
" back to `python` instead")

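For anyone reading this outside the diff context, here is a self-contained sketch of the interpreter-selection logic that this one-line change touches. It is an approximation, not the verbatim file: `shutil.which` stands in for the `which` helper that `run-tests.py` imports from its own test-support module, and the `python` fallback after the warning is inferred from the surrounding context.

```python
import logging
from shutil import which  # stand-in for the helper run-tests.py actually uses

LOGGER = logging.getLogger(__name__)


def get_default_python_executables():
    # Keep only the interpreters that are actually installed on this worker.
    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
    if "python2.7" not in python_execs:
        LOGGER.warning("Not testing against `python2.7` because it could not be found; "
                       "falling back to `python` instead")
        python_execs.insert(0, "python")  # fallback inferred from the warning above
    return python_execs


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(get_default_python_executables())
```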




[spark] branch branch-2.4 updated: [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

2019-04-19 Thread shaneknapp
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
 new eaa88ae  [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x
eaa88ae is described below

commit eaa88ae5237b23fb7497838f3897a64641efe383
Author: shane knapp 
AuthorDate: Fri Apr 19 09:44:06 2019 -0700

[SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

## What changes were proposed in this pull request?

have jenkins test against python3.6 (instead of 3.4).

## How was this patch tested?

extensive testing on both the centos and ubuntu jenkins workers revealed that 2.4 doesn't like python 3.6... :(

NOTE: this is just for branch-2.4

PLEASE DO NOT MERGE

Closes #24379 from shaneknapp/update-python-executable.

Authored-by: shane knapp 
Signed-off-by: shane knapp 
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index ccbdfac..921fdc9 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -162,7 +162,7 @@ def run_individual_python_test(target_dir, test_name, 
pyspark_python):
 
 
 def get_default_python_executables():
-python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
 if "python2.7" not in python_execs:
 LOGGER.warning("Not testing against `python2.7` because it could not 
be found; falling"
" back to `python` instead")





[spark] branch master updated: [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

2019-04-19 Thread lixiao
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 777b450  [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
777b450 is described below

commit 777b4502b206b7240c6655d3c0b0a9ce08f6a09c
Author: Yuming Wang 
AuthorDate: Fri Apr 19 08:59:08 2019 -0700

[SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

## What changes were proposed in this pull request?

When we compile and test with the hadoop-3.2 profile, we hit the following two issues:
1. `JobSummaryLevel is not a member of object org.apache.parquet.hadoop.ParquetOutputFormat`. Fixed by [PARQUET-381](https://issues.apache.org/jira/browse/PARQUET-381) (Parquet 1.9.0)
2. `java.lang.NoSuchFieldError: BROTLI` at `org.apache.parquet.hadoop.metadata.CompressionCodecName.(CompressionCodecName.java:31)`. Fixed by [PARQUET-1143](https://issues.apache.org/jira/browse/PARQUET-1143) (Parquet 1.10.0)

The reason is that `parquet-hadoop-bundle-1.8.1.jar` conflicts with Parquet 1.10.1. I think it would be safe to upgrade Hive's Parquet to 1.10.1 to work around this issue.

This is what Hive did when upgrading Parquet from 1.8.1 to 1.10.0: [HIVE-17000](https://issues.apache.org/jira/browse/HIVE-17000) and [HIVE-19464](https://issues.apache.org/jira/browse/HIVE-19464). We can see that all of the changes are related to vectorization, and vectorization is disabled by default: see [HIVE-14826](https://issues.apache.org/jira/browse/HIVE-14826) and [HiveConf.java#L2723](https://github.com/apache/hive/blob/rel/release-2.3.4/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2723).

This PR removes [parquet-hadoop-bundle-1.8.1.jar](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop-bundle), so the Hive serde will use [parquet-common-1.10.1.jar, parquet-column-1.10.1.jar and parquet-hadoop-1.10.1.jar](https://github.com/apache/spark/blob/master/dev/deps/spark-deps-hadoop-3.2#L185-L189).

## How was this patch tested?

1. manual tests
2. [upgrade Hive Parquet to 1.10.1 and run the Hadoop 3.2 tests on jenkins](https://github.com/apache/spark/pull/24044#commits-pushed-0c3f962)

Closes #24346 from wangyum/SPARK-27176.

Authored-by: Yuming Wang 
Signed-off-by: gatorsmile 
---
 dev/deps/spark-deps-hadoop-3.2 | 1 -
 pom.xml| 8 +---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2
index a45f02d..8b3bd79 100644
--- a/dev/deps/spark-deps-hadoop-3.2
+++ b/dev/deps/spark-deps-hadoop-3.2
@@ -187,7 +187,6 @@ parquet-common-1.10.1.jar
 parquet-encoding-1.10.1.jar
 parquet-format-2.4.0.jar
 parquet-hadoop-1.10.1.jar
-parquet-hadoop-bundle-1.6.0.jar
 parquet-jackson-1.10.1.jar
 protobuf-java-2.5.0.jar
 py4j-0.10.8.1.jar
diff --git a/pom.xml b/pom.xml
index fce4cbd..5879a76 100644
--- a/pom.xml
+++ b/pom.xml
@@ -221,6 +221,7 @@
 -->
 compile
 compile
+${hive.deps.scope}
 compile
 compile
 test
@@ -2004,7 +2005,7 @@
 ${hive.parquet.group}
 parquet-hadoop-bundle
 ${hive.parquet.version}
-compile
+${hive.parquet.scope}
   
   
 org.codehaus.janino
@@ -2818,8 +2819,9 @@
 core
 ${hive23.version}
 2.3.4
-org.apache.parquet
-1.8.1
+
+provided
 
 4.1.17
   





[spark] branch master updated: [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite

2019-04-19 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 16bbe0f  [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite
16bbe0f is described below

commit 16bbe0f798d2a75ba0a9f2daff709292f6971755
Author: Shahid 
AuthorDate: Fri Apr 19 08:12:20 2019 -0700

[SPARK-27486][CORE][TEST] Enable History server storage information test in 
the HistoryServerSuite

## What changes were proposed in this pull request?

We have disabled a test related to storage in the History Server suite after SPARK-13845. But after SPARK-22050, we can store information about block-update events in the event log by enabling "spark.eventLog.logBlockUpdates.enabled=true".

    So, we can enable the test by adding an event log for an application that has "spark.eventLog.logBlockUpdates.enabled=true" set.

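For context, the sketch below shows roughly how an application would be configured so that its event log contains block-update events, which is what gives the History Server the RDD storage information this test checks. It is a minimal, hedged example: the app name and event-log directory are placeholders, not values taken from this patch.

```python
import os

from pyspark.sql import SparkSession

# The event-log directory must exist before the session starts (placeholder path).
os.makedirs("/tmp/spark-events", exist_ok=True)

spark = (SparkSession.builder
         .master("local[*]")
         .appName("storage-logging-demo")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:/tmp/spark-events")
         .config("spark.eventLog.logBlockUpdates.enabled", "true")  # the setting discussed above
         .getOrCreate())

# Cache something so block-update events are actually generated and logged.
df = spark.range(1000).cache()
df.count()

spark.stop()
```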
## How was this patch tested?
Existing UTs

Closes #24390 from shahidki31/enableRddStorageTest.

Authored-by: Shahid 
Signed-off-by: Sean Owen 
---
 .../one_rdd_storage_json_expectation.json  | 10 ++
 .../org/apache/spark/deploy/history/HistoryServerSuite.scala   |  8 +---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git 
a/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
 
b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
new file mode 100644
index 000..09afdf5
--- /dev/null
+++ 
b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
@@ -0,0 +1,10 @@
+{
+  "id" : 0,
+  "name" : "0",
+  "numPartitions" : 8,
+  "numCachedPartitions" : 0,
+  "storageLevel" : "Memory Deserialized 1x Replicated",
+  "memoryUsed" : 0,
+  "diskUsed" : 0,
+  "partitions" : [ ]
+}
diff --git 
a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala 
b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
index c99bcf2..5ba27a0 100644
--- 
a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
@@ -174,9 +174,11 @@ class HistoryServerSuite extends SparkFunSuite with 
BeforeAndAfter with Matchers
 "executor node blacklisting unblacklisting" -> 
"applications/app-20161115172038-/executors",
 "executor memory usage" -> 
"applications/app-20161116163331-/executors",
 
-"app environment" -> "applications/app-20161116163331-/environment"
-// Todo: enable this test when logging the even of onBlockUpdated. See: 
SPARK-13845
-// "one rdd storage json" -> 
"applications/local-1422981780767/storage/rdd/0"
+"app environment" -> "applications/app-20161116163331-/environment",
+
+// Enable "spark.eventLog.logBlockUpdates.enabled", to get the storage 
information
+// in the history server.
+"one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
   )
 
   // run a bunch of characterization tests -- just verify the behavior is the 
same as what is saved





[spark] branch master updated: [SPARK-27504][SQL] File source V2: support refreshing metadata cache

2019-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 31488e1  [SPARK-27504][SQL] File source V2: support refreshing metadata cache
31488e1 is described below

commit 31488e1ca506efd34459e6bc9a08b6d0956c8d44
Author: Gengliang Wang 
AuthorDate: Fri Apr 19 18:26:03 2019 +0800

[SPARK-27504][SQL] File source V2: support refreshing metadata cache

## What changes were proposed in this pull request?

In file source V1, if a file is deleted manually, reading the DataFrame/Table throws an exception with the suggestion message:
```
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
```
After refreshing the table/DataFrame, the reads should return correct results.

We should follow it in file source V2 as well.
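
As a reference for the recovery paths that the message above suggests, here is a minimal sketch against a hypothetical path-based table. `REFRESH TABLE`, `Catalog.refreshTable` and `Catalog.refreshByPath` are existing Spark APIs; the path and view name are placeholders invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("refresh-demo").getOrCreate()

path = "/tmp/refresh_demo_parquet"  # placeholder path
spark.range(10).write.mode("overwrite").parquet(path)
spark.read.parquet(path).createOrReplaceTempView("my_table")  # hypothetical view name

# If files under `path` change behind Spark's back, any of these clears the
# cached file listing before the next read:
spark.sql("REFRESH TABLE my_table")      # SQL form
spark.catalog.refreshTable("my_table")   # programmatic equivalent
spark.catalog.refreshByPath(path)        # path-based variant

# Or simply recreate the DataFrame so the underlying files are re-listed:
df = spark.read.parquet(path)
df.count()
```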
## How was this patch tested?
Unit test

Closes #24401 from gengliangwang/refreshFileTable.

Authored-by: Gengliang Wang 
Signed-off-by: Wenchen Fan 
---
 .../datasources/v2/DataSourceV2Relation.scala  |  5 +++
 .../datasources/v2/FilePartitionReader.scala   | 38 --
 .../org/apache/spark/sql/MetadataCacheSuite.scala  | 35 ++--
 3 files changed, 50 insertions(+), 28 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
index 4119957..e7e0be0 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
@@ -66,6 +66,11 @@ case class DataSourceV2Relation(
   override def newInstance(): DataSourceV2Relation = {
 copy(output = output.map(_.newInstance()))
   }
+
+  override def refresh(): Unit = table match {
+case table: FileTable => table.fileIndex.refresh()
+case _ => // Do nothing.
+  }
 }
 
 /**
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
index 7c7b468..d4bad29 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
@@ -34,25 +34,27 @@ class FilePartitionReader[T](readers: 
Iterator[PartitionedFileReader[T]])
   override def next(): Boolean = {
 if (currentReader == null) {
   if (readers.hasNext) {
-if (ignoreMissingFiles || ignoreCorruptFiles) {
-  try {
-currentReader = getNextReader()
-  } catch {
-case e: FileNotFoundException if ignoreMissingFiles =>
-  logWarning(s"Skipped missing file: $currentReader", e)
-  currentReader = null
-  return false
-// Throw FileNotFoundException even if `ignoreCorruptFiles` is true
-case e: FileNotFoundException if !ignoreMissingFiles => throw e
-case e @ (_: RuntimeException | _: IOException) if 
ignoreCorruptFiles =>
-  logWarning(
-s"Skipped the rest of the content in the corrupted file: 
$currentReader", e)
-  currentReader = null
-  InputFileBlockHolder.unset()
-  return false
-  }
-} else {
+try {
   currentReader = getNextReader()
+} catch {
+  case e: FileNotFoundException if ignoreMissingFiles =>
+logWarning(s"Skipped missing file: $currentReader", e)
+currentReader = null
+return false
+  // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
+  case e: FileNotFoundException if !ignoreMissingFiles =>
+throw new FileNotFoundException(
+  e.getMessage + "\n" +
+"It is possible the underlying files have been updated. " +
+"You can explicitly invalidate the cache in Spark by " +
+"running 'REFRESH TABLE tableName' command in SQL or " +
+"by recreating the Dataset/DataFrame involved.")
+  case e @ (_: RuntimeException | _: IOException) if 
ignoreCorruptFiles =>
+logWarning(
+  s"Skipped the rest of the content in the corrupted file: 
$currentReader", e)
+currentReader = null
+InputFileBlockHolder.unset()
+return false
 }
   } else {
 return false
diff --git 

[spark] branch master updated: [SPARK-27514] Skip collapsing windows with empty window expressions

2019-04-19 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 163a6e2  [SPARK-27514] Skip collapsing windows with empty window expressions
163a6e2 is described below

commit 163a6e298213f216f74f4764e241ee6298ea30b6
Author: Yifei Huang 
AuthorDate: Fri Apr 19 14:04:44 2019 +0800

[SPARK-27514] Skip collapsing windows with empty window expressions

## What changes were proposed in this pull request?

A previous change moved the removal of empty window expressions to the RemoveNoopOperators rule, which runs after the CollapseWindow rule. Therefore, by the time we get to CollapseWindow, we are no longer guaranteed that empty windows have been removed. This change checks that the window expressions are not empty, and only collapses two adjacent windows if both are non-empty.

A lengthier description and repro steps are here: https://issues.apache.org/jira/browse/SPARK-27514
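
To make the rule concrete, here is a small, hypothetical PySpark illustration of the case CollapseWindow is meant to handle: two adjacent Window operators with the same partitioning and ordering, both carrying non-empty window expressions, which the optimizer can then merge. This is only an illustration of the rule, not the repro from the Jira ticket.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").appName("collapse-window-demo").getOrCreate()

df = spark.range(10).withColumn("grp", F.col("id") % 3)
w = Window.partitionBy("grp").orderBy("id")

# Two separate withColumn calls over the same window spec yield two Window
# operators in the analyzed plan; since both have non-empty window expressions
# of the same function type, CollapseWindow is allowed to merge them.
out = (df.withColumn("running_sum", F.sum("id").over(w))
         .withColumn("rnk", F.rank().over(w)))

# In the optimized plan the two windows are typically collapsed into one.
out.explain(True)
```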

## How was this patch tested?

A unit test, plus I reran the breaking case mentioned in the Jira ticket.

Closes #24411 from yifeih/yh/spark-27514.

Authored-by: Yifei Huang 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/sql/catalyst/optimizer/Optimizer.scala   |  1 +
 .../spark/sql/catalyst/optimizer/CollapseWindowSuite.scala| 11 +++
 2 files changed, 12 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index afdf61e..f32f2c7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -770,6 +770,7 @@ object CollapseWindow extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
 case w1 @ Window(we1, ps1, os1, w2 @ Window(we2, ps2, os2, grandChild))
 if ps1 == ps2 && os1 == os2 && 
w1.references.intersect(w2.windowOutputSet).isEmpty &&
+  we1.nonEmpty && we2.nonEmpty &&
   // This assumes Window contains the same type of window expressions. 
This is ensured
   // by ExtractWindowFunctions.
   WindowFunctionType.functionType(we1.head) == 
WindowFunctionType.functionType(we2.head) =>
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
index 52054c2..3b3b490 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
@@ -89,4 +89,15 @@ class CollapseWindowSuite extends PlanTest {
 val optimized = Optimize.execute(query.analyze)
 comparePlans(optimized, expected)
   }
+
+  test("Skip windows with empty window expressions") {
+val query = testRelation
+  .window(Seq(), partitionSpec1, orderSpec1)
+  .window(Seq(sum(a).as('sum_a)), partitionSpec1, orderSpec1)
+
+val optimized = Optimize.execute(query.analyze)
+val correctAnswer = query.analyze
+
+comparePlans(optimized, correctAnswer)
+  }
 }

