[spark] branch branch-2.3 updated: [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new a85ab12  [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x
a85ab12 is described below

commit a85ab120e3d29a323e6d28aa307d4c20ee5f2c6c
Author: shane knapp
AuthorDate: Fri Apr 19 09:45:40 2019 -0700

    [SPARK-25079][PYTHON][BRANCH-2.3] update python3 executable to 3.6.x

    ## What changes were proposed in this pull request?

    have jenkins test against python3.6 (instead of 3.4).

    ## How was this patch tested?

    extensive testing on both the centos and ubuntu jenkins workers revealed
    that 2.3 probably doesn't like python 3.6... :(

    NOTE: this is just for branch-2.3 PLEASE DO NOT MERGE

    Author: shane knapp

    Closes #24380 from shaneknapp/update-python-executable-2.3.
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index 3539c76..d571855 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -114,7 +114,7 @@ def run_individual_python_test(test_name, pyspark_python):
 
 
 def get_default_python_executables():
-    python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
     if "python2.7" not in python_execs:
         LOGGER.warning("Not testing against `python2.7` because it could not be found; falling"
                        " back to `python` instead")
[spark] branch branch-2.4 updated: [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new eaa88ae  [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x
eaa88ae is described below

commit eaa88ae5237b23fb7497838f3897a64641efe383
Author: shane knapp
AuthorDate: Fri Apr 19 09:44:06 2019 -0700

    [SPARK-25079][PYTHON][BRANCH-2.4] update python3 executable to 3.6.x

    ## What changes were proposed in this pull request?

    have jenkins test against python3.6 (instead of 3.4).

    ## How was this patch tested?

    extensive testing on both the centos and ubuntu jenkins workers revealed
    that 2.4 doesn't like python 3.6... :(

    NOTE: this is just for branch-2.4 PLEASE DO NOT MERGE

    Closes #24379 from shaneknapp/update-python-executable.

    Authored-by: shane knapp
    Signed-off-by: shane knapp
---
 python/run-tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/run-tests.py b/python/run-tests.py
index ccbdfac..921fdc9 100755
--- a/python/run-tests.py
+++ b/python/run-tests.py
@@ -162,7 +162,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
 
 
 def get_default_python_executables():
-    python_execs = [x for x in ["python2.7", "python3.4", "pypy"] if which(x)]
+    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
    if "python2.7" not in python_execs:
         LOGGER.warning("Not testing against `python2.7` because it could not be found; falling"
                        " back to `python` instead")
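Both branch commits above make the same one-line change to `get_default_python_executables()` in python/run-tests.py: the function filters a hard-coded candidate list through a `which()` lookup, so only interpreters actually installed on the Jenkins worker get exercised. A minimal, standalone sketch of that logic — using the standard library's `shutil.which` in place of the script's own `which()` helper, with the plain-`python` fallback inferred from the warning text in the diff — looks like this:

```
#!/usr/bin/env python
# Hedged sketch of the interpreter discovery in python/run-tests.py.
# Assumption: shutil.which stands in for the script's which() helper, and
# the plain-`python` fallback mirrors the LOGGER.warning message in the diff.
from shutil import which


def get_default_python_executables():
    # Keep only the interpreters that are actually on PATH.
    python_execs = [x for x in ["python2.7", "python3.6", "pypy"] if which(x)]
    if "python2.7" not in python_execs:
        print("Not testing against `python2.7` because it could not be found;"
              " falling back to `python` instead")
        python_execs.insert(0, "python")
    return python_execs


if __name__ == "__main__":
    print(get_default_python_executables())
```

Running this on a worker shows at a glance which interpreters a test run would cover, which is how a change like 3.4 → 3.6 can be sanity-checked before touching the Jenkins configuration.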
[spark] branch master updated: [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
This is an automated email from the ASF dual-hosted git repository.

lixiao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 777b450  [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2
777b450 is described below

commit 777b4502b206b7240c6655d3c0b0a9ce08f6a09c
Author: Yuming Wang
AuthorDate: Fri Apr 19 08:59:08 2019 -0700

    [SPARK-27176][FOLLOW-UP][SQL] Upgrade Hive parquet to 1.10.1 for hadoop-3.2

    ## What changes were proposed in this pull request?

    When we compile and test against Hadoop 3.2, we hit the following two issues:

    1. JobSummaryLevel is not a member of object org.apache.parquet.hadoop.ParquetOutputFormat.
       Fixed by [PARQUET-381](https://issues.apache.org/jira/browse/PARQUET-381) (Parquet 1.9.0)
    2. java.lang.NoSuchFieldError: BROTLI at
       org.apache.parquet.hadoop.metadata.CompressionCodecName.<clinit>(CompressionCodecName.java:31).
       Fixed by [PARQUET-1143](https://issues.apache.org/jira/browse/PARQUET-1143) (Parquet 1.10.0)

    The reason is that parquet-hadoop-bundle-1.8.1.jar conflicts with Parquet 1.10.1. I think it
    would be safe to upgrade Hive's parquet to 1.10.1 to work around this issue.

    This is what Hive did when upgrading Parquet 1.8.1 to 1.10.0:
    [HIVE-17000](https://issues.apache.org/jira/browse/HIVE-17000) and
    [HIVE-19464](https://issues.apache.org/jira/browse/HIVE-19464). We can see that all changes
    are related to vectors, and vectors are disabled by default: see
    [HIVE-14826](https://issues.apache.org/jira/browse/HIVE-14826) and
    [HiveConf.java#L2723](https://github.com/apache/hive/blob/rel/release-2.3.4/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2723).

    This pr removes [parquet-hadoop-bundle-1.8.1.jar](https://github.com/apache/parquet-mr/tree/master/parquet-hadoop-bundle),
    so the Hive serde will use [parquet-common-1.10.1.jar, parquet-column-1.10.1.jar and parquet-hadoop-1.10.1.jar](https://github.com/apache/spark/blob/master/dev/deps/spark-deps-hadoop-3.2#L185-L189).

    ## How was this patch tested?

    1. manual tests
    2. [upgrade Hive Parquet to 1.10.1 and run Hadoop 3.2 tests on jenkins](https://github.com/apache/spark/pull/24044#commits-pushed-0c3f962)

    Closes #24346 from wangyum/SPARK-27176.

    Authored-by: Yuming Wang
    Signed-off-by: gatorsmile
---
 dev/deps/spark-deps-hadoop-3.2 | 1 -
 pom.xml                        | 8 +++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3.2 b/dev/deps/spark-deps-hadoop-3.2
index a45f02d..8b3bd79 100644
--- a/dev/deps/spark-deps-hadoop-3.2
+++ b/dev/deps/spark-deps-hadoop-3.2
@@ -187,7 +187,6 @@ parquet-common-1.10.1.jar
 parquet-encoding-1.10.1.jar
 parquet-format-2.4.0.jar
 parquet-hadoop-1.10.1.jar
-parquet-hadoop-bundle-1.6.0.jar
 parquet-jackson-1.10.1.jar
 protobuf-java-2.5.0.jar
 py4j-0.10.8.1.jar
diff --git a/pom.xml b/pom.xml
index fce4cbd..5879a76 100644
--- a/pom.xml
+++ b/pom.xml
@@ -221,6 +221,7 @@
     -->
     compile
     compile
+    ${hive.deps.scope}
     compile
     compile
     test
@@ -2004,7 +2005,7 @@
     ${hive.parquet.group}
     parquet-hadoop-bundle
     ${hive.parquet.version}
-    compile
+    ${hive.parquet.scope}
     org.codehaus.janino
@@ -2818,8 +2819,9 @@
     core
     ${hive23.version}
     2.3.4
-    org.apache.parquet
-    1.8.1
+
+    provided
     4.1.17
[spark] branch master updated: [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 16bbe0f  [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite
16bbe0f is described below

commit 16bbe0f798d2a75ba0a9f2daff709292f6971755
Author: Shahid
AuthorDate: Fri Apr 19 08:12:20 2019 -0700

    [SPARK-27486][CORE][TEST] Enable History server storage information test in the HistoryServerSuite

    ## What changes were proposed in this pull request?

    We disabled a storage-related test in the History server suite after SPARK-13845. But after
    SPARK-22050, block-update events can be stored in the event log by enabling
    "spark.eventLog.logBlockUpdates.enabled=true". So we can enable the test by adding an event
    log for an application that ran with "spark.eventLog.logBlockUpdates.enabled=true".

    ## How was this patch tested?

    Existing UTs

    Closes #24390 from shahidki31/enableRddStorageTest.

    Authored-by: Shahid
    Signed-off-by: Sean Owen
---
 .../one_rdd_storage_json_expectation.json                    | 10 ++++++++++
 .../org/apache/spark/deploy/history/HistoryServerSuite.scala |  8 +++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
new file mode 100644
index 000..09afdf5
--- /dev/null
+++ b/core/src/test/resources/HistoryServerExpectations/one_rdd_storage_json_expectation.json
@@ -0,0 +1,10 @@
+{
+  "id" : 0,
+  "name" : "0",
+  "numPartitions" : 8,
+  "numCachedPartitions" : 0,
+  "storageLevel" : "Memory Deserialized 1x Replicated",
+  "memoryUsed" : 0,
+  "diskUsed" : 0,
+  "partitions" : [ ]
+}
diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
index c99bcf2..5ba27a0 100644
--- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
@@ -174,9 +174,11 @@ class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with Matchers
     "executor node blacklisting unblacklisting" -> "applications/app-20161115172038-/executors",
     "executor memory usage" -> "applications/app-20161116163331-/executors",
-    "app environment" -> "applications/app-20161116163331-/environment"
-    // Todo: enable this test when logging the even of onBlockUpdated. See: SPARK-13845
-    // "one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
+    "app environment" -> "applications/app-20161116163331-/environment",
+
+    // Enable "spark.eventLog.logBlockUpdates.enabled", to get the storage information
+    // in the history server.
+    "one rdd storage json" -> "applications/local-1422981780767/storage/rdd/0"
   )
 
   // run a bunch of characterization tests -- just verify the behavior is the same as what is saved
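The re-enabled test depends on an event log written with the `spark.eventLog.logBlockUpdates.enabled` flag named in the commit message (introduced by SPARK-22050). A hedged PySpark sketch of how such a log could be produced — the application name, event-log directory, and RDD contents are illustrative assumptions, not the fixture used by HistoryServerSuite:

```
# Sketch: generate an event log containing block-update events, so the
# history server can populate /storage/rdd/<id>. Paths and data sizes are
# illustrative; adjust for your deployment.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdd-storage-demo")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:/tmp/spark-events")
         .config("spark.eventLog.logBlockUpdates.enabled", "true")
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(1000), 8)
rdd.cache()   # register the RDD with the block manager
rdd.count()   # materialize it, emitting block-updated events to the log
spark.stop()  # close the event log so the history server can replay it
```

A history server pointed at `spark.history.fs.logDirectory=file:/tmp/spark-events` would then serve storage JSON of the shape shown in the expectation file added above.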
[spark] branch master updated: [SPARK-27504][SQL] File source V2: support refreshing metadata cache
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 31488e1  [SPARK-27504][SQL] File source V2: support refreshing metadata cache
31488e1 is described below

commit 31488e1ca506efd34459e6bc9a08b6d0956c8d44
Author: Gengliang Wang
AuthorDate: Fri Apr 19 18:26:03 2019 +0800

    [SPARK-27504][SQL] File source V2: support refreshing metadata cache

    ## What changes were proposed in this pull request?

    In file source V1, if some file is deleted manually, reading the DataFrame/Table will throw
    an exception with the suggestion message
    ```
    It is possible the underlying files have been updated. You can explicitly invalidate
    the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
    recreating the Dataset/DataFrame involved.
    ```
    After refreshing the table/DataFrame, the reads should return correct results. We should
    follow this in file source V2 as well.

    ## How was this patch tested?

    Unit test

    Closes #24401 from gengliangwang/refreshFileTable.

    Authored-by: Gengliang Wang
    Signed-off-by: Wenchen Fan
---
 .../datasources/v2/DataSourceV2Relation.scala      |  5 +++
 .../datasources/v2/FilePartitionReader.scala       | 38 --
 .../org/apache/spark/sql/MetadataCacheSuite.scala  | 35 ++-
 3 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
index 4119957..e7e0be0 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
@@ -66,6 +66,11 @@ case class DataSourceV2Relation(
   override def newInstance(): DataSourceV2Relation = {
     copy(output = output.map(_.newInstance()))
   }
+
+  override def refresh(): Unit = table match {
+    case table: FileTable => table.fileIndex.refresh()
+    case _ => // Do nothing.
+  }
 }

 /**
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
index 7c7b468..d4bad29 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FilePartitionReader.scala
@@ -34,25 +34,27 @@ class FilePartitionReader[T](readers: Iterator[PartitionedFileReader[T]])
   override def next(): Boolean = {
     if (currentReader == null) {
       if (readers.hasNext) {
-        if (ignoreMissingFiles || ignoreCorruptFiles) {
-          try {
-            currentReader = getNextReader()
-          } catch {
-            case e: FileNotFoundException if ignoreMissingFiles =>
-              logWarning(s"Skipped missing file: $currentReader", e)
-              currentReader = null
-              return false
-            // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
-            case e: FileNotFoundException if !ignoreMissingFiles => throw e
-            case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
-              logWarning(
-                s"Skipped the rest of the content in the corrupted file: $currentReader", e)
-              currentReader = null
-              InputFileBlockHolder.unset()
-              return false
-          }
-        } else {
+        try {
           currentReader = getNextReader()
+        } catch {
+          case e: FileNotFoundException if ignoreMissingFiles =>
+            logWarning(s"Skipped missing file: $currentReader", e)
+            currentReader = null
+            return false
+          // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
+          case e: FileNotFoundException if !ignoreMissingFiles =>
+            throw new FileNotFoundException(
+              e.getMessage + "\n" +
+                "It is possible the underlying files have been updated. " +
+                "You can explicitly invalidate the cache in Spark by " +
+                "running 'REFRESH TABLE tableName' command in SQL or " +
+                "by recreating the Dataset/DataFrame involved.")
+          case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
+            logWarning(
+              s"Skipped the rest of the content in the corrupted file: $currentReader", e)
+            currentReader = null
+            InputFileBlockHolder.unset()
+            return false
         }
       } else {
         return false
       }
diff --git
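For users hitting the new error message, the recovery path is the one the message itself suggests. A hedged PySpark sketch of that flow — the path and view name are illustrative assumptions; only the `REFRESH TABLE` command is quoted from the error text added in the diff above:

```
# Sketch of the recovery flow described in the commit message.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("refresh-demo").getOrCreate()

spark.range(10).write.mode("overwrite").parquet("/tmp/refresh_demo")
spark.read.parquet("/tmp/refresh_demo").createOrReplaceTempView("t")

# ...suppose some underlying files are deleted out-of-band here; a later
# spark.table("t").count() would then fail with the FileNotFoundException
# carrying the "REFRESH TABLE" hint shown above...

spark.sql("REFRESH TABLE t")        # SQL form suggested by the error message
# spark.catalog.refreshTable("t")   # equivalent programmatic form
print(spark.table("t").count())     # reads now see the current file listing
```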
[spark] branch master updated: [SPARK-27514] Skip collapsing windows with empty window expressions
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 163a6e2  [SPARK-27514] Skip collapsing windows with empty window expressions
163a6e2 is described below

commit 163a6e298213f216f74f4764e241ee6298ea30b6
Author: Yifei Huang
AuthorDate: Fri Apr 19 14:04:44 2019 +0800

    [SPARK-27514] Skip collapsing windows with empty window expressions

    ## What changes were proposed in this pull request?

    A previous change moved the removal of empty window expressions to the RemoveNoopOperators
    rule, which comes after the CollapseWindow rule. Therefore, by the time we get to
    CollapseWindow, we aren't guaranteed that empty windows have been removed. This change
    checks that the window expressions are not empty, and only collapses the windows if both
    windows are non-empty.

    A lengthier description and repro steps here:
    https://issues.apache.org/jira/browse/SPARK-27514

    ## How was this patch tested?

    A unit test, plus I reran the breaking case mentioned in the Jira ticket.

    Closes #24411 from yifeih/yh/spark-27514.

    Authored-by: Yifei Huang
    Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/catalyst/optimizer/Optimizer.scala    |  1 +
 .../spark/sql/catalyst/optimizer/CollapseWindowSuite.scala     | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index afdf61e..f32f2c7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -770,6 +770,7 @@ object CollapseWindow extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case w1 @ Window(we1, ps1, os1, w2 @ Window(we2, ps2, os2, grandChild))
         if ps1 == ps2 && os1 == os2 && w1.references.intersect(w2.windowOutputSet).isEmpty &&
+          we1.nonEmpty && we2.nonEmpty &&
           // This assumes Window contains the same type of window expressions. This is ensured
           // by ExtractWindowFunctions.
           WindowFunctionType.functionType(we1.head) == WindowFunctionType.functionType(we2.head) =>
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
index 52054c2..3b3b490 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseWindowSuite.scala
@@ -89,4 +89,15 @@ class CollapseWindowSuite extends PlanTest {
     val optimized = Optimize.execute(query.analyze)
     comparePlans(optimized, expected)
   }
+
+  test("Skip windows with empty window expressions") {
+    val query = testRelation
+      .window(Seq(), partitionSpec1, orderSpec1)
+      .window(Seq(sum(a).as('sum_a)), partitionSpec1, orderSpec1)
+
+    val optimized = Optimize.execute(query.analyze)
+    val correctAnswer = query.analyze
+
+    comparePlans(optimized, correctAnswer)
+  }
 }
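For readers unfamiliar with the rule being patched: CollapseWindow merges adjacent Window operators that share the same partitioning and ordering specs. A hedged PySpark sketch of the plan shape it targets — data and column names are made up; the empty-expression corner case fixed above arises from intermediate optimizer output, not from user code like this:

```
# Sketch of two stacked window projections over identical partition/order
# specs -- the plan shape CollapseWindow can merge into a single Window
# operator. Data and column names are illustrative.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collapse-window-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["k", "v"])

w = Window.partitionBy("k").orderBy("v")
out = (df
       .withColumn("running_sum", F.sum("v").over(w))
       .withColumn("running_max", F.max("v").over(w)))

# With the fix, collapsing still applies here (both window expression lists
# are non-empty); a Window with no expressions is now left for
# RemoveNoopOperators instead of being merged.
out.explain(True)  # the optimized plan should show a single Window operator
```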