[spark] branch master updated: [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a745381  [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0
a745381 is described below

commit a745381b9d3dd290057ef3089de7fdb9264f1f8b
Author: WeichenXu
AuthorDate: Wed Jul 31 14:26:18 2019 +0900

    [SPARK-25382][SQL][PYSPARK] Remove ImageSchema.readImages in 3.0

    ## What changes were proposed in this pull request?

    I removed the deprecated `ImageSchema.readImages` and moved some useful methods from class `ImageSchema` into class `ImageFileFormat`. In PySpark, I renamed the `ImageSchema` class to `ImageUtils` and kept some useful Python methods in it.

    ## How was this patch tested?

    UT.

    Closes #25245 from WeichenXu123/remove_image_schema.

    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
---
 .../org/apache/spark/ml/image/ImageSchema.scala    |  72 -
 .../apache/spark/ml/image/ImageSchemaSuite.scala   | 171 -
 .../ml/source/image/ImageFileFormatSuite.scala     |  18 +++
 project/MimaExcludes.scala                         |   6 +-
 python/pyspark/ml/image.py                         |  38 -
 python/pyspark/ml/tests/test_image.py              |  29 ++--
 6 files changed, 42 insertions(+), 292 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index a7ddf2f..0313626 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -191,76 +191,4 @@ object ImageSchema {
       Some(Row(Row(origin, height, width, nChannels, mode, decoded)))
     }
   }
-
-  /**
-   * Read the directory of images from the local or remote source
-   *
-   * @note If multiple jobs are run in parallel with different sampleRatio or recursive flag,
-   * there may be a race condition where one job overwrites the hadoop configs of another.
-   * @note If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
-   * potentially non-deterministic.
-   *
-   * @param path Path to the image directory
-   * @return DataFrame with a single column "image" of images;
-   *         see ImageSchema for the details
-   */
-  @deprecated("use `spark.read.format(\"image\").load(path)` and this `readImages` will be " +
-    "removed in 3.0.0.", "2.4.0")
-  def readImages(path: String): DataFrame = readImages(path, null, false, -1, false, 1.0, 0)
-
-  /**
-   * Read the directory of images from the local or remote source
-   *
-   * @note If multiple jobs are run in parallel with different sampleRatio or recursive flag,
-   * there may be a race condition where one job overwrites the hadoop configs of another.
-   * @note If sample ratio is less than 1, sampling uses a PathFilter that is efficient but
-   * potentially non-deterministic.
-   *
-   * @param path Path to the image directory
-   * @param sparkSession Spark Session, if omitted gets or creates the session
-   * @param recursive Recursive path search flag
-   * @param numPartitions Number of the DataFrame partitions,
-   *                      if omitted uses defaultParallelism instead
-   * @param dropImageFailures Drop the files that are not valid images from the result
-   * @param sampleRatio Fraction of the files loaded
-   * @return DataFrame with a single column "image" of images;
-   *         see ImageSchema for the details
-   */
-  @deprecated("use `spark.read.format(\"image\").load(path)` and this `readImages` will be " +
-    "removed in 3.0.0.", "2.4.0")
-  def readImages(
-      path: String,
-      sparkSession: SparkSession,
-      recursive: Boolean,
-      numPartitions: Int,
-      dropImageFailures: Boolean,
-      sampleRatio: Double,
-      seed: Long): DataFrame = {
-    require(sampleRatio <= 1.0 && sampleRatio >= 0, "sampleRatio should be between 0 and 1")
-
-    val session = if (sparkSession != null) sparkSession else SparkSession.builder().getOrCreate
-    val partitions =
-      if (numPartitions > 0) {
-        numPartitions
-      } else {
-        session.sparkContext.defaultParallelism
-      }
-
-    RecursiveFlag.withRecursiveFlag(recursive, session) {
-      SamplePathFilter.withPathFilter(sampleRatio, session, seed) {
-        val binResult = session.sparkContext.binaryFiles(path, partitions)
-        val streams = if (numPartitions == -1) binResult else binResult.repartition(partitions)
-        val convert = (origin: String, bytes: PortableDataStream) =>
-          decode(origin, bytes.toArray())
-        val images = if (dropImageFailures) {
-
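For readers migrating off `readImages`, a minimal PySpark sketch of the replacement data source path; the directory path is a placeholder, and `dropInvalid` is the image source's analogue of the removed `dropImageFailures` flag:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Replacement for ImageSchema.readImages(path): the built-in "image" source.
images = (spark.read.format("image")
          .option("dropInvalid", True)   # drop files that are not valid images
          .load("/path/to/images"))      # hypothetical directory

images.select("image.origin", "image.height", "image.width").show()
```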
[spark] branch branch-2.3 updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.3 by this push:
     new 78d1bb1  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
78d1bb1 is described below

commit 78d1bb188efa55038f63ece70aaa0a5ebaa75f5f
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.
    -- | -- | --

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as below:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So there is a small mistake in the docs that could confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
    (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02)
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 42e6d87..822f903 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -867,11 +867,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
 
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple
   progress bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.
-
[spark] branch branch-2.4 updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 992b1bb  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
992b1bb is described below

commit 992b1bb3697f6ec9fb5fc2853c33b591715dfd76
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.
    -- | -- | --

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as below:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So there is a small mistake in the docs that could confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
    (cherry picked from commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02)
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 73bae77..d481986 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -891,11 +891,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
 
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple
   progress bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.
-
[spark] branch master updated: [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new dba4375  [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress
dba4375 is described below

commit dba4375359a2dfed1f009edc3b1bcf6b3253fe02
Author: gengjiaan
AuthorDate: Wed Jul 31 12:17:44 2019 +0900

    [MINOR][CORE][DOCS] Fix inconsistent description of showConsoleProgress

    ## What changes were proposed in this pull request?

    The latest docs http://spark.apache.org/docs/latest/configuration.html contain this description:

    spark.ui.showConsoleProgress | true | Show the progress bar in the console. The progress bar shows the progress of stages that run for longer than 500ms. If multiple stages run at the same time, multiple progress bars will be displayed on the same line.
    -- | -- | --

    But the class `org.apache.spark.internal.config.UI` defines the config `spark.ui.showConsoleProgress` as below:

    ```
    val UI_SHOW_CONSOLE_PROGRESS = ConfigBuilder("spark.ui.showConsoleProgress")
      .doc("When true, show the progress bar in the console.")
      .booleanConf
      .createWithDefault(false)
    ```

    So there is a small mistake in the docs that could confuse readers.

    ## How was this patch tested?

    No UT needed.

    Closes #25297 from beliefer/inconsistent-desc-showConsoleProgress.

    Authored-by: gengjiaan
    Signed-off-by: HyukjinKwon
---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 10886241..93a4fcc 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1062,11 +1062,13 @@ Apart from these, the following properties are also available, and may be useful
   spark.ui.showConsoleProgress
-  true
+  false
 
   Show the progress bar in the console. The progress bar shows the progress of stages
   that run for longer than 500ms. If multiple stages run at the same time, multiple
   progress bars will be displayed on the same line.
+
+  Note: In shell environment, the default value of spark.ui.showConsoleProgress is true.
-
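Since the effective default now differs between plain applications (false) and the shells (true), a short PySpark sketch of enabling the progress bar explicitly outside a shell:

```python
from pyspark.sql import SparkSession

# Opt in to the console progress bar for a non-shell application, where the
# documented (and actual) default is false.
spark = (SparkSession.builder
         .config("spark.ui.showConsoleProgress", "true")
         .getOrCreate())
```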
[spark] branch master updated: [SPARK-28038][SQL][TEST] Port text.sql
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 261e113  [SPARK-28038][SQL][TEST] Port text.sql
261e113 is described below

commit 261e113449cf1ae84feb073caff01cc27cb5d10f
Author: Yuming Wang
AuthorDate: Wed Jul 31 11:36:26 2019 +0900

    [SPARK-28038][SQL][TEST] Port text.sql

    ## What changes were proposed in this pull request?

    This PR ports text.sql from the PostgreSQL regression tests:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/text.sql

    The expected results can be found at:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/text.out

    When porting the test cases, we found one PostgreSQL-specific feature that does not exist in Spark SQL:

    [SPARK-28037](https://issues.apache.org/jira/browse/SPARK-28037): Add built-in String Functions: quote_literal

    We also found three inconsistent behaviors:

    [SPARK-27930](https://issues.apache.org/jira/browse/SPARK-27930): Spark SQL's format_string can not fully support PostgreSQL's format
    [SPARK-28036](https://issues.apache.org/jira/browse/SPARK-28036): Built-in udf left/right has inconsistent behavior
    [SPARK-28033](https://issues.apache.org/jira/browse/SPARK-28033): String concatenation should have lower priority than other operators

    ## How was this patch tested?

    N/A

    Closes #24862 from wangyum/SPARK-28038.

    Authored-by: Yuming Wang
    Signed-off-by: HyukjinKwon
---
 .../test/resources/sql-tests/inputs/pgSQL/text.sql | 137 ++++
 .../resources/sql-tests/results/pgSQL/text.sql.out | 375 +
 2 files changed, 512 insertions(+)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql
new file mode 100644
index 000..04d3acc
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/text.sql
@@ -0,0 +1,137 @@
+--
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+--
+-- TEXT
+-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/text.sql
+
+SELECT string('this is a text string') = string('this is a text string') AS true;
+
+SELECT string('this is a text string') = string('this is a text strin') AS `false`;
+
+CREATE TABLE TEXT_TBL (f1 string) USING parquet;
+
+INSERT INTO TEXT_TBL VALUES ('doh!');
+INSERT INTO TEXT_TBL VALUES ('hi de ho neighbor');
+
+SELECT '' AS two, * FROM TEXT_TBL;
+
+-- As of 8.3 we have removed most implicit casts to text, so that for example
+-- this no longer works:
+-- Spark SQL implicit cast integer to string
+select length(42);
+
+-- But as a special exception for usability's sake, we still allow implicit
+-- casting to text in concatenations, so long as the other input is text or
+-- an unknown literal. So these work:
+-- [SPARK-28033] String concatenation low priority than other arithmeticBinary
+select string('four: ') || 2+2;
+select 'four: ' || 2+2;
+
+-- but not this:
+-- Spark SQL implicit cast both side to string
+select 3 || 4.0;
+
+/*
+ * various string functions
+ */
+select concat('one');
+-- Spark SQL does not support YYYYMMDD, we replace it to yyyyMMdd
+select concat(1,2,3,'hello',true, false, to_date('20100309','yyyyMMdd'));
+select concat_ws('#','one');
+select concat_ws('#',1,2,3,'hello',true, false, to_date('20100309','yyyyMMdd'));
+select concat_ws(',',10,20,null,30);
+select concat_ws('',10,20,null,30);
+select concat_ws(NULL,10,20,null,30) is null;
+select reverse('abcde');
+-- [SPARK-28036] Built-in udf left/right has inconsistent behavior
+-- [SPARK-28479] Parser error when enabling ANSI mode
+set spark.sql.parser.ansi.enabled=false;
+select i, left('ahoj', i), right('ahoj', i) from range(-5, 6) t(i) order by i;
+set spark.sql.parser.ansi.enabled=true;
+-- [SPARK-28037] Add built-in String Functions: quote_literal
+-- select quote_literal('');
+-- select quote_literal('abc''');
+-- select quote_literal(e'\\');
+
+-- Skip these tests because Spark does not support variadic labeled argument
+-- check variadic labeled argument
+-- select concat(variadic array[1,2,3]);
+-- select concat_ws(',', variadic array[1,2,3]);
+-- select concat_ws(',', variadic NULL::int[]);
+-- select concat(variadic NULL::int[]) is NULL;
+-- select concat(variadic '{}'::int[]) = '';
+--should fail
+-- select concat_ws(',', variadic 10);
+
+-- [SPARK-27930] Replace format to format_string
+/*
+ * format
+ */
+select format_string(NULL);
+select format_string('Hello');
+select format_string('Hello %s', 'World');
+select format_string('Hello %%');
+select format_string('Hello ');
+-- should fail
+select format_string('Hello %s %s', 'World');
+select format_string('Hello %s');
+select format_string('Hello %x', 20);
+--
[spark] branch master updated: [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3b14088  [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon
3b14088 is described below

commit 3b140885410362fced9d98fca61d6a357de604af
Author: WeichenXu
AuthorDate: Wed Jul 31 09:10:24 2019 +0900

    [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon

    ## What changes were proposed in this pull request?

    The PySpark worker daemon reads from stdin the worker PIDs to kill:
    https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127

    However, the worker process is forked from the worker daemon process and we didn't close stdin on the child after the fork. This means the child and the user program can read stdin as well, which blocks the daemon from receiving the PID to kill. This can cause issues because the task reaper might detect that the task was not terminated and eventually kill the JVM.

    This PR fixes this by redirecting the standard input of the forked child to devnull.

    ## How was this patch tested?

    Manually tested. In `pyspark`, run:

    ```
    import subprocess
    def task(_):
        subprocess.check_output(["cat"])

    sc.parallelize(range(1), 1).mapPartitions(task).count()
    ```

    Before: the job gets stuck; pressing Ctrl+C exits the job, but the Python worker process does not exit.

    After: the job finishes correctly, the "cat" prints nothing (because the dummy stdin is "/dev/null"), and the Python worker process exits normally.

    Closes #25138 from WeichenXu123/SPARK-26175.

    Authored-by: WeichenXu
    Signed-off-by: HyukjinKwon
---
 python/pyspark/daemon.py             | 15 +++++++++++++++
 python/pyspark/sql/tests/test_udf.py | 12 ++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 6f42ad3..97b6b25 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -160,6 +160,21 @@ def manager():
             if pid == 0:
                 # in child process
                 listen_sock.close()
+
+                # It should close the standard input in the child process so that
+                # Python native function executions stay intact.
+                #
+                # Note that if we just close the standard input (file descriptor 0),
+                # the lowest file descriptor (file descriptor 0) will be allocated,
+                # later when other file descriptors should happen to open.
+                #
+                # Therefore, here we redirects it to '/dev/null' by duplicating
+                # another file descriptor for '/dev/null' to the standard input (0).
+                # See SPARK-26175.
+                devnull = open(os.devnull, 'r')
+                os.dup2(devnull.fileno(), 0)
+                devnull.close()
+
                 try:
                     # Acknowledge that the fork was successful
                     outfile = sock.makefile(mode="wb")
diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py
index 803d471..1999311 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -616,6 +616,18 @@ class UDFTests(ReusedSQLTestCase):
         self.spark.range(1).select(f()).collect()
 
+    def test_worker_original_stdin_closed(self):
+        # Test if it closes the original standard input of worker inherited from the daemon,
+        # and replaces it with '/dev/null'. See SPARK-26175.
+        def task(iterator):
+            import sys
+            res = sys.stdin.read()
+            # Because the standard input is '/dev/null', it reaches to EOF.
+            assert res == '', "Expect read EOF from stdin."
+            return iterator
+
+        self.sc.parallelize(range(1), 1).mapPartitions(task).count()
+
 
 class UDFInitializationTests(unittest.TestCase):
     def tearDown(self):
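To make the fd-0 subtlety above concrete, here is a small standalone Python sketch (not Spark code) of the hazard and of the dup2-based fix the daemon uses:

```python
import os

# Hazard sketch: after closing stdin, descriptor 0 becomes the lowest free
# fd, so an unrelated open() silently turns into the process's "stdin".
os.close(0)
probe = open(os.devnull, "r")
assert probe.fileno() == 0  # the unrelated file now occupies fd 0

# The daemon's fix: duplicate /dev/null onto fd 0 so it stays occupied and
# anything reading stdin -- like the "cat" above -- just sees EOF.
devnull = open(os.devnull, "r")
os.dup2(devnull.fileno(), 0)  # closes the old fd 0 and points it at /dev/null
devnull.close()
```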
[spark] branch branch-2.4 updated (5f4feeb -> 9d9c5a5)
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a change to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 5f4feeb  [SPARK-25474][SQL][2.4] Support `spark.sql.statistics.fallBackToHdfs` in data source tables
 add 9d9c5a5  [SPARK-26152][CORE][2.4] Synchronize Worker Cleanup with Worker Shutdown

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/deploy/worker/Worker.scala    | 68 +-
 1 file changed, 39 insertions(+), 29 deletions(-)
[spark] branch master updated: [SPARK-28209][CORE][SHUFFLE] Proposed new shuffle writer API
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new abef84a  [SPARK-28209][CORE][SHUFFLE] Proposed new shuffle writer API
abef84a is described below

commit abef84a868e9e15f346eea315bbab0ec8ac8e389
Author: mcheah
AuthorDate: Tue Jul 30 14:17:30 2019 -0700

    [SPARK-28209][CORE][SHUFFLE] Proposed new shuffle writer API

    ## What changes were proposed in this pull request?

    As part of the shuffle storage API proposed in SPARK-25299, this introduces an API for persisting shuffle data in arbitrary storage systems. This patch introduces several concepts:

    * `ShuffleDataIO`, which is the root of the entire plugin tree that will be proposed over the course of the shuffle API project.
    * `ShuffleExecutorComponents` - the subset of plugins for managing shuffle-related components for each executor. This will in turn instantiate shuffle readers and writers.
    * `ShuffleMapOutputWriter` interface - instantiated once per map task. This provides child `ShufflePartitionWriter` instances for persisting the bytes for each partition in the map task.

    The default implementation of these plugins exactly mirrors what was done by the existing shuffle writing code - namely, writing the data to local disk and writing an index file. We leverage the APIs in the `BypassMergeSortShuffleWriter` only. Follow-up PRs will use the APIs in `SortShuffleWriter` and `UnsafeShuffleWriter`, but are left as future work to minimize the review surface area.

    ## How was this patch tested?

    New unit tests were added. Micro-benchmarks indicate there's no slowdown in the affected code paths.

    Closes #25007 from mccheah/spark-shuffle-writer-refactor.

    Lead-authored-by: mcheah
    Co-authored-by: mccheah
    Signed-off-by: Marcelo Vanzin
---
 .../apache/spark/shuffle/api/ShuffleDataIO.java    |  49
 .../shuffle/api/ShuffleExecutorComponents.java     |  55 +
 .../spark/shuffle/api/ShuffleMapOutputWriter.java  |  71 ++
 .../spark/shuffle/api/ShufflePartitionWriter.java  |  98
 .../shuffle/api/WritableByteChannelWrapper.java    |  42
 .../shuffle/sort/BypassMergeSortShuffleWriter.java | 173 +-
 .../shuffle/sort/io/LocalDiskShuffleDataIO.java    |  40
 .../io/LocalDiskShuffleExecutorComponents.java     |  71 ++
 .../sort/io/LocalDiskShuffleMapOutputWriter.java   | 261 +
 .../org/apache/spark/internal/config/package.scala |   7 +
 .../spark/shuffle/sort/SortShuffleManager.scala    |  25 +-
 .../main/scala/org/apache/spark/util/Utils.scala   |  30 ++-
 .../test/scala/org/apache/spark/ShuffleSuite.scala |  16 +-
 .../sort/BypassMergeSortShuffleWriterSuite.scala   | 149 +++-
 .../io/LocalDiskShuffleMapOutputWriterSuite.scala  | 147
 15 files changed, 1087 insertions(+), 147 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDataIO.java b/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDataIO.java
new file mode 100644
index 000..e9e50ec
--- /dev/null
+++ b/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDataIO.java
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle.api;
+
+import org.apache.spark.annotation.Private;
+
+/**
+ * :: Private ::
+ * An interface for plugging in modules for storing and reading temporary shuffle data.
+ *
+ * This is the root of a plugin system for storing shuffle bytes to arbitrary storage
+ * backends in the sort-based shuffle algorithm implemented by the
+ * {@link org.apache.spark.shuffle.sort.SortShuffleManager}. If another shuffle algorithm is
+ * needed instead of sort-based shuffle, one should implement
+ * {@link org.apache.spark.shuffle.ShuffleManager} instead.
+ *
+ * A single instance of this module is loaded per process in the Spark application.
+ * The default implementation reads and writes shuffle data from the local disks of
+ * the executor, and is the implementation of shuffle file
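For context, a hedged PySpark sketch of the plugin point this commit wires up; I am assuming the config key added in `internal/config/package.scala` (visible in the diffstat) is the standard one for this feature, and the class named is the default local-disk implementation added here:

```python
from pyspark.sql import SparkSession

# Select the shuffle IO plugin. LocalDiskShuffleDataIO is the default
# local-disk implementation; a custom ShuffleDataIO implementation could be
# substituted here to persist shuffle bytes in an external storage backend.
spark = (SparkSession.builder
         .config("spark.shuffle.sort.io.plugin.class",
                 "org.apache.spark.shuffle.sort.io.LocalDiskShuffleDataIO")
         .getOrCreate())
```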
[spark] branch master updated (44c28d7 -> 121f933)
This is an automated email from the ASF dual-hosted git repository.

vanzin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 44c28d7  [SPARK-28399][ML][PYTHON] implement RobustScaler
 add 121f933  [SPARK-28525][DEPLOY] Allow Launcher to be applied Java options

No new revisions were added by this update.

Summary of changes:
 bin/spark-class                                        | 18 +++---
 conf/spark-env.sh.template                             |  3 +++
 .../src/main/java/org/apache/spark/launcher/Main.java  |  3 +++
 3 files changed, 21 insertions(+), 3 deletions(-)
[spark] branch master updated (70910e6 -> 44c28d7)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 70910e6  [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
 add 44c28d7  [SPARK-28399][ML][PYTHON] implement RobustScaler

No new revisions were added by this update.

Summary of changes:
 docs/ml-features.md                                |  45 ++
 ...erExample.java => JavaRobustScalerExample.java} |  22 +-
 ..._scaler_example.py => robust_scaler_example.py} |  13 +-
 ...alerExample.scala => RobustScalerExample.scala} |  18 +-
 .../org/apache/spark/ml/feature/RobustScaler.scala | 288 +++++
 .../apache/spark/ml/feature/StandardScaler.scala   | 131 +++---
 .../spark/mllib/feature/StandardScaler.scala       | 116 -
 .../spark/ml/feature/RobustScalerSuite.scala       | 209 +++
 .../spark/ml/feature/StandardScalerSuite.scala     |   4 +-
 python/pyspark/ml/feature.py                       | 162
 10 files changed, 878 insertions(+), 130 deletions(-)
 copy examples/src/main/java/org/apache/spark/examples/ml/{JavaStandardScalerExample.java => JavaRobustScalerExample.java} (73%)
 copy examples/src/main/python/ml/{standard_scaler_example.py => robust_scaler_example.py} (75%)
 copy examples/src/main/scala/org/apache/spark/examples/ml/{StandardScalerExample.scala => RobustScalerExample.scala} (79%)
 create mode 100644 mllib/src/main/scala/org/apache/spark/ml/feature/RobustScaler.scala
 create mode 100644 mllib/src/test/scala/org/apache/spark/ml/feature/RobustScalerSuite.scala
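A minimal PySpark sketch of the new scaler, assuming the Python API added in this change follows the usual Estimator fit/transform pattern (column names and data are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import RobustScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 100.0]),),
     (Vectors.dense([2.0, 200.0]),),
     (Vectors.dense([3.0, 10000.0]),)],  # second feature has an outlier
    ["features"])

# RobustScaler scales with quantile-range statistics (median/IQR) rather
# than mean/stddev, so the outlier barely affects the fitted scaling.
scaler = RobustScaler(inputCol="features", outputCol="scaled")
model = scaler.fit(df)
model.transform(df).show(truncate=False)
```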
[spark] branch master updated: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
This is an automated email from the ASF dual-hosted git repository.

irashid pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 70910e6  [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
70910e6 is described below

commit 70910e6ad00f0de4075217d5305d87a477ff1dc4
Author: pgandhi
AuthorDate: Tue Jul 30 09:54:51 2019 -0500

    [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… more efficiently

    This PR improves the performance of scheduling speculative tasks to be O(1) instead of O(numSpeculativeTasks), using the same approach used for scheduling regular tasks. The performance of this method is particularly important because a lock is held on the TaskSchedulerImpl, which is a bottleneck for all scheduling operations. We ran a join query on a large dataset with speculation enabled, and out of 10 tasks for the ShuffleMapStage, the maximum number of speculatable tasks that wa [...]

    In particular, this works by storing a separate stack of tasks by executor, node, and rack locality preferences. Then, when trying to schedule a speculative task, rather than scanning all speculative tasks to find ones which match the given executor (or node, or rack) preference, we can jump to a quick check of tasks matching the resource offer. This technique was already used for regular tasks -- this change refactors the code to allow sharing between regular and speculative task execution.

    ## What changes were proposed in this pull request?

    Split the main queue "speculatableTasks" into 5 separate queues based on locality preference, similar to how normal tasks are enqueued. Thus, the "dequeueSpeculativeTask" method avoids performing locality checks for each task at runtime and simply returns the preferable task to be executed.

    ## How was this patch tested?

    We ran a Spark job that performed a join on a 10 TB dataset to test the code change.

    Original Code:
    https://user-images.githubusercontent.com/8190/51873321-572df280-2322-11e9-9149-0aae08d5edc6.png

    Optimized Code:
    https://user-images.githubusercontent.com/8190/51873343-6745d200-2322-11e9-947b-2cfd0f06bcab.png

    As you can see, the run time of the ShuffleMapStage came down from roughly 40 min to 6 min, reducing the overall running time of the Spark job by a significant amount.

    Another example for the same job:

    Original Code:
    https://user-images.githubusercontent.com/8190/51873355-70cf3a00-2322-11e9-9c3a-af035449a306.png

    Optimized Code:
    https://user-images.githubusercontent.com/8190/51873367-7dec2900-2322-11e9-8d07-1b1b49285f71.png

    Closes #23677 from pgandhi999/SPARK-26755.

    Lead-authored-by: pgandhi
    Co-authored-by: pgandhi
    Signed-off-by: Imran Rashid
---
 .../apache/spark/scheduler/TaskSetManager.scala    | 292 -
 .../scheduler/OutputCommitCoordinatorSuite.scala   |  12 +-
 .../spark/scheduler/TaskSetManagerSuite.scala      | 122 -
 3 files changed, 242 insertions(+), 184 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
index e7645fc..79a1afc 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
@@ -131,37 +131,17 @@ private[spark] class TaskSetManager(
   // same time for a barrier stage.
   private[scheduler] def isBarrier = taskSet.tasks.nonEmpty && taskSet.tasks(0).isBarrier
 
-  // Set of pending tasks for each executor. These collections are actually
-  // treated as stacks, in which new tasks are added to the end of the
-  // ArrayBuffer and removed from the end. This makes it faster to detect
-  // tasks that repeatedly fail because whenever a task failed, it is put
-  // back at the head of the stack. These collections may contain duplicates
-  // for two reasons:
-  // (1): Tasks are only removed lazily; when a task is launched, it remains
-  // in all the pending lists except the one that it was launched from.
-  // (2): Tasks may be re-added to these lists multiple times as a result
-  // of failures.
-  // Duplicates are handled in dequeueTaskFromList, which ensures that a
-  // task hasn't already started running before launching it.
-  private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]
-
-  // Set of pending tasks for each host. Similar to pendingTasksForExecutor,
-  // but at host level.
-  private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]
-
-  // Set of pending tasks for each rack -- similar to the above.
-  private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]
-
-
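A hedged Python sketch of the data-structure idea behind the refactor; the names are illustrative rather than Spark's (the real code keeps Scala HashMaps of ArrayBuffers inside TaskSetManager), and `is_running` stands in for the lazy-removal check in `dequeueTaskFromList`:

```python
from collections import defaultdict

class PendingTasks:
    """Per-locality stacks of pending task indexes; one structure now
    serves both regular and speculative tasks."""
    def __init__(self):
        self.for_executor = defaultdict(list)  # executorId -> task indexes
        self.for_host = defaultdict(list)      # host -> task indexes
        self.no_prefs = []                     # tasks without locality prefs
        self.for_rack = defaultdict(list)      # rack -> task indexes
        self.all = []                          # every pending task

def dequeue_for_executor(pending, executor_id, is_running):
    """Amortized O(1) dequeue for a resource offer on one executor.
    Entries may be stale duplicates, so skip tasks that already started."""
    stack = pending.for_executor[executor_id]
    while stack:
        index = stack.pop()
        if not is_running(index):
            return index
    return None
```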
[spark] branch master updated: [SPARK-28071][SQL][TEST] Port strings.sql
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2656c9d  [SPARK-28071][SQL][TEST] Port strings.sql
2656c9d is described below

commit 2656c9d304b59584c331b923e8536e4093d83f81
Author: Yuming Wang
AuthorDate: Tue Jul 30 18:54:14 2019 +0900

    [SPARK-28071][SQL][TEST] Port strings.sql

    ## What changes were proposed in this pull request?

    This PR ports strings.sql from the PostgreSQL regression tests:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/strings.sql

    The expected results can be found at:
    https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/strings.out

    When porting the test cases, we found nine PostgreSQL-specific features that do not exist in Spark SQL:

    [SPARK-28076](https://issues.apache.org/jira/browse/SPARK-28076): Support regular expression substring
    [SPARK-28078](https://issues.apache.org/jira/browse/SPARK-28078): Add support for the other 4 REGEXP functions
    [SPARK-28412](https://issues.apache.org/jira/browse/SPARK-28412): OVERLAY function support byte array
    [SPARK-28083](https://issues.apache.org/jira/browse/SPARK-28083): ANSI SQL: LIKE predicate: ESCAPE clause
    [SPARK-28087](https://issues.apache.org/jira/browse/SPARK-28087): Add support for split_part
    [SPARK-28122](https://issues.apache.org/jira/browse/SPARK-28122): Missing `sha224`/`sha256`/`sha384`/`sha512` functions
    [SPARK-28123](https://issues.apache.org/jira/browse/SPARK-28123): Add support for string functions: btrim
    [SPARK-28448](https://issues.apache.org/jira/browse/SPARK-28448): Implement ILIKE operator
    [SPARK-28449](https://issues.apache.org/jira/browse/SPARK-28449): Missing escape_string_warning and standard_conforming_strings config

    We also found five inconsistent behaviors:

    [SPARK-27952](https://issues.apache.org/jira/browse/SPARK-27952): String Functions: regexp_replace is not compatible
    [SPARK-28121](https://issues.apache.org/jira/browse/SPARK-28121): decode cannot accept 'escape' as charset
    [SPARK-27930](https://issues.apache.org/jira/browse/SPARK-27930): Replace `strpos` with `locate` or `position` in Spark SQL
    [SPARK-27930](https://issues.apache.org/jira/browse/SPARK-27930): Replace `to_hex` with `hex` in Spark SQL
    [SPARK-28451](https://issues.apache.org/jira/browse/SPARK-28451): `substr` returns different values

    ## How was this patch tested?

    N/A

    Closes #24923 from wangyum/SPARK-28071.

    Authored-by: Yuming Wang
    Signed-off-by: Takeshi Yamamuro
---
 .../resources/sql-tests/inputs/pgSQL/strings.sql   | 660 +++
 .../sql-tests/results/pgSQL/strings.sql.out        | 718 +
 2 files changed, 1378 insertions(+)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/pgSQL/strings.sql b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/strings.sql
new file mode 100644
index 000..a684428
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/pgSQL/strings.sql
@@ -0,0 +1,660 @@
+--
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+-- STRINGS
+-- https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/strings.sql
+-- Test various data entry syntaxes.
+--
+
+-- SQL string continuation syntax
+-- E021-03 character string literals
+SELECT 'first line'
+' - next line'
+ ' - third line'
+ AS `Three lines to one`;
+
+-- Spark SQL support this string continuation syntax
+-- illegal string continuation syntax
+SELECT 'first line'
+' - next line' /* this comment is not allowed here */
+' - third line'
+ AS `Illegal comment within continuation`;
+
+-- [SPARK-28447] ANSI SQL: Unicode escapes in literals
+-- Unicode escapes
+-- SET standard_conforming_strings TO on;
+
+-- SELECT U&'d\0061t\+61' AS U&"d\0061t\+61";
+-- SELECT U&'d!0061t\+61' UESCAPE '!' AS U&"d*0061t\+61" UESCAPE '*';
+
+-- SELECT U&' \' UESCAPE '!' AS "tricky";
+-- SELECT 'tricky' AS U&"\" UESCAPE '!';
+
+-- SELECT U&'wrong: \061';
+-- SELECT U&'wrong: \+0061';
+-- SELECT U&'wrong: +0061' UESCAPE '+';
+
+-- SET standard_conforming_strings TO off;
+
+-- SELECT U&'d\0061t\+61' AS U&"d\0061t\+61";
+-- SELECT U&'d!0061t\+61' UESCAPE '!' AS U&"d*0061t\+61" UESCAPE '*';
+
+-- SELECT U&' \' UESCAPE '!' AS "tricky";
+-- SELECT 'tricky' AS U&"\" UESCAPE '!';
+
+-- SELECT U&'wrong: \061';
+-- SELECT U&'wrong: \+0061';
+-- SELECT U&'wrong: +0061' UESCAPE '+';
+
+-- RESET standard_conforming_strings;
+
+-- Spark SQL only support escape mode
+-- bytea
+-- SET bytea_output TO hex;
+-- SELECT E'\\xDeAdBeEf'::bytea;
+-- SELECT E'\\x De Ad Be Ef '::bytea;
+-- SELECT E'\\xDeAdBeE'::bytea;
+-- SELECT E'\\xDeAdBeEx'::bytea;
+-- SELECT
[spark] branch master updated: [SPARK-28178][SQL] DataSourceV2: DataFrameWriter.insertInto
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 749b1d3  [SPARK-28178][SQL] DataSourceV2: DataFrameWriter.insertInto
749b1d3 is described below

commit 749b1d3a45584554a2fd8c47e232f5095316e10c
Author: John Zhuge
AuthorDate: Tue Jul 30 17:22:33 2019 +0800

    [SPARK-28178][SQL] DataSourceV2: DataFrameWriter.insertInto

    ## What changes were proposed in this pull request?

    Support multiple catalogs in the following InsertInto use cases:

    - DataFrameWriter.insertInto("catalog.db.tbl")

    Support matrix:

    SaveMode  | Partitioned Table | Partition Overwrite Mode | Action
    --------- | ----------------- | ------------------------ | ------
    Append    | *                 | *                        | AppendData
    Overwrite | no                | *                        | OverwriteByExpression(true)
    Overwrite | yes               | STATIC                   | OverwriteByExpression(true)
    Overwrite | yes               | DYNAMIC                  | OverwritePartitionsDynamic

    ## How was this patch tested?

    New tests. All existing catalyst and sql/core tests.

    Closes #24980 from jzhuge/SPARK-28178-pr.

    Authored-by: John Zhuge
    Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/DataFrameWriter.scala     |  48 -
 .../sources/v2/DataSourceV2DataFrameSuite.scala    | 107 +
 2 files changed, 151 insertions(+), 4 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index b7b1390..549c54f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -22,16 +22,19 @@ import java.util.{Locale, Properties, UUID}
 import scala.collection.JavaConverters._
 
 import org.apache.spark.annotation.Stable
+import org.apache.spark.sql.catalog.v2.{CatalogPlugin, Identifier}
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, UnresolvedRelation}
 import org.apache.spark.sql.catalyst.catalog._
 import org.apache.spark.sql.catalyst.expressions.Literal
-import org.apache.spark.sql.catalyst.plans.logical.{AppendData, InsertIntoTable, LogicalPlan, OverwriteByExpression}
+import org.apache.spark.sql.catalyst.plans.logical.{AppendData, InsertIntoTable, LogicalPlan, OverwriteByExpression, OverwritePartitionsDynamic}
 import org.apache.spark.sql.execution.SQLExecution
 import org.apache.spark.sql.execution.command.DDLUtils
 import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource, DataSourceUtils, LogicalRelation}
 import org.apache.spark.sql.execution.datasources.v2._
+import org.apache.spark.sql.internal.SQLConf.PartitionOverwriteMode
 import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister}
+import org.apache.spark.sql.sources.BaseRelation
 import org.apache.spark.sql.sources.v2._
 import org.apache.spark.sql.sources.v2.TableCapability._
 import org.apache.spark.sql.types.StructType
@@ -356,10 +359,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    * @since 1.4.0
    */
   def insertInto(tableName: String): Unit = {
-    insertInto(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))
-  }
+    import df.sparkSession.sessionState.analyzer.{AsTableIdentifier, CatalogObjectIdentifier}
 
-  private def insertInto(tableIdent: TableIdentifier): Unit = {
     assertNotBucketed("insertInto")
 
     if (partitioningColumns.isDefined) {
@@ -370,6 +371,45 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
       )
     }
 
+    df.sparkSession.sessionState.sqlParser.parseMultipartIdentifier(tableName) match {
+      case CatalogObjectIdentifier(Some(catalog), ident) =>
+        insertInto(catalog, ident)
+      case AsTableIdentifier(tableIdentifier) =>
+        insertInto(tableIdentifier)
+    }
+  }
+
+  private def insertInto(catalog: CatalogPlugin, ident: Identifier): Unit = {
+    import org.apache.spark.sql.catalog.v2.CatalogV2Implicits._
+
+    val table = DataSourceV2Relation.create(catalog.asTableCatalog.loadTable(ident))
+
+    val command = modeForDSV2 match {
+      case SaveMode.Append =>
+        AppendData.byName(table, df.logicalPlan)
+
+      case SaveMode.Overwrite =>
+        val conf = df.sparkSession.sessionState.conf
+        val dynamicPartitionOverwrite = table.table.partitioning.size > 0 &&
+          conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
+
+        if (dynamicPartitionOverwrite) {
+          OverwritePartitionsDynamic.byName(table, df.logicalPlan)
+        } else {
+          OverwriteByExpression.byName(table, df.logicalPlan, Literal(true))
+        }
+
+      case other =>
+        throw new AnalysisException(s"insertInto does not support $other mode, " +
+          s"please use Append or
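A short PySpark sketch of the use case this enables, assuming a v2 catalog has been registered as `testcat` (e.g. via a `spark.sql.catalog.testcat` config pointing at an implementation class); names are illustrative:

```python
# Append into a v2 catalog table through a multi-part identifier.
df.write.insertInto("testcat.ns.tbl")

# Overwrite: per the support matrix above, a partitioned table with
# dynamic partition-overwrite mode yields OverwritePartitionsDynamic,
# otherwise OverwriteByExpression(true).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("testcat.ns.tbl")
```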
[spark] branch branch-2.3 updated (416aba4 -> 998ac04)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-2.3
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 416aba4  [SPARK-25474][SQL][2.3] Support `spark.sql.statistics.fallBackToHdfs` in data source tables
 add 998ac04  [SPARK-28156][SQL][BACKPORT-2.3] Self-join should not miss cached view

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala     |  2 -
 .../sql/catalyst/analysis/CheckAnalysis.scala      | 43
 .../apache/spark/sql/catalyst/analysis/view.scala  | 80 +++---
 .../plans/logical/basicLogicalOperators.scala      |  3 +
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 31 +
 5 files changed, 103 insertions(+), 56 deletions(-)
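A hedged PySpark sketch of the scenario this backport addresses (illustrative only; the actual fix lives in the analyzer's view resolution):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(10).createOrReplaceTempView("v")
spark.catalog.cacheTable("v")

# A self-join on a cached view: before SPARK-28156, one side of the join
# could miss the cache after view resolution; both scans should now hit it.
spark.sql("SELECT a.id FROM v a JOIN v b ON a.id = b.id").collect()
```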
[spark] branch master updated (d530d86 -> df84bfe)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from d530d86  [SPARK-28326][SQL][TEST] Port join.sql
 add df84bfe  [SPARK-28406][SQL][TEST] Port union.sql

No new revisions were added by this update.

Summary of changes:
 .../resources/sql-tests/inputs/pgSQL/union.sql     | 472 +
 .../sql-tests/results/pgSQL/union.sql.out          | 771 +
 2 files changed, 1243 insertions(+)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/union.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/union.sql.out
[spark] branch master updated (196a4d7 -> d530d86)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 196a4d7  [SPARK-28556][SQL] QueryExecutionListener should also notify Error
 add d530d86  [SPARK-28326][SQL][TEST] Port join.sql

No new revisions were added by this update.

Summary of changes:
 .../test/resources/sql-tests/inputs/pgSQL/join.sql | 2079 ++++
 .../resources/sql-tests/results/pgSQL/join.sql.out | 3440 ++++
 2 files changed, 5519 insertions(+)
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/pgSQL/join.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out