spark git commit: [MINOR][DOC] automatic type inference supports also Date and Timestamp
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3d6d88996 -> 8b572116f

[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido

Closes #19628 from mgaido91/SPARK-22398.

(cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8b572116
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8b572116
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8b572116

Branch: refs/heads/branch-2.1
Commit: 8b572116f8b9220f8c041c2f2f5c239fed947477
Parents: 3d6d889
Author: Marco Gaido
Authored: Thu Nov 2 09:30:03 2017 +0900
Committer: hyukjinkwon
Committed: Thu Nov 2 09:30:35 2017 +0900

--
 docs/sql-programming-guide.md | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/8b572116/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e72a0be..92fa046 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -682,10 +682,11 @@ root
 {% endhighlight %}
 
 Notice that the data types of the partitioning columns are automatically inferred. Currently,
-numeric data types and string type are supported. Sometimes users may not want to automatically
-infer the data types of the partitioning columns. For these use cases, the automatic type inference
-can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
-`true`. When type inference is disabled, string type will be used for the partitioning columns.
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either
spark git commit: [MINOR][DOC] automatic type inference supports also Date and Timestamp
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 ab87a92a1 -> c311c5e79

[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido

Closes #19628 from mgaido91/SPARK-22398.

(cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58)
Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c311c5e7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c311c5e7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c311c5e7

Branch: refs/heads/branch-2.2
Commit: c311c5e7976a60a0f67d913ca10fe72a8559148f
Parents: ab87a92
Author: Marco Gaido
Authored: Thu Nov 2 09:30:03 2017 +0900
Committer: hyukjinkwon
Committed: Thu Nov 2 09:30:20 2017 +0900

--
 docs/sql-programming-guide.md | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/c311c5e7/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 9a54adc..8cd4d05 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -778,10 +778,11 @@ root
 {% endhighlight %}
 
 Notice that the data types of the partitioning columns are automatically inferred. Currently,
-numeric data types and string type are supported. Sometimes users may not want to automatically
-infer the data types of the partitioning columns. For these use cases, the automatic type inference
-can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
-`true`. When type inference is disabled, string type will be used for the partitioning columns.
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either
spark git commit: [MINOR][DOC] automatic type inference supports also Date and Timestamp
Repository: spark
Updated Branches:
  refs/heads/master d43e1f06b -> b04eefae4

[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido

Closes #19628 from mgaido91/SPARK-22398.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b04eefae
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b04eefae
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b04eefae

Branch: refs/heads/master
Commit: b04eefae49b96e2ef5a8d75334db29ef4e19ce58
Parents: d43e1f0
Author: Marco Gaido
Authored: Thu Nov 2 09:30:03 2017 +0900
Committer: hyukjinkwon
Committed: Thu Nov 2 09:30:03 2017 +0900

--
 docs/sql-programming-guide.md | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/b04eefae/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 639a8ea..ce37787 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -800,10 +800,11 @@ root
 {% endhighlight %}
 
 Notice that the data types of the partitioning columns are automatically inferred. Currently,
-numeric data types and string type are supported. Sometimes users may not want to automatically
-infer the data types of the partitioning columns. For these use cases, the automatic type inference
-can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
-`true`. When type inference is disabled, string type will be used for the partitioning columns.
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either
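For readers of the doc change above, a minimal sketch (not part of the commit) of the behavior it documents; the `/tmp/partitioned_table` path and app name are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

object PartitionInferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-type-inference")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Write a table partitioned by a date column; partition values end up
    // encoded in directory names like dt=2017-11-02.
    Seq(("a", java.sql.Date.valueOf("2017-11-02")))
      .toDF("value", "dt")
      .write.partitionBy("dt").parquet("/tmp/partitioned_table")

    // On read, `dt` should be inferred as DateType rather than StringType,
    // per SPARK-17388 (since 2.1.0), as the doc fix states.
    spark.read.parquet("/tmp/partitioned_table").printSchema()

    // Disabling inference makes partition columns plain strings again.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    spark.read.parquet("/tmp/partitioned_table").printSchema()
  }
}
```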
spark git commit: [MINOR] Data source v2 docs update.
Repository: spark
Updated Branches:
  refs/heads/master 1ffe03d9e -> d43e1f06b

[MINOR] Data source v2 docs update.

## What changes were proposed in this pull request?

This patch includes some doc updates for data source API v2. I was reading the code and noticed some minor issues.

## How was this patch tested?

This is a doc only change.

Author: Reynold Xin

Closes #19626 from rxin/dsv2-update.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d43e1f06
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d43e1f06
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d43e1f06

Branch: refs/heads/master
Commit: d43e1f06bd545d00bfcaf1efb388b469effd5d64
Parents: 1ffe03d
Author: Reynold Xin
Authored: Wed Nov 1 18:39:15 2017 +0100
Committer: Reynold Xin
Committed: Wed Nov 1 18:39:15 2017 +0100

--
 .../org/apache/spark/sql/sources/v2/DataSourceV2.java   |  9 -
 .../org/apache/spark/sql/sources/v2/WriteSupport.java   |  4 ++--
 .../spark/sql/sources/v2/reader/DataSourceV2Reader.java | 10 +-
 .../v2/reader/SupportsPushDownCatalystFilters.java      |  2 --
 .../sql/sources/v2/reader/SupportsScanUnsafeRow.java    |  2 --
 .../spark/sql/sources/v2/writer/DataSourceV2Writer.java | 11 +++
 .../apache/spark/sql/sources/v2/writer/DataWriter.java  | 10 +-
 7 files changed, 19 insertions(+), 29 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d43e1f06/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
index dbcbe32..6234071 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
@@ -20,12 +20,11 @@ package org.apache.spark.sql.sources.v2;
 
 import org.apache.spark.annotation.InterfaceStability;
 
 /**
- * The base interface for data source v2. Implementations must have a public, no arguments
- * constructor.
+ * The base interface for data source v2. Implementations must have a public, 0-arg constructor.
  *
- * Note that this is an empty interface, data source implementations should mix-in at least one of
- * the plug-in interfaces like {@link ReadSupport}. Otherwise it's just a dummy data source which is
- * un-readable/writable.
+ * Note that this is an empty interface. Data source implementations should mix-in at least one of
+ * the plug-in interfaces like {@link ReadSupport} and {@link WriteSupport}. Otherwise it's just
+ * a dummy data source which is un-readable/writable.
  */
 @InterfaceStability.Evolving
 public interface DataSourceV2 {}

http://git-wip-us.apache.org/repos/asf/spark/blob/d43e1f06/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
index a8a9615..8fdfdfd 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
@@ -36,8 +36,8 @@ public interface WriteSupport {
    * sources can return None if there is no writing needed to be done according to the save mode.
    *
    * @param jobId A unique string for the writing job. It's possible that there are many writing
-   *              jobs running at the same time, and the returned {@link DataSourceV2Writer} should
-   *              use this job id to distinguish itself with writers of other jobs.
+   *              jobs running at the same time, and the returned {@link DataSourceV2Writer} can
+   *              use this job id to distinguish itself from other jobs.
    * @param schema the schema of the data to be written.
    * @param mode the save mode which determines what to do when the data are already in this data
    *             source, please refer to {@link SaveMode} for more details.

http://git-wip-us.apache.org/repos/asf/spark/blob/d43e1f06/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
index 5989a4a..88c3219 100644
---
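As a companion to the Javadoc above, a hypothetical sketch of the mix-in pattern it describes: `DataSourceV2` stays an empty marker, and a source opts into reading/writing via `ReadSupport`/`WriteSupport`. The `createWriter` parameter list mirrors the `@param` docs quoted in the diff; the `createReader` signature and the `DataSourceV2Options` type are assumptions about the API as of this commit's era, not something the quoted text confirms.

```scala
import java.util.Optional

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options, ReadSupport, WriteSupport}
import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader
import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer
import org.apache.spark.sql.types.StructType

// Public, 0-arg constructor, as the updated Javadoc requires.
class ExampleSource extends DataSourceV2 with ReadSupport with WriteSupport {

  // ReadSupport: build a reader that reports a schema and creates read tasks.
  override def createReader(options: DataSourceV2Options): DataSourceV2Reader =
    ??? // sketch only: a real source returns its DataSourceV2Reader here

  // WriteSupport: may return an empty Optional when the save mode needs no write.
  override def createWriter(
      jobId: String,
      schema: StructType,
      mode: SaveMode,
      options: DataSourceV2Options): Optional[DataSourceV2Writer] =
    Optional.empty()
}
```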
spark git commit: [SPARK-22190][CORE] Add Spark executor task metrics to Dropwizard metrics
Repository: spark
Updated Branches:
  refs/heads/master 444bce1c9 -> 1ffe03d9e

[SPARK-22190][CORE] Add Spark executor task metrics to Dropwizard metrics

## What changes were proposed in this pull request?

This proposed patch is about making Spark executor task metrics available as Dropwizard metrics. This is intended to be of aid in monitoring Spark jobs and when drilling down on performance troubleshooting issues.

## How was this patch tested?

Manually tested on a Spark cluster (see JIRA for an example screenshot).

Author: LucaCanali

Closes #19426 from LucaCanali/SparkTaskMetricsDropWizard.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ffe03d9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ffe03d9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ffe03d9

Branch: refs/heads/master
Commit: 1ffe03d9e87fb784cc8a0bae232c81c7b14deac9
Parents: 444bce1
Author: LucaCanali
Authored: Wed Nov 1 15:40:25 2017 +0100
Committer: Wenchen Fan
Committed: Wed Nov 1 15:40:25 2017 +0100

--
 .../org/apache/spark/executor/Executor.scala   | 41 +
 .../apache/spark/executor/ExecutorSource.scala | 48
 2 files changed, 89 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/1ffe03d9/core/src/main/scala/org/apache/spark/executor/Executor.scala
--
diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 2ecbb74..e3e555e 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -406,6 +406,47 @@ private[spark] class Executor(
         task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
         task.metrics.setResultSerializationTime(afterSerialization - beforeSerialization)
 
+        // Expose task metrics using the Dropwizard metrics system.
+        // Update task metrics counters
+        executorSource.METRIC_CPU_TIME.inc(task.metrics.executorCpuTime)
+        executorSource.METRIC_RUN_TIME.inc(task.metrics.executorRunTime)
+        executorSource.METRIC_JVM_GC_TIME.inc(task.metrics.jvmGCTime)
+        executorSource.METRIC_DESERIALIZE_TIME.inc(task.metrics.executorDeserializeTime)
+        executorSource.METRIC_DESERIALIZE_CPU_TIME.inc(task.metrics.executorDeserializeCpuTime)
+        executorSource.METRIC_RESULT_SERIALIZE_TIME.inc(task.metrics.resultSerializationTime)
+        executorSource.METRIC_SHUFFLE_FETCH_WAIT_TIME
+          .inc(task.metrics.shuffleReadMetrics.fetchWaitTime)
+        executorSource.METRIC_SHUFFLE_WRITE_TIME.inc(task.metrics.shuffleWriteMetrics.writeTime)
+        executorSource.METRIC_SHUFFLE_TOTAL_BYTES_READ
+          .inc(task.metrics.shuffleReadMetrics.totalBytesRead)
+        executorSource.METRIC_SHUFFLE_REMOTE_BYTES_READ
+          .inc(task.metrics.shuffleReadMetrics.remoteBytesRead)
+        executorSource.METRIC_SHUFFLE_REMOTE_BYTES_READ_TO_DISK
+          .inc(task.metrics.shuffleReadMetrics.remoteBytesReadToDisk)
+        executorSource.METRIC_SHUFFLE_LOCAL_BYTES_READ
+          .inc(task.metrics.shuffleReadMetrics.localBytesRead)
+        executorSource.METRIC_SHUFFLE_RECORDS_READ
+          .inc(task.metrics.shuffleReadMetrics.recordsRead)
+        executorSource.METRIC_SHUFFLE_REMOTE_BLOCKS_FETCHED
+          .inc(task.metrics.shuffleReadMetrics.remoteBlocksFetched)
+        executorSource.METRIC_SHUFFLE_LOCAL_BLOCKS_FETCHED
+          .inc(task.metrics.shuffleReadMetrics.localBlocksFetched)
+        executorSource.METRIC_SHUFFLE_BYTES_WRITTEN
+          .inc(task.metrics.shuffleWriteMetrics.bytesWritten)
+        executorSource.METRIC_SHUFFLE_RECORDS_WRITTEN
+          .inc(task.metrics.shuffleWriteMetrics.recordsWritten)
+        executorSource.METRIC_INPUT_BYTES_READ
+          .inc(task.metrics.inputMetrics.bytesRead)
+        executorSource.METRIC_INPUT_RECORDS_READ
+          .inc(task.metrics.inputMetrics.recordsRead)
+        executorSource.METRIC_OUTPUT_BYTES_WRITTEN
+          .inc(task.metrics.outputMetrics.bytesWritten)
+        executorSource.METRIC_OUTPUT_RECORDS_WRITTEN
+          .inc(task.metrics.inputMetrics.recordsRead)
+        executorSource.METRIC_RESULT_SIZE.inc(task.metrics.resultSize)
+        executorSource.METRIC_DISK_BYTES_SPILLED.inc(task.metrics.diskBytesSpilled)
+        executorSource.METRIC_MEMORY_BYTES_SPILLED.inc(task.metrics.memoryBytesSpilled)
+
         // Note: accumulator updates must be collected after TaskMetrics is updated
         val accumUpdates = task.collectAccumulatorUpdates()
         // TODO: do not serialize value twice
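The `ExecutorSource` side of the patch is cut off in this digest. As a rough, hypothetical sketch of its shape: a Spark metrics `Source` registers one named Dropwizard counter per task metric in its `MetricRegistry`, and the `Executor` code shown above increments those counters after each task. The metric names below are illustrative, not the commit's:

```scala
import com.codahale.metrics.{Counter, MetricRegistry}
import org.apache.spark.metrics.source.Source

// Simplified stand-in for the commit's ExecutorSource additions; Source is
// internal to Spark, so this only illustrates the in-tree call pattern.
private[spark] class TaskMetricsSourceSketch extends Source {
  override val sourceName: String = "executor"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // Each counter is registered once; Executor.run() then calls
  // METRIC_CPU_TIME.inc(task.metrics.executorCpuTime), etc., per task.
  val METRIC_CPU_TIME: Counter = metricRegistry.counter(MetricRegistry.name("cpuTime"))
  val METRIC_RUN_TIME: Counter = metricRegistry.counter(MetricRegistry.name("runTime"))
  val METRIC_JVM_GC_TIME: Counter = metricRegistry.counter(MetricRegistry.name("jvmGCTime"))
}
```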
spark git commit: [SPARK-19112][CORE] Support for ZStandard codec
Repository: spark
Updated Branches:
  refs/heads/master 07f390a27 -> 444bce1c9

[SPARK-19112][CORE] Support for ZStandard codec

## What changes were proposed in this pull request?

Using zstd compression for Spark jobs spilling 100s of TBs of data, we could reduce the amount of data written to disk by as much as 50%. This translates to significant latency gain because of reduced disk I/O operations. There is a 2-5% degradation in CPU time because of zstd compression overhead, but for jobs which are bottlenecked by disk I/O, this hit can be taken.

## Benchmark

Please note that this benchmark uses a real-world, compute-heavy production workload spilling TBs of data to disk.

|                      | zstd performance as compared to LZ4 |
| -------------------- | ------------------------------------:|
| spill/shuffle bytes  | -48% |
| cpu time             | +3%  |
| cpu reservation time | -40% |
| latency              | -40% |

## How was this patch tested?

Tested by running a few jobs spilling a large amount of data on the cluster; the amount of intermediate data written to disk was reduced by as much as 50%.

Author: Sital Kedia

Closes #18805 from sitalkedia/skedia/upstream_zstd.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/444bce1c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/444bce1c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/444bce1c

Branch: refs/heads/master
Commit: 444bce1c98c45147fe63e2132e9743a0c5e49598
Parents: 07f390a
Author: Sital Kedia
Authored: Wed Nov 1 14:54:08 2017 +0100
Committer: Herman van Hovell
Committed: Wed Nov 1 14:54:08 2017 +0100

--
 LICENSE                                         |  2 ++
 core/pom.xml                                    |  4 +++
 .../org/apache/spark/io/CompressionCodec.scala  | 36 ++--
 .../apache/spark/io/CompressionCodecSuite.scala | 18 ++
 dev/deps/spark-deps-hadoop-2.6                  |  1 +
 dev/deps/spark-deps-hadoop-2.7                  |  1 +
 docs/configuration.md                           | 20 ++-
 licenses/LICENSE-zstd-jni.txt                   | 26 ++
 licenses/LICENSE-zstd.txt                       | 30
 pom.xml                                         |  5 +++
 10 files changed, 140 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/444bce1c/LICENSE
--
diff --git a/LICENSE b/LICENSE
index 39fe0dc..c2b0d72 100644
--- a/LICENSE
+++ b/LICENSE
@@ -269,6 +269,8 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
      (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE)
      (BSD 3 Clause) DPark (https://github.com/douban/dpark/blob/master/LICENSE)
      (BSD 3 Clause) CloudPickle (https://github.com/cloudpipe/cloudpickle/blob/master/LICENSE)
+     (BSD 2 Clause) Zstd-jni (https://github.com/luben/zstd-jni/blob/master/LICENSE)
+     (BSD license) Zstd (https://github.com/facebook/zstd/blob/v1.3.1/LICENSE)
 
 MIT licenses

http://git-wip-us.apache.org/repos/asf/spark/blob/444bce1c/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index 54f7a34..fa138d3 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -199,6 +199,10 @@
       <artifactId>lz4-java</artifactId>
     </dependency>
     <dependency>
+      <groupId>com.github.luben</groupId>
+      <artifactId>zstd-jni</artifactId>
+    </dependency>
+    <dependency>
       <groupId>org.roaringbitmap</groupId>
       <artifactId>RoaringBitmap</artifactId>

http://git-wip-us.apache.org/repos/asf/spark/blob/444bce1c/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
--
diff --git a/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala b/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
index 27f2e42..7722db5 100644
--- a/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
+++ b/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
@@ -20,6 +20,7 @@ package org.apache.spark.io
 import java.io._
 import java.util.Locale
 
+import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
 import com.ning.compress.lzf.{LZFInputStream, LZFOutputStream}
 import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}
 import org.xerial.snappy.{Snappy, SnappyInputStream, SnappyOutputStream}
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec]
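A minimal usage sketch for the new codec (not part of the diff): shuffle, spill, and broadcast compression is selected through `spark.io.compression.codec`. The short name `zstd` is an assumption consistent with the existing short names such as `lz4` and `snappy` handled in `CompressionCodec.scala`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("zstd-compression-example")
  // Use zstd instead of the lz4 default for shuffle/spill/broadcast data.
  .set("spark.io.compression.codec", "zstd")

val sc = new SparkContext(conf)
```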
spark git commit: [SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with conditional expressions
Repository: spark
Updated Branches:
  refs/heads/master 96798d14f -> 07f390a27

[SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with conditional expressions

## What changes were proposed in this pull request?

Under the current execution mode of Python UDFs, we do not properly support Python UDFs as branch values or else values in a CaseWhen expression. Since fixing this would require a non-trivial change (e.g., #19592) and the issue has a simpler workaround, we should just notify users about it in the documentation.

## How was this patch tested?

Only document change.

Author: Liang-Chi Hsieh

Closes #19617 from viirya/SPARK-22347-3.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07f390a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07f390a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07f390a2

Branch: refs/heads/master
Commit: 07f390a27d7b793291c352a643d4bbd5f47294a6
Parents: 96798d1
Author: Liang-Chi Hsieh
Authored: Wed Nov 1 13:09:35 2017 +0100
Committer: Wenchen Fan
Committed: Wed Nov 1 13:09:35 2017 +0100

--
 python/pyspark/sql/functions.py | 14 ++
 1 file changed, 14 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/07f390a2/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 0d40368..3981549 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2185,6 +2185,13 @@ def udf(f=None, returnType=StringType()):
     duplicate invocations may be eliminated or the function may even be invoked more times than
     it is present in the query.
 
+    .. note:: The user-defined functions do not support conditional execution by using them with
+        SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
+        matter the conditions are met or not. So the output is correct if the functions can be
+        correctly run on all rows without failure. If the functions can cause runtime failure on the
+        rows that do not satisfy the conditions, the suggested workaround is to incorporate the
+        condition logic into the functions.
+
     :param f: python function if used as a standalone function
     :param returnType: a :class:`pyspark.sql.types.DataType` object
@@ -2278,6 +2285,13 @@ def pandas_udf(f=None, returnType=StringType()):
     .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
 
     .. note:: The user-defined function must be deterministic.
+
+    .. note:: The user-defined functions do not support conditional execution by using them with
+        SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
+        matter the conditions are met or not. So the output is correct if the functions can be
+        correctly run on all rows without failure. If the functions can cause runtime failure on the
+        rows that do not satisfy the conditions, the suggested workaround is to incorporate the
+        condition logic into the functions.
     """
     return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF)
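Since the documented pitfall is specific to Python UDFs, here is a hypothetical PySpark illustration of the note and its workaround (this example is not part of the commit):

```python
import math

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-when-pitfall").getOrCreate()
df = spark.createDataFrame([(4.0,), (-1.0,)], ["x"])

# Pitfall: the UDF may run on every row, even those where the `when`
# condition is false, so math.sqrt(-1.0) can raise ValueError at runtime.
risky_sqrt = F.udf(lambda x: math.sqrt(x), DoubleType())

# Workaround from the note: move the condition into the function itself,
# so rows that fail the condition produce None instead of an error.
safe_sqrt = F.udf(lambda x: math.sqrt(x) if x >= 0 else None, DoubleType())

df.select(F.when(df.x >= 0, safe_sqrt(df.x)).alias("sqrt_x")).show()
```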
spark git commit: [SPARK-22172][CORE] Worker hangs when the external shuffle service port is already in use
Repository: spark
Updated Branches:
  refs/heads/master 556b5d215 -> 96798d14f

[SPARK-22172][CORE] Worker hangs when the external shuffle service port is already in use

## What changes were proposed in this pull request?

Handle NonFatal exceptions while starting the external shuffle service: if any NonFatal exception occurs, log it and continue without the external shuffle service.

## How was this patch tested?

Verified manually: it logs the exception and continues to serve without the external shuffle service when a BindException occurs.

Author: Devaraj K

Closes #19396 from devaraj-kavali/SPARK-22172.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96798d14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96798d14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96798d14

Branch: refs/heads/master
Commit: 96798d14f07208796fa0a90af0ab369879bacd6c
Parents: 556b5d2
Author: Devaraj K
Authored: Wed Nov 1 18:07:39 2017 +0800
Committer: jerryshao
Committed: Wed Nov 1 18:07:39 2017 +0800

--
 .../scala/org/apache/spark/deploy/worker/Worker.scala | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/96798d14/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
index ed5fa4b..3962d42 100755
--- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
@@ -199,7 +199,7 @@ private[deploy] class Worker(
     logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
     logInfo("Spark home: " + sparkHome)
     createWorkDir()
-    shuffleService.startIfEnabled()
+    startExternalShuffleService()
     webUi = new WorkerWebUI(this, workDir, webUiPort)
     webUi.bind()
 
@@ -367,6 +367,16 @@ private[deploy] class Worker(
     }
   }
 
+  private def startExternalShuffleService() {
+    try {
+      shuffleService.startIfEnabled()
+    } catch {
+      case e: Exception =>
+        logError("Failed to start external shuffle service", e)
+        System.exit(1)
+    }
+  }
+
   private def sendRegisterMessageToMaster(masterEndpoint: RpcEndpointRef): Unit = {
     masterEndpoint.send(RegisterWorker(
       workerId,
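For context (not part of the diff), the service guarded by `startIfEnabled()` is enabled per worker through configuration; a sketch, assuming the standard shuffle-service settings:

```scala
import org.apache.spark.SparkConf

// With the patch above, a port conflict on the shuffle service port now
// surfaces as a logged error at Worker startup instead of a silent hang.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true") // opt in to the external shuffle service
  .set("spark.shuffle.service.port", "7337")    // default port; a conflict here triggered SPARK-22172
```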
spark git commit: [SPARK-5484][FOLLOWUP] PeriodicRDDCheckpointer doc cleanup
Repository: spark
Updated Branches:
  refs/heads/master 73231860b -> 556b5d215

[SPARK-5484][FOLLOWUP] PeriodicRDDCheckpointer doc cleanup

## What changes were proposed in this pull request?

PeriodicRDDCheckpointer was already moved out of mllib in SPARK-5484.

## How was this patch tested?

Existing tests.

Author: Zheng RuiFeng

Closes #19618 from zhengruifeng/checkpointer_doc.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/556b5d21
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/556b5d21
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/556b5d21

Branch: refs/heads/master
Commit: 556b5d21512b17027a6e451c6a82fb428940e95a
Parents: 7323186
Author: Zheng RuiFeng
Authored: Wed Nov 1 08:45:11 2017 +0000
Committer: Sean Owen
Committed: Wed Nov 1 08:45:11 2017 +0000

--
 .../scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala | 2 --
 1 file changed, 2 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/556b5d21/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
index facbb83..5e181a9 100644
--- a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
@@ -73,8 +73,6 @@ import org.apache.spark.util.PeriodicCheckpointer
  *
  * @param checkpointInterval  RDDs will be checkpointed at this interval
  * @tparam T  RDD element type
- *
- * TODO: Move this out of MLlib?
  */
 private[spark] class PeriodicRDDCheckpointer[T](
     checkpointInterval: Int,
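For orientation (not from the commit), a hypothetical sketch of how this `private[spark]` helper is used inside Spark: construct it with an interval and a `SparkContext`, then call `update` once per new RDD in an iterative algorithm.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.util.PeriodicRDDCheckpointer

// Sketch of an iterative computation that checkpoints every 10th iteration to
// keep lineage (and recovery cost) bounded. Assumes sc.setCheckpointDir(...)
// has already been called, since checkpointing needs a checkpoint directory.
def iterate(sc: SparkContext, initial: RDD[Double]): RDD[Double] = {
  val checkpointer = new PeriodicRDDCheckpointer[Double](10, sc)
  var current = initial
  for (_ <- 1 to 100) {
    current = current.map(_ * 0.9).cache()
    checkpointer.update(current) // checkpoints when the interval is hit
  }
  current
}
```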