spark git commit: [MINOR][DOC] automatic type inference supports also Date and Timestamp

2017-11-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 3d6d88996 -> 8b572116f


[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string
are supported in type inference for partition columns, while Date and Timestamp
have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido 

Closes #19628 from mgaido91/SPARK-22398.

(cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8b572116
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8b572116
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8b572116

Branch: refs/heads/branch-2.1
Commit: 8b572116f8b9220f8c041c2f2f5c239fed947477
Parents: 3d6d889
Author: Marco Gaido 
Authored: Thu Nov 2 09:30:03 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Nov 2 09:30:35 2017 +0900

--
 docs/sql-programming-guide.md | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8b572116/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e72a0be..92fa046 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -682,10 +682,11 @@ root
 {% endhighlight %}
 
 Notice that the data types of the partitioning columns are automatically inferred. Currently,
-numeric data types and string type are supported. Sometimes users may not want to automatically
-infer the data types of the partitioning columns. For these use cases, the automatic type inference
-can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
-`true`. When type inference is disabled, string type will be used for the partitioning columns.
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either
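
As a side note for readers of the updated passage, here is a minimal PySpark
sketch of the behaviour it describes. The table location and the `dt` partition
column are hypothetical, chosen only to illustrate the date case covered by this
fix:

```python
# Assume a hypothetical Parquet table laid out as path/to/table/dt=2017-11-01/...
# With inference enabled (the default), the partition column is read back as a date;
# with it disabled, it is read back as a string.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "true")
spark.read.parquet("path/to/table").printSchema()   # ... |-- dt: date

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("path/to/table").printSchema()   # ... |-- dt: string
```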


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][DOC] automatic type inference supports also Date and Timestamp

2017-11-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 ab87a92a1 -> c311c5e79


[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string
are supported in type inference for partition columns, while Date and Timestamp
have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido 

Closes #19628 from mgaido91/SPARK-22398.

(cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58)
Signed-off-by: hyukjinkwon 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c311c5e7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c311c5e7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c311c5e7

Branch: refs/heads/branch-2.2
Commit: c311c5e7976a60a0f67d913ca10fe72a8559148f
Parents: ab87a92
Author: Marco Gaido 
Authored: Thu Nov 2 09:30:03 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Nov 2 09:30:20 2017 +0900

--
 docs/sql-programming-guide.md | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c311c5e7/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 9a54adc..8cd4d05 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -778,10 +778,11 @@ root
 {% endhighlight %}
 
 Notice that the data types of the partitioning columns are automatically inferred. Currently,
-numeric data types and string type are supported. Sometimes users may not want to automatically
-infer the data types of the partitioning columns. For these use cases, the automatic type inference
-can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
-`true`. When type inference is disabled, string type will be used for the partitioning columns.
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR][DOC] automatic type inference supports also Date and Timestamp

2017-11-01 Thread gurwls223
Repository: spark
Updated Branches:
  refs/heads/master d43e1f06b -> b04eefae4


[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string
are supported in type inference for partition columns, while Date and Timestamp
have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido 

Closes #19628 from mgaido91/SPARK-22398.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b04eefae
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b04eefae
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b04eefae

Branch: refs/heads/master
Commit: b04eefae49b96e2ef5a8d75334db29ef4e19ce58
Parents: d43e1f0
Author: Marco Gaido 
Authored: Thu Nov 2 09:30:03 2017 +0900
Committer: hyukjinkwon 
Committed: Thu Nov 2 09:30:03 2017 +0900

--
 docs/sql-programming-guide.md | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b04eefae/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 639a8ea..ce37787 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -800,10 +800,11 @@ root
 {% endhighlight %}
 
 Notice that the data types of the partitioning columns are automatically inferred. Currently,
-numeric data types and string type are supported. Sometimes users may not want to automatically
-infer the data types of the partitioning columns. For these use cases, the automatic type inference
-can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to
-`true`. When type inference is disabled, string type will be used for the partitioning columns.
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which is default to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR] Data source v2 docs update.

2017-11-01 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 1ffe03d9e -> d43e1f06b


[MINOR] Data source v2 docs update.

## What changes were proposed in this pull request?
This patch includes some doc updates for data source API v2. I was reading the 
code and noticed some minor issues.

## How was this patch tested?
This is a doc only change.

Author: Reynold Xin 

Closes #19626 from rxin/dsv2-update.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d43e1f06
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d43e1f06
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d43e1f06

Branch: refs/heads/master
Commit: d43e1f06bd545d00bfcaf1efb388b469effd5d64
Parents: 1ffe03d
Author: Reynold Xin 
Authored: Wed Nov 1 18:39:15 2017 +0100
Committer: Reynold Xin 
Committed: Wed Nov 1 18:39:15 2017 +0100

--
 .../org/apache/spark/sql/sources/v2/DataSourceV2.java|  9 -
 .../org/apache/spark/sql/sources/v2/WriteSupport.java|  4 ++--
 .../spark/sql/sources/v2/reader/DataSourceV2Reader.java  | 10 +-
 .../v2/reader/SupportsPushDownCatalystFilters.java   |  2 --
 .../sql/sources/v2/reader/SupportsScanUnsafeRow.java |  2 --
 .../spark/sql/sources/v2/writer/DataSourceV2Writer.java  | 11 +++
 .../apache/spark/sql/sources/v2/writer/DataWriter.java   | 10 +-
 7 files changed, 19 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d43e1f06/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
index dbcbe32..6234071 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2.java
@@ -20,12 +20,11 @@ package org.apache.spark.sql.sources.v2;
 import org.apache.spark.annotation.InterfaceStability;
 
 /**
- * The base interface for data source v2. Implementations must have a public, no arguments
- * constructor.
+ * The base interface for data source v2. Implementations must have a public, 0-arg constructor.
  *
- * Note that this is an empty interface, data source implementations should mix-in at least one of
- * the plug-in interfaces like {@link ReadSupport}. Otherwise it's just a dummy data source which is
- * un-readable/writable.
+ * Note that this is an empty interface. Data source implementations should mix-in at least one of
+ * the plug-in interfaces like {@link ReadSupport} and {@link WriteSupport}. Otherwise it's just
+ * a dummy data source which is un-readable/writable.
  */
 @InterfaceStability.Evolving
 public interface DataSourceV2 {}

http://git-wip-us.apache.org/repos/asf/spark/blob/d43e1f06/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
index a8a9615..8fdfdfd 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/WriteSupport.java
@@ -36,8 +36,8 @@ public interface WriteSupport {
   * sources can return None if there is no writing needed to be done according to the save mode.
   *
   * @param jobId A unique string for the writing job. It's possible that there are many writing
-   *              jobs running at the same time, and the returned {@link DataSourceV2Writer} should
-   *              use this job id to distinguish itself with writers of other jobs.
+   *              jobs running at the same time, and the returned {@link DataSourceV2Writer} can
+   *              use this job id to distinguish itself from other jobs.
   * @param schema the schema of the data to be written.
   * @param mode the save mode which determines what to do when the data are already in this data
   *             source, please refer to {@link SaveMode} for more details.

http://git-wip-us.apache.org/repos/asf/spark/blob/d43e1f06/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
--
diff --git a/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java b/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceV2Reader.java
index 5989a4a..88c3219 100644
--- 

spark git commit: [SPARK-22190][CORE] Add Spark executor task metrics to Dropwizard metrics

2017-11-01 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 444bce1c9 -> 1ffe03d9e


[SPARK-22190][CORE] Add Spark executor task metrics to Dropwizard metrics

## What changes were proposed in this pull request?

This proposed patch makes Spark executor task metrics available as Dropwizard
metrics. This is intended to aid in monitoring Spark jobs and in drilling down
on performance troubleshooting issues.
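
As context for how these counters become visible, here is a hedged sketch of
wiring the metrics system to Dropwizard's ConsoleSink from an application. The
`spark.metrics.conf.*` keys and the ConsoleSink class follow Spark's standard
metrics configuration; the app name and the 10-second period are arbitrary
choices for illustration:

```python
# Route all metric sources, including the executor source that gains these task
# counters, to the Dropwizard ConsoleSink every 10 seconds.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("executor-task-metrics-demo")
        .set("spark.metrics.conf.*.sink.console.class",
             "org.apache.spark.metrics.sink.ConsoleSink")
        .set("spark.metrics.conf.*.sink.console.period", "10")
        .set("spark.metrics.conf.*.sink.console.unit", "seconds"))

sc = SparkContext(conf=conf)
```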

## How was this patch tested?

Manually tested on a Spark cluster (see JIRA for an example screenshot).

Author: LucaCanali 

Closes #19426 from LucaCanali/SparkTaskMetricsDropWizard.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ffe03d9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ffe03d9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ffe03d9

Branch: refs/heads/master
Commit: 1ffe03d9e87fb784cc8a0bae232c81c7b14deac9
Parents: 444bce1
Author: LucaCanali 
Authored: Wed Nov 1 15:40:25 2017 +0100
Committer: Wenchen Fan 
Committed: Wed Nov 1 15:40:25 2017 +0100

--
 .../org/apache/spark/executor/Executor.scala| 41 +
 .../apache/spark/executor/ExecutorSource.scala  | 48 
 2 files changed, 89 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1ffe03d9/core/src/main/scala/org/apache/spark/executor/Executor.scala
--
diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 2ecbb74..e3e555e 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -406,6 +406,47 @@ private[spark] class Executor(
         task.metrics.setJvmGCTime(computeTotalGcTime() - startGCTime)
         task.metrics.setResultSerializationTime(afterSerialization - beforeSerialization)
 
+        // Expose task metrics using the Dropwizard metrics system.
+        // Update task metrics counters
+        executorSource.METRIC_CPU_TIME.inc(task.metrics.executorCpuTime)
+        executorSource.METRIC_RUN_TIME.inc(task.metrics.executorRunTime)
+        executorSource.METRIC_JVM_GC_TIME.inc(task.metrics.jvmGCTime)
+        executorSource.METRIC_DESERIALIZE_TIME.inc(task.metrics.executorDeserializeTime)
+        executorSource.METRIC_DESERIALIZE_CPU_TIME.inc(task.metrics.executorDeserializeCpuTime)
+        executorSource.METRIC_RESULT_SERIALIZE_TIME.inc(task.metrics.resultSerializationTime)
+        executorSource.METRIC_SHUFFLE_FETCH_WAIT_TIME
+          .inc(task.metrics.shuffleReadMetrics.fetchWaitTime)
+        executorSource.METRIC_SHUFFLE_WRITE_TIME.inc(task.metrics.shuffleWriteMetrics.writeTime)
+        executorSource.METRIC_SHUFFLE_TOTAL_BYTES_READ
+          .inc(task.metrics.shuffleReadMetrics.totalBytesRead)
+        executorSource.METRIC_SHUFFLE_REMOTE_BYTES_READ
+          .inc(task.metrics.shuffleReadMetrics.remoteBytesRead)
+        executorSource.METRIC_SHUFFLE_REMOTE_BYTES_READ_TO_DISK
+          .inc(task.metrics.shuffleReadMetrics.remoteBytesReadToDisk)
+        executorSource.METRIC_SHUFFLE_LOCAL_BYTES_READ
+          .inc(task.metrics.shuffleReadMetrics.localBytesRead)
+        executorSource.METRIC_SHUFFLE_RECORDS_READ
+          .inc(task.metrics.shuffleReadMetrics.recordsRead)
+        executorSource.METRIC_SHUFFLE_REMOTE_BLOCKS_FETCHED
+          .inc(task.metrics.shuffleReadMetrics.remoteBlocksFetched)
+        executorSource.METRIC_SHUFFLE_LOCAL_BLOCKS_FETCHED
+          .inc(task.metrics.shuffleReadMetrics.localBlocksFetched)
+        executorSource.METRIC_SHUFFLE_BYTES_WRITTEN
+          .inc(task.metrics.shuffleWriteMetrics.bytesWritten)
+        executorSource.METRIC_SHUFFLE_RECORDS_WRITTEN
+          .inc(task.metrics.shuffleWriteMetrics.recordsWritten)
+        executorSource.METRIC_INPUT_BYTES_READ
+          .inc(task.metrics.inputMetrics.bytesRead)
+        executorSource.METRIC_INPUT_RECORDS_READ
+          .inc(task.metrics.inputMetrics.recordsRead)
+        executorSource.METRIC_OUTPUT_BYTES_WRITTEN
+          .inc(task.metrics.outputMetrics.bytesWritten)
+        executorSource.METRIC_OUTPUT_RECORDS_WRITTEN
+          .inc(task.metrics.inputMetrics.recordsRead)
+        executorSource.METRIC_RESULT_SIZE.inc(task.metrics.resultSize)
+        executorSource.METRIC_DISK_BYTES_SPILLED.inc(task.metrics.diskBytesSpilled)
+        executorSource.METRIC_MEMORY_BYTES_SPILLED.inc(task.metrics.memoryBytesSpilled)
+
         // Note: accumulator updates must be collected after TaskMetrics is updated
         val accumUpdates = task.collectAccumulatorUpdates()
         // TODO: do not serialize value twice


spark git commit: [SPARK-19112][CORE] Support for ZStandard codec

2017-11-01 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/master 07f390a27 -> 444bce1c9


[SPARK-19112][CORE] Support for ZStandard codec

## What changes were proposed in this pull request?

Using zstd compression for Spark jobs spilling 100s of TBs of data, we could
reduce the amount of data written to disk by as much as 50%. This translates to
a significant latency gain because of reduced disk I/O operations. There is a
2-5% degradation in CPU time because of the zstd compression overhead, but for
jobs that are bottlenecked by disk I/O, this hit can be taken.
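
For reference, a minimal sketch of opting a job into the new codec once this
patch is in, assuming the short codec name `zstd` that it registers for
`spark.io.compression.codec` (the app name is illustrative):

```python
# Use zstd for Spark's internal compression (shuffle spills, broadcasts, etc.).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("zstd-compression-demo")
        .set("spark.io.compression.codec", "zstd"))

sc = SparkContext(conf=conf)
```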

## Benchmark
Please note that this benchmark uses a real-world, compute-heavy production
workload spilling TBs of data to disk.

| | zstd performance as compared to LZ4 |
| - | -: |
| spill/shuffle bytes | -48% |
| cpu time | +3% |
| cpu reservation time | -40% |
| latency | -40% |

## How was this patch tested?

Tested by running a few jobs spilling a large amount of data on the cluster;
the amount of intermediate data written to disk was reduced by as much as 50%.

Author: Sital Kedia 

Closes #18805 from sitalkedia/skedia/upstream_zstd.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/444bce1c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/444bce1c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/444bce1c

Branch: refs/heads/master
Commit: 444bce1c98c45147fe63e2132e9743a0c5e49598
Parents: 07f390a
Author: Sital Kedia 
Authored: Wed Nov 1 14:54:08 2017 +0100
Committer: Herman van Hovell 
Committed: Wed Nov 1 14:54:08 2017 +0100

--
 LICENSE |  2 ++
 core/pom.xml|  4 +++
 .../org/apache/spark/io/CompressionCodec.scala  | 36 ++--
 .../apache/spark/io/CompressionCodecSuite.scala | 18 ++
 dev/deps/spark-deps-hadoop-2.6  |  1 +
 dev/deps/spark-deps-hadoop-2.7  |  1 +
 docs/configuration.md   | 20 ++-
 licenses/LICENSE-zstd-jni.txt   | 26 ++
 licenses/LICENSE-zstd.txt   | 30 
 pom.xml |  5 +++
 10 files changed, 140 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/444bce1c/LICENSE
--
diff --git a/LICENSE b/LICENSE
index 39fe0dc..c2b0d72 100644
--- a/LICENSE
+++ b/LICENSE
@@ -269,6 +269,8 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
  (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE)
  (BSD 3 Clause) DPark (https://github.com/douban/dpark/blob/master/LICENSE)
  (BSD 3 Clause) CloudPickle (https://github.com/cloudpipe/cloudpickle/blob/master/LICENSE)
+ (BSD 2 Clause) Zstd-jni (https://github.com/luben/zstd-jni/blob/master/LICENSE)
+ (BSD license) Zstd (https://github.com/facebook/zstd/blob/v1.3.1/LICENSE)
 
 
 MIT licenses

http://git-wip-us.apache.org/repos/asf/spark/blob/444bce1c/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index 54f7a34..fa138d3 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -199,6 +199,10 @@
       <artifactId>lz4-java</artifactId>
     </dependency>
     <dependency>
+      <groupId>com.github.luben</groupId>
+      <artifactId>zstd-jni</artifactId>
+    </dependency>
+    <dependency>
       <groupId>org.roaringbitmap</groupId>
       <artifactId>RoaringBitmap</artifactId>
     </dependency>

http://git-wip-us.apache.org/repos/asf/spark/blob/444bce1c/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
--
diff --git a/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala b/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
index 27f2e42..7722db5 100644
--- a/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
+++ b/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala
@@ -20,6 +20,7 @@ package org.apache.spark.io
 import java.io._
 import java.util.Locale
 
+import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
 import com.ning.compress.lzf.{LZFInputStream, LZFOutputStream}
 import net.jpountz.lz4.{LZ4BlockInputStream, LZ4BlockOutputStream}
 import org.xerial.snappy.{Snappy, SnappyInputStream, SnappyOutputStream}
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-  || codec.isInstanceOf[LZ4CompressionCodec])
+  || codec.isInstanceOf[LZ4CompressionCodec] 

spark git commit: [SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with conditional expressions

2017-11-01 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 96798d14f -> 07f390a27


[SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with 
conditional expressions

## What changes were proposed in this pull request?

Under the current execution mode of Python UDFs, Python UDFs are not well
supported as branch values or as the else value in a CaseWhen expression.

Since a proper fix would require a non-trivial change (e.g., #19592) and this
issue has a simpler workaround, we should just notify users about it in the
documentation.
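
For illustration, a minimal sketch of the workaround the new note recommends:
instead of relying on `when` to guard the UDF call, the condition is folded
into the UDF body so rows that do not satisfy it cannot fail. The column and
function names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4.0,), (0.0,)], ["x"])

# Risky: the UDF may still be invoked for rows where x == 0 even though when()
# appears to guard it, so evaluating this can fail with ZeroDivisionError.
inverse = F.udf(lambda x: 1.0 / x, DoubleType())
risky = df.select(F.when(F.col("x") != 0, inverse("x")).alias("inv"))

# Workaround: move the condition into the function body itself.
safe_inverse = F.udf(lambda x: 1.0 / x if x != 0 else None, DoubleType())
df.select(safe_inverse("x").alias("inv")).show()
```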

## How was this patch tested?

Only document change.

Author: Liang-Chi Hsieh 

Closes #19617 from viirya/SPARK-22347-3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/07f390a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/07f390a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/07f390a2

Branch: refs/heads/master
Commit: 07f390a27d7b793291c352a643d4bbd5f47294a6
Parents: 96798d1
Author: Liang-Chi Hsieh 
Authored: Wed Nov 1 13:09:35 2017 +0100
Committer: Wenchen Fan 
Committed: Wed Nov 1 13:09:35 2017 +0100

--
 python/pyspark/sql/functions.py | 14 ++
 1 file changed, 14 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/07f390a2/python/pyspark/sql/functions.py
--
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 0d40368..3981549 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2185,6 +2185,13 @@ def udf(f=None, returnType=StringType()):
     duplicate invocations may be eliminated or the function may even be invoked more times than
     it is present in the query.
 
+    .. note:: The user-defined functions do not support conditional execution by using them with
+        SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
+        matter the conditions are met or not. So the output is correct if the functions can be
+        correctly run on all rows without failure. If the functions can cause runtime failure on the
+        rows that do not satisfy the conditions, the suggested workaround is to incorporate the
+        condition logic into the functions.
+
     :param f: python function if used as a standalone function
     :param returnType: a :class:`pyspark.sql.types.DataType` object
 
@@ -2278,6 +2285,13 @@ def pandas_udf(f=None, returnType=StringType()):
     .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
 
     .. note:: The user-defined function must be deterministic.
+
+    .. note:: The user-defined functions do not support conditional execution by using them with
+        SQL conditional expressions such as `when` or `if`. The functions still apply on all rows no
+        matter the conditions are met or not. So the output is correct if the functions can be
+        correctly run on all rows without failure. If the functions can cause runtime failure on the
+        rows that do not satisfy the conditions, the suggested workaround is to incorporate the
+        condition logic into the functions.
     """
     return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22172][CORE] Worker hangs when the external shuffle service port is already in use

2017-11-01 Thread jshao
Repository: spark
Updated Branches:
  refs/heads/master 556b5d215 -> 96798d14f


[SPARK-22172][CORE] Worker hangs when the external shuffle service port is 
already in use

## What changes were proposed in this pull request?

Handle NonFatal exceptions while starting the external shuffle service: if any
NonFatal exception occurs, it is logged and the worker continues without the
external shuffle service.

## How was this patch tested?

I verified it manually; it logs the exception and continues to serve without
the external shuffle service when a BindException occurs.

Author: Devaraj K 

Closes #19396 from devaraj-kavali/SPARK-22172.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96798d14
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96798d14
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96798d14

Branch: refs/heads/master
Commit: 96798d14f07208796fa0a90af0ab369879bacd6c
Parents: 556b5d2
Author: Devaraj K 
Authored: Wed Nov 1 18:07:39 2017 +0800
Committer: jerryshao 
Committed: Wed Nov 1 18:07:39 2017 +0800

--
 .../scala/org/apache/spark/deploy/worker/Worker.scala   | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/96798d14/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
index ed5fa4b..3962d42 100755
--- a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
@@ -199,7 +199,7 @@ private[deploy] class Worker(
 logInfo(s"Running Spark version ${org.apache.spark.SPARK_VERSION}")
 logInfo("Spark home: " + sparkHome)
 createWorkDir()
-shuffleService.startIfEnabled()
+startExternalShuffleService()
 webUi = new WorkerWebUI(this, workDir, webUiPort)
 webUi.bind()
 
@@ -367,6 +367,16 @@ private[deploy] class Worker(
 }
   }
 
+  private def startExternalShuffleService() {
+    try {
+      shuffleService.startIfEnabled()
+    } catch {
+      case e: Exception =>
+        logError("Failed to start external shuffle service", e)
+        System.exit(1)
+    }
+  }
+
   private def sendRegisterMessageToMaster(masterEndpoint: RpcEndpointRef): Unit = {
     masterEndpoint.send(RegisterWorker(
       workerId,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-5484][FOLLOWUP] PeriodicRDDCheckpointer doc cleanup

2017-11-01 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 73231860b -> 556b5d215


[SPARK-5484][FOLLOWUP] PeriodicRDDCheckpointer doc cleanup

## What changes were proposed in this pull request?
PeriodicRDDCheckpointer was already moved out of MLlib in SPARK-5484.

## How was this patch tested?
existing tests

Author: Zheng RuiFeng 

Closes #19618 from zhengruifeng/checkpointer_doc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/556b5d21
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/556b5d21
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/556b5d21

Branch: refs/heads/master
Commit: 556b5d21512b17027a6e451c6a82fb428940e95a
Parents: 7323186
Author: Zheng RuiFeng 
Authored: Wed Nov 1 08:45:11 2017 +
Committer: Sean Owen 
Committed: Wed Nov 1 08:45:11 2017 +

--
 .../scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala  | 2 --
 1 file changed, 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/556b5d21/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
--
diff --git a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
index facbb83..5e181a9 100644
--- a/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/util/PeriodicRDDCheckpointer.scala
@@ -73,8 +73,6 @@ import org.apache.spark.util.PeriodicCheckpointer
  *
  * @param checkpointInterval  RDDs will be checkpointed at this interval
  * @tparam T  RDD element type
- *
- * TODO: Move this out of MLlib?
  */
 private[spark] class PeriodicRDDCheckpointer[T](
 checkpointInterval: Int,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org