spark git commit: [SPARK-22279][SQL] Enable `convertMetastoreOrc` by default

2018-05-09 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 62d01391f -> e3d434994


[SPARK-22279][SQL] Enable `convertMetastoreOrc` by default

## What changes were proposed in this pull request?

We reverted `spark.sql.hive.convertMetastoreOrc` in 
https://github.com/apache/spark/pull/20536 because it ignored table-specific 
compression configuration. That issue has since been resolved by 
[SPARK-23355](https://github.com/apache/spark/commit/8aa1d7b0ede5115297541d29eab4ce5f4fe905cb), so the conversion can be enabled by default again.

## How was this patch tested?

Passes the existing Jenkins tests.

Author: Dongjoon Hyun 

Closes #21186 from dongjoon-hyun/SPARK-24112.
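
With this change, metastore ORC tables are read through Spark's native ORC data source by default. A minimal sketch of opting back out, assuming an active SparkSession named `spark` (both properties are documented in the guide change below):

```
// Hedged sketch: restore the pre-2.4 behavior for metastore ORC tables.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
// Optionally fall back to the Hive 1.2.1 ORC reader instead of the native one.
spark.conf.set("spark.sql.orc.impl", "hive")
```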


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3d43499
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3d43499
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3d43499

Branch: refs/heads/master
Commit: e3d434994733ae16e7e1424fb6de2d22b1a13f99
Parents: 62d0139
Author: Dongjoon Hyun 
Authored: Thu May 10 13:36:52 2018 +0800
Committer: Wenchen Fan 
Committed: Thu May 10 13:36:52 2018 +0800

--
 docs/sql-programming-guide.md | 3 ++-
 sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala | 3 +--
 2 files changed, 3 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e3d43499/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 3e8946e..3f79ed6 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1017,7 +1017,7 @@ the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also
   Property Name: spark.sql.orc.impl
-  Default: hive
+  Default: native
   Meaning: The name of ORC implementation. It can be one of `native` and `hive`. `native` means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.
@@ -1813,6 +1813,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
  - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, regardless of the order of the input arguments. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception.
  - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone, so these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed: `to_utc_timestamp` and `from_utc_timestamp` return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` returns `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, returns `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. Users who do not care about this problem and want to retain the previous behavior to keep their queries unchanged can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
  - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. The same happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3; in Spark 2.4, the result would be uncompressed parquet files.
+  - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of the Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with the Hive SerDe in Spark 2.3; in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
 
 ## Upgrading From Spark SQL 2.2 to 2.3
 


[2/2] spark git commit: [SPARK-24073][SQL] Rename DataReaderFactory to InputPartition.

2018-05-09 Thread lixiao
[SPARK-24073][SQL] Rename DataReaderFactory to InputPartition.

## What changes were proposed in this pull request?

Renames:
* `DataReaderFactory` to `InputPartition`
* `DataReader` to `InputPartitionReader`
* `createDataReaderFactories` to `planInputPartitions`
* `createUnsafeDataReaderFactories` to `planUnsafeInputPartitions`
* `createBatchDataReaderFactories` to `planBatchInputPartitions`

This fixes the changes in SPARK-23219, which renamed ReadTask to
DataReaderFactory. The intent of that change was to make the read and
write API match (write side uses DataWriterFactory), but the underlying
problem is that the two classes are not equivalent.

ReadTask/DataReader function as Iterable/Iterator. One InputPartition is
a specific partition of the data to be read, in contrast to
DataWriterFactory, where the same factory instance is used in all write
tasks. InputPartition's purpose is to manage the lifecycle of the
associated reader, now called InputPartitionReader, with an explicit
create operation to mirror the close operation. This was no longer clear
from the API: DataReaderFactory appeared to be more generic than it is,
and it was not obvious why a set of them is produced for a read.

## How was this patch tested?

Existing tests, which have been updated to use the new name.

Author: Ryan Blue 

Closes #21145 from rdblue/SPARK-24073-revert-data-reader-factory-rename.
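
A minimal Scala sketch of the read path after the rename, modeled on the JavaSimpleDataSourceV2 test updated in this change; the interface shapes are taken from this commit's diff and may differ in later Spark revisions, so treat it as illustrative only:

```
import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.StructType

// One InputPartition per partition of the data to be read; it exists only to
// create (and thereby scope the lifecycle of) its InputPartitionReader.
class RangePartition(start: Int, end: Int) extends InputPartition[Row] {
  override def createPartitionReader(): InputPartitionReader[Row] =
    new InputPartitionReader[Row] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): Row = Row(current)
      override def close(): Unit = ()
    }
}

class SimpleRangeSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      override def readSchema(): StructType = new StructType().add("i", "int")
      // Formerly createDataReaderFactories(): plan one partition per split.
      override def planInputPartitions(): JList[InputPartition[Row]] =
        Seq[InputPartition[Row]](new RangePartition(0, 5), new RangePartition(5, 10)).asJava
    }
}
```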


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/62d01391
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/62d01391
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/62d01391

Branch: refs/heads/master
Commit: 62d01391fee77eedd75b4e3f475ede8b9f0df0c2
Parents: 9341c95
Author: Ryan Blue 
Authored: Wed May 9 21:48:54 2018 -0700
Committer: gatorsmile 
Committed: Wed May 9 21:48:54 2018 -0700

--
 .../sql/kafka010/KafkaContinuousReader.scala| 20 ++---
 .../sql/kafka010/KafkaMicroBatchReader.scala| 21 +++---
 .../sql/kafka010/KafkaSourceProvider.scala  |  3 +-
 .../kafka010/KafkaMicroBatchSourceSuite.scala   |  2 +-
 .../sql/sources/v2/MicroBatchReadSupport.java   |  2 +-
 .../v2/reader/ContinuousDataReaderFactory.java  | 35 -
 .../v2/reader/ContinuousInputPartition.java | 35 +
 .../spark/sql/sources/v2/reader/DataReader.java | 53 -
 .../sources/v2/reader/DataReaderFactory.java| 61 ---
 .../sql/sources/v2/reader/DataSourceReader.java | 10 +--
 .../sql/sources/v2/reader/InputPartition.java   | 61 +++
 .../sources/v2/reader/InputPartitionReader.java | 53 +
 .../v2/reader/SupportsReportPartitioning.java   |  2 +-
 .../v2/reader/SupportsScanColumnarBatch.java| 10 +--
 .../v2/reader/SupportsScanUnsafeRow.java|  8 +-
 .../partitioning/ClusteredDistribution.java |  4 +-
 .../v2/reader/partitioning/Distribution.java|  6 +-
 .../v2/reader/partitioning/Partitioning.java|  4 +-
 .../reader/streaming/ContinuousDataReader.java  | 36 -
 .../ContinuousInputPartitionReader.java | 36 +
 .../v2/reader/streaming/ContinuousReader.java   | 10 +--
 .../v2/reader/streaming/MicroBatchReader.java   |  2 +-
 .../datasources/v2/DataSourceRDD.scala  | 11 +--
 .../datasources/v2/DataSourceV2ScanExec.scala   | 46 ++--
 .../continuous/ContinuousDataSourceRDD.scala| 22 +++---
 .../continuous/ContinuousQueuedDataReader.scala | 13 ++--
 .../continuous/ContinuousRateStreamSource.scala | 20 ++---
 .../spark/sql/execution/streaming/memory.scala  | 12 +--
 .../sources/ContinuousMemoryStream.scala| 18 ++---
 .../sources/RateStreamMicroBatchReader.scala| 15 ++--
 .../execution/streaming/sources/socket.scala| 29 
 .../sources/v2/JavaAdvancedDataSourceV2.java| 22 +++---
 .../sql/sources/v2/JavaBatchDataSourceV2.java   | 12 +--
 .../v2/JavaPartitionAwareDataSource.java| 12 +--
 .../v2/JavaSchemaRequiredDataSource.java|  4 +-
 .../sql/sources/v2/JavaSimpleDataSourceV2.java  | 18 ++---
 .../sources/v2/JavaUnsafeRowDataSourceV2.java   | 16 ++--
 .../sources/RateStreamProviderSuite.scala   | 12 +--
 .../sql/sources/v2/DataSourceV2Suite.scala  | 78 ++--
 .../sources/v2/SimpleWritableDataSource.scala   | 14 ++--
 .../sql/streaming/StreamingQuerySuite.scala | 11 +--
 .../ContinuousQueuedDataReaderSuite.scala   |  8 +-
 .../sources/StreamingDataSourceV2Suite.scala|  4 +-
 43 files changed, 440 insertions(+), 431 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
--
diff --git 

[1/2] spark git commit: [SPARK-24073][SQL] Rename DataReaderFactory to InputPartition.

2018-05-09 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 9341c951e -> 62d01391f


http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java
 
b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java
index 048d078..80eeffd 100644
--- 
a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java
+++ 
b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java
@@ -24,7 +24,7 @@ import org.apache.spark.sql.sources.v2.DataSourceOptions;
 import org.apache.spark.sql.sources.v2.DataSourceV2;
 import org.apache.spark.sql.sources.v2.ReadSupportWithSchema;
 import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
-import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;
+import org.apache.spark.sql.sources.v2.reader.InputPartition;
 import org.apache.spark.sql.types.StructType;
 
 public class JavaSchemaRequiredDataSource implements DataSourceV2, 
ReadSupportWithSchema {
@@ -42,7 +42,7 @@ public class JavaSchemaRequiredDataSource implements 
DataSourceV2, ReadSupportWi
 }
 
 @Override
-public List createDataReaderFactories() {
+public List planInputPartitions() {
   return java.util.Collections.emptyList();
 }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java
 
b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java
index 96f55b8..8522a63 100644
--- 
a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java
+++ 
b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java
@@ -25,8 +25,8 @@ import org.apache.spark.sql.catalyst.expressions.GenericRow;
 import org.apache.spark.sql.sources.v2.DataSourceV2;
 import org.apache.spark.sql.sources.v2.DataSourceOptions;
 import org.apache.spark.sql.sources.v2.ReadSupport;
-import org.apache.spark.sql.sources.v2.reader.DataReader;
-import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;
+import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;
+import org.apache.spark.sql.sources.v2.reader.InputPartition;
 import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
 import org.apache.spark.sql.types.StructType;
 
@@ -41,25 +41,25 @@ public class JavaSimpleDataSourceV2 implements 
DataSourceV2, ReadSupport {
 }
 
 @Override
-public List createDataReaderFactories() {
+public List planInputPartitions() {
   return java.util.Arrays.asList(
-new JavaSimpleDataReaderFactory(0, 5),
-new JavaSimpleDataReaderFactory(5, 10));
+new JavaSimpleInputPartition(0, 5),
+new JavaSimpleInputPartition(5, 10));
 }
   }
 
-  static class JavaSimpleDataReaderFactory implements DataReaderFactory, 
DataReader {
+  static class JavaSimpleInputPartition implements InputPartition, 
InputPartitionReader {
 private int start;
 private int end;
 
-JavaSimpleDataReaderFactory(int start, int end) {
+JavaSimpleInputPartition(int start, int end) {
   this.start = start;
   this.end = end;
 }
 
 @Override
-public DataReader createDataReader() {
-  return new JavaSimpleDataReaderFactory(start - 1, end);
+public InputPartitionReader createPartitionReader() {
+  return new JavaSimpleInputPartition(start - 1, end);
 }
 
 @Override

http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java
--
diff --git 
a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java
 
b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java
index c3916e0..3ad8e7a 100644
--- 
a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java
+++ 
b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java
@@ -38,20 +38,20 @@ public class JavaUnsafeRowDataSourceV2 implements 
DataSourceV2, ReadSupport {
 }
 
 @Override
-public List createUnsafeRowReaderFactories() 
{
+public List planUnsafeInputPartitions() {
   return java.util.Arrays.asList(
-new JavaUnsafeRowDataReaderFactory(0, 5),
-new 

spark git commit: [SPARK-23852][SQL] Add test that fails if PARQUET-1217 is not fixed

2018-05-09 Thread lixiao
Repository: spark
Updated Branches:
  refs/heads/master 9e3bb3136 -> 9341c951e


[SPARK-23852][SQL] Add test that fails if PARQUET-1217 is not fixed

## What changes were proposed in this pull request?

Add a new test that triggers if PARQUET-1217 - a predicate pushdown bug - is 
not fixed in Spark's Parquet dependency.

## How was this patch tested?

New unit test passes.

Author: Henry Robinson 

Closes #21284 from henryr/spark-23852.
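
For users on a Spark build whose Parquet dependency still carries PARQUET-1217, a hedged workaround sketch, assuming an active SparkSession `spark` (the file path is hypothetical; `spark.sql.parquet.filterPushdown` is an existing SQL conf):

```
// Disable Parquet predicate pushdown so filters are evaluated by Spark itself
// rather than against row-group statistics that have null counts but no min/max.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
val df = spark.read.parquet("/path/to/affected-file.parquet")  // hypothetical path
df.where("col > 0").count()  // no longer returns 0 due to the broken stats
```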


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9341c951
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9341c951
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9341c951

Branch: refs/heads/master
Commit: 9341c951e85ff29714cbee302053872a6a4223da
Parents: 9e3bb31
Author: Henry Robinson 
Authored: Wed May 9 19:56:03 2018 -0700
Committer: gatorsmile 
Committed: Wed May 9 19:56:03 2018 -0700

--
 .../test/resources/test-data/parquet-1217.parquet| Bin 0 -> 321 bytes
 .../datasources/parquet/ParquetFilterSuite.scala |  10 ++
 2 files changed, 10 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9341c951/sql/core/src/test/resources/test-data/parquet-1217.parquet
--
diff --git a/sql/core/src/test/resources/test-data/parquet-1217.parquet 
b/sql/core/src/test/resources/test-data/parquet-1217.parquet
new file mode 100644
index 000..eb2dc4f
Binary files /dev/null and 
b/sql/core/src/test/resources/test-data/parquet-1217.parquet differ

http://git-wip-us.apache.org/repos/asf/spark/blob/9341c951/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
index 667e0b1..4d0ecde 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@@ -648,6 +648,16 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   }
 }
   }
+
+  test("SPARK-23852: Broken Parquet push-down for partially-written stats") {
+// parquet-1217.parquet contains a single column with values -1, 0, 1, 2 
and null.
+// The row-group statistics include null counts, but not min and max 
values, which
+// triggers PARQUET-1217.
+val df = readResourceParquetFile("test-data/parquet-1217.parquet")
+
+// Will return 0 rows if PARQUET-1217 is not fixed.
+assert(df.where("col > 0").count() === 2)
+  }
 }
 
 class NumRowGroupsAcc extends AccumulatorV2[Integer, Integer] {




svn commit: r26820 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_09_16_01-9e3bb31-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-09 Thread pwendell
Author: pwendell
Date: Wed May  9 23:15:57 2018
New Revision: 26820

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_09_16_01-9e3bb31 docs


[This commit notification would consist of 1461 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24141][CORE] Fix bug in CoarseGrainedSchedulerBackend.killExecutors

2018-05-09 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master fd1179c17 -> 9e3bb3136


[SPARK-24141][CORE] Fix bug in CoarseGrainedSchedulerBackend.killExecutors

## What changes were proposed in this pull request?

In *CoarseGrainedSchedulerBackend.killExecutors()*, `numPendingExecutors` should 
be incremented by `executorsToKill.size` rather than `knownExecutors.size` when 
we do not adjust the target number of executors.

## How was this patch tested?

N/A

Author: wuyi 

Closes #21209 from Ngone51/SPARK-24141.
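
The corrected accounting matters on the caller side roughly as follows; a sketch assuming a running SparkContext `sc` and a cluster manager that honors kill requests (executor IDs are placeholders):

```
// Request that specific executors be killed.
// CoarseGrainedSchedulerBackend.killExecutors tracks them via numPendingExecutors,
// which this patch now increments by the number actually scheduled for removal.
val executorsToKill = Seq("1", "2")  // hypothetical executor IDs
sc.killExecutors(executorsToKill)
```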


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e3bb313
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e3bb313
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e3bb313

Branch: refs/heads/master
Commit: 9e3bb313682bf88d0c81427167ee341698d29b02
Parents: fd1179c
Author: wuyi 
Authored: Wed May 9 15:44:36 2018 -0700
Committer: Marcelo Vanzin 
Committed: Wed May 9 15:44:36 2018 -0700

--
 .../spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9e3bb313/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
index 5627a55..d8794e8 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -633,7 +633,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   }
   doRequestTotalExecutors(requestedTotalExecutors)
 } else {
-  numPendingExecutors += knownExecutors.size
+  numPendingExecutors += executorsToKill.size
   Future.successful(true)
 }
 





svn commit: r26819 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_09_14_01-8889d78-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-09 Thread pwendell
Author: pwendell
Date: Wed May  9 21:15:26 2018
New Revision: 26819

Log:
Apache Spark 2.3.1-SNAPSHOT-2018_05_09_14_01-8889d78 docs


[This commit notification would consist of 1443 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




svn commit: r26817 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_09_12_01-fd1179c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-09 Thread pwendell
Author: pwendell
Date: Wed May  9 19:16:51 2018
New Revision: 26817

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_09_12_01-fd1179c docs


[This commit notification would consist of 1461 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]




spark git commit: [SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation

2018-05-09 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 aba52f449 -> 8889d7864


[SPARK-24214][SS] Fix toJSON for 
StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation

## What changes were proposed in this pull request?

We should overwrite "otherCopyArgs" to provide the SparkSession parameter 
otherwise TreeNode.toJSON cannot get the full constructor parameter list.

## How was this patch tested?

The new unit test.

Author: Shixiong Zhu 

Closes #21275 from zsxwing/SPARK-24214.

(cherry picked from commit fd1179c17273283d32f275d5cd5f97aaa2aca1f7)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8889d786
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8889d786
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8889d786

Branch: refs/heads/branch-2.3
Commit: 8889d78643154a0eb5ce81363ba471a80a1e64f1
Parents: aba52f4
Author: Shixiong Zhu 
Authored: Wed May 9 11:32:17 2018 -0700
Committer: Shixiong Zhu 
Committed: Wed May 9 11:32:27 2018 -0700

--
 .../sql/execution/streaming/StreamingRelation.scala  |  3 +++
 .../spark/sql/streaming/StreamingQuerySuite.scala| 15 +++
 2 files changed, 18 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8889d786/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
index f02d3a2..24195b5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
@@ -66,6 +66,7 @@ case class StreamingExecutionRelation(
 output: Seq[Attribute])(session: SparkSession)
   extends LeafNode with MultiInstanceRelation {
 
+  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
   override def isStreaming: Boolean = true
   override def toString: String = source.toString
 
@@ -97,6 +98,7 @@ case class StreamingRelationV2(
 output: Seq[Attribute],
 v1Relation: Option[StreamingRelation])(session: SparkSession)
   extends LeafNode with MultiInstanceRelation {
+  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
   override def isStreaming: Boolean = true
   override def toString: String = sourceName
 
@@ -116,6 +118,7 @@ case class ContinuousExecutionRelation(
 output: Seq[Attribute])(session: SparkSession)
   extends LeafNode with MultiInstanceRelation {
 
+  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
   override def isStreaming: Boolean = true
   override def toString: String = source.toString
 

http://git-wip-us.apache.org/repos/asf/spark/blob/8889d786/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index 2b0ab33..e3429b5 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -687,6 +687,21 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
   CheckLastBatch(("A", 1)))
   }
 
+  
test("StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation.toJSON
 " +
+"should not fail") {
+val df = spark.readStream.format("rate").load()
+assert(df.logicalPlan.toJSON.contains("StreamingRelationV2"))
+
+testStream(df)(
+  
AssertOnQuery(_.logicalPlan.toJSON.contains("StreamingExecutionRelation"))
+)
+
+testStream(df, useV2Sink = true)(
+  StartStream(trigger = Trigger.Continuous(100)),
+  
AssertOnQuery(_.logicalPlan.toJSON.contains("ContinuousExecutionRelation"))
+)
+  }
+
   /** Create a streaming DF that only execute one batch in which it returns 
the given static DF */
   private def createSingleTriggerStreamingDF(triggerDF: DataFrame): DataFrame 
= {
 require(!triggerDF.isStreaming)





spark git commit: [SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation

2018-05-09 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master 7aaa148f5 -> fd1179c17


[SPARK-24214][SS] Fix toJSON for 
StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation

## What changes were proposed in this pull request?

We should overwrite "otherCopyArgs" to provide the SparkSession parameter 
otherwise TreeNode.toJSON cannot get the full constructor parameter list.

## How was this patch tested?

The new unit test.

Author: Shixiong Zhu 

Closes #21275 from zsxwing/SPARK-24214.
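
A quick way to exercise the fix from user code, assuming an active SparkSession `spark`; the new test uses the private `logicalPlan` handle, so this sketch goes through the public `queryExecution` instead:

```
val df = spark.readStream.format("rate").load()
// Before this patch, serializing plans that contain StreamingRelationV2 and
// friends could fail because the SparkSession constructor argument was not
// exposed through otherCopyArgs; with the fix, toJSON succeeds.
assert(df.queryExecution.logical.toJSON.contains("StreamingRelationV2"))
```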


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd1179c1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd1179c1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd1179c1

Branch: refs/heads/master
Commit: fd1179c17273283d32f275d5cd5f97aaa2aca1f7
Parents: 7aaa148
Author: Shixiong Zhu 
Authored: Wed May 9 11:32:17 2018 -0700
Committer: Shixiong Zhu 
Committed: Wed May 9 11:32:17 2018 -0700

--
 .../sql/execution/streaming/StreamingRelation.scala  |  3 +++
 .../spark/sql/streaming/StreamingQuerySuite.scala| 15 +++
 2 files changed, 18 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fd1179c1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
index f02d3a2..24195b5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
@@ -66,6 +66,7 @@ case class StreamingExecutionRelation(
 output: Seq[Attribute])(session: SparkSession)
   extends LeafNode with MultiInstanceRelation {
 
+  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
   override def isStreaming: Boolean = true
   override def toString: String = source.toString
 
@@ -97,6 +98,7 @@ case class StreamingRelationV2(
 output: Seq[Attribute],
 v1Relation: Option[StreamingRelation])(session: SparkSession)
   extends LeafNode with MultiInstanceRelation {
+  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
   override def isStreaming: Boolean = true
   override def toString: String = sourceName
 
@@ -116,6 +118,7 @@ case class ContinuousExecutionRelation(
 output: Seq[Attribute])(session: SparkSession)
   extends LeafNode with MultiInstanceRelation {
 
+  override def otherCopyArgs: Seq[AnyRef] = session :: Nil
   override def isStreaming: Boolean = true
   override def toString: String = source.toString
 

http://git-wip-us.apache.org/repos/asf/spark/blob/fd1179c1/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
index 0cb2375..5798699 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
@@ -831,6 +831,21 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
   CheckLastBatch(("A", 1)))
   }
 
+  
test("StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation.toJSON
 " +
+"should not fail") {
+val df = spark.readStream.format("rate").load()
+assert(df.logicalPlan.toJSON.contains("StreamingRelationV2"))
+
+testStream(df)(
+  
AssertOnQuery(_.logicalPlan.toJSON.contains("StreamingExecutionRelation"))
+)
+
+testStream(df, useV2Sink = true)(
+  StartStream(trigger = Trigger.Continuous(100)),
+  
AssertOnQuery(_.logicalPlan.toJSON.contains("ContinuousExecutionRelation"))
+)
+  }
+
   /** Create a streaming DF that only execute one batch in which it returns 
the given static DF */
   private def createSingleTriggerStreamingDF(triggerDF: DataFrame): DataFrame 
= {
 require(!triggerDF.isStreaming)





spark git commit: [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-05-09 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 628c7b517 -> 7aaa148f5


[SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for 
spark.ml GBTs

## What changes were proposed in this pull request?

Provide evaluateEachIteration method or equivalent for spark.ml GBTs.

## How was this patch tested?

Unit tests.

Author: WeichenXu 

Closes #21097 from WeichenXu123/GBTeval.
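
A usage sketch of the new method, assuming DataFrames `train` and `validation` with the default `label`/`features` columns (names and data are placeholders):

```
import org.apache.spark.ml.classification.GBTClassifier

val model = new GBTClassifier().setMaxIter(20).fit(train)
// One loss value per boosting iteration, computed on the validation set;
// handy for choosing how many trees to keep.
val lossPerIteration: Array[Double] = model.evaluateEachIteration(validation)
val bestNumTrees = lossPerIteration.zipWithIndex.minBy(_._1)._2 + 1
```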


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7aaa148f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7aaa148f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7aaa148f

Branch: refs/heads/master
Commit: 7aaa148f593470b2c32221b69097b8b54524eb74
Parents: 628c7b5
Author: WeichenXu 
Authored: Wed May 9 11:09:19 2018 -0700
Committer: Joseph K. Bradley 
Committed: Wed May 9 11:09:19 2018 -0700

--
 .../spark/ml/classification/GBTClassifier.scala | 15 +
 .../spark/ml/regression/GBTRegressor.scala  | 17 ++-
 .../org/apache/spark/ml/tree/treeParams.scala   |  6 +++-
 .../ml/classification/GBTClassifierSuite.scala  | 29 +-
 .../spark/ml/regression/GBTRegressorSuite.scala | 32 ++--
 5 files changed, 94 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala 
b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
index 0aa24f0..3fb6d1e 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
@@ -334,6 +334,21 @@ class GBTClassificationModel private[ml](
   // hard coded loss, which is not meant to be changed in the model
   private val loss = getOldLossType
 
+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
+   */
+  @Since("2.4.0")
+  def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = {
+val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
+  case Row(label: Double, features: Vector) => LabeledPoint(label, 
features)
+}
+GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, loss,
+  OldAlgo.Classification
+)
+  }
+
   @Since("2.0.0")
   override def write: MLWriter = new 
GBTClassificationModel.GBTClassificationModelWriter(this)
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala 
b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
index 8598e80..d7e054b 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala
@@ -34,7 +34,7 @@ import org.apache.spark.ml.util.DefaultParamsReader.Metadata
 import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo}
 import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => 
OldGBTModel}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
 import org.apache.spark.sql.functions._
 
 /**
@@ -269,6 +269,21 @@ class GBTRegressionModel private[ml](
 new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }
 
+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
+   * @param loss The loss function used to compute error. Supported options: 
squared, absolute
+   */
+  @Since("2.4.0")
+  def evaluateEachIteration(dataset: Dataset[_], loss: String): Array[Double] 
= {
+val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map {
+  case Row(label: Double, features: Vector) => LabeledPoint(label, 
features)
+}
+GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights,
+  convertToOldLossType(loss), OldAlgo.Regression)
+  }
+
   @Since("2.0.0")
   override def write: MLWriter = new 
GBTRegressionModel.GBTRegressionModelWriter(this)
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

spark git commit: [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6

2018-05-09 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 866270ea5 -> f9d6a16ce


[SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the 
latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

Manual.

Author: Dongjoon Hyun 

Closes #18546 from dongjoon-hyun/SPARK-21278.

(cherry picked from commit c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9d6a16c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9d6a16c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9d6a16c

Branch: refs/heads/branch-2.2
Commit: f9d6a16cebd55f4dcd1af102ad2fe7ebedee2e74
Parents: 866270e
Author: Dongjoon Hyun 
Authored: Wed Jul 5 16:33:23 2017 -0700
Committer: Marcelo Vanzin 
Committed: Tue May 8 11:21:22 2018 -0700

--
 LICENSE|   2 +-
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2.6 |   2 +-
 dev/deps/spark-deps-hadoop-2.7 |   2 +-
 python/README.md   |   2 +-
 python/docs/Makefile   |   2 +-
 python/lib/py4j-0.10.4-src.zip | Bin 74096 -> 0 bytes
 python/lib/py4j-0.10.6-src.zip | Bin 0 -> 80352 bytes
 python/setup.py|   2 +-
 .../org/apache/spark/deploy/yarn/Client.scala  |   2 +-
 .../spark/deploy/yarn/YarnClusterSuite.scala   |   2 +-
 sbin/spark-config.sh   |   2 +-
 15 files changed, 13 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f9d6a16c/LICENSE
--
diff --git a/LICENSE b/LICENSE
index 66a2e8f..39fe0dc 100644
--- a/LICENSE
+++ b/LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at 
licenses/LICENSE-[project].txt.
  (New BSD license) Protocol Buffer Java API 
(org.spark-project.protobuf:protobuf-java:2.4.1-shaded - 
http://code.google.com/p/protobuf)
  (The BSD License) Fortran to Java ARPACK 
(net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
  (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - 
http://xmlenc.sourceforge.net)
- (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.4 - 
http://py4j.sourceforge.net/)
+ (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - 
http://py4j.sourceforge.net/)
  (Two-clause BSD-style license) JUnit-Interface 
(com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
  (BSD licence) sbt and sbt-launch-lib.bash
  (BSD 3 Clause) d3.min.js 
(https://github.com/mbostock/d3/blob/master/LICENSE)

http://git-wip-us.apache.org/repos/asf/spark/blob/f9d6a16c/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index 98387c2..d3b512e 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -57,7 +57,7 @@ export PYSPARK_PYTHON
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"

http://git-wip-us.apache.org/repos/asf/spark/blob/f9d6a16c/bin/pyspark2.cmd
--
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index d1ce9da..663670f 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.6-src.zip;%PYTHONPATH%
 
 set 

spark git commit: [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6

2018-05-09 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 c4ecc04c6 -> 8177b2148


[SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the 
latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

Manual.

Author: Dongjoon Hyun 

Closes #18546 from dongjoon-hyun/SPARK-21278.

(cherry picked from commit c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2)
Signed-off-by: Marcelo Vanzin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8177b214
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8177b214
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8177b214

Branch: refs/heads/branch-2.1
Commit: 8177b214899320ea8cf18666f31e4653a42b
Parents: c4ecc04
Author: Dongjoon Hyun 
Authored: Wed Jul 5 16:33:23 2017 -0700
Committer: Marcelo Vanzin 
Committed: Tue May 8 12:15:34 2018 -0700

--
 LICENSE|   2 +-
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2.6 |   2 +-
 dev/deps/spark-deps-hadoop-2.7 |   2 +-
 python/README.md   |   2 +-
 python/docs/Makefile   |   2 +-
 python/lib/py4j-0.10.4-src.zip | Bin 74096 -> 0 bytes
 python/lib/py4j-0.10.6-src.zip | Bin 0 -> 80352 bytes
 python/setup.py|   2 +-
 sbin/spark-config.sh   |   2 +-
 .../org/apache/spark/deploy/yarn/Client.scala  |   2 +-
 .../spark/deploy/yarn/YarnClusterSuite.scala   |   2 +-
 15 files changed, 13 insertions(+), 13 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8177b214/LICENSE
--
diff --git a/LICENSE b/LICENSE
index 119ecbe..02aaef2 100644
--- a/LICENSE
+++ b/LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at 
licenses/LICENSE-[project].txt.
  (New BSD license) Protocol Buffer Java API 
(org.spark-project.protobuf:protobuf-java:2.4.1-shaded - 
http://code.google.com/p/protobuf)
  (The BSD License) Fortran to Java ARPACK 
(net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
  (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - 
http://xmlenc.sourceforge.net)
- (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.4 - 
http://py4j.sourceforge.net/)
+ (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - 
http://py4j.sourceforge.net/)
  (Two-clause BSD-style license) JUnit-Interface 
(com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
  (BSD licence) sbt and sbt-launch-lib.bash
  (BSD 3 Clause) d3.min.js 
(https://github.com/mbostock/d3/blob/master/LICENSE)

http://git-wip-us.apache.org/repos/asf/spark/blob/8177b214/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index 98387c2..d3b512e 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -57,7 +57,7 @@ export PYSPARK_PYTHON
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"

http://git-wip-us.apache.org/repos/asf/spark/blob/8177b214/bin/pyspark2.cmd
--
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index f211c08..46d4d5c 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.6-src.zip;%PYTHONPATH%
 
 set 

[1/2] spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol.

2018-05-09 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 94155d039 -> 628c7b517


[SPARKR] Match pyspark features in SparkR communication protocol.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/628c7b51
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/628c7b51
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/628c7b51

Branch: refs/heads/master
Commit: 628c7b517969c4a7ccb26ea67ab3dd61266073ca
Parents: cc613b5
Author: Marcelo Vanzin 
Authored: Tue Apr 17 13:29:43 2018 -0700
Committer: Marcelo Vanzin 
Committed: Wed May 9 10:47:35 2018 -0700

--
 R/pkg/R/client.R|  4 +-
 R/pkg/R/deserialize.R   | 10 ++--
 R/pkg/R/sparkR.R| 39 --
 R/pkg/inst/worker/daemon.R  |  4 +-
 R/pkg/inst/worker/worker.R  |  5 +-
 .../org/apache/spark/api/r/RAuthHelper.scala| 38 ++
 .../scala/org/apache/spark/api/r/RBackend.scala | 43 ---
 .../spark/api/r/RBackendAuthHandler.scala   | 55 
 .../scala/org/apache/spark/api/r/RRunner.scala  | 35 +
 .../scala/org/apache/spark/deploy/RRunner.scala |  6 ++-
 10 files changed, 210 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/628c7b51/R/pkg/R/client.R
--
diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R
index 9d82814..7244cc9 100644
--- a/R/pkg/R/client.R
+++ b/R/pkg/R/client.R
@@ -19,7 +19,7 @@
 
 # Creates a SparkR client connection object
 # if one doesn't already exist
-connectBackend <- function(hostname, port, timeout) {
+connectBackend <- function(hostname, port, timeout, authSecret) {
   if (exists(".sparkRcon", envir = .sparkREnv)) {
 if (isOpen(.sparkREnv[[".sparkRCon"]])) {
   cat("SparkRBackend client connection already exists\n")
@@ -29,7 +29,7 @@ connectBackend <- function(hostname, port, timeout) {
 
   con <- socketConnection(host = hostname, port = port, server = FALSE,
   blocking = TRUE, open = "wb", timeout = timeout)
-
+  doServerAuth(con, authSecret)
   assign(".sparkRCon", con, envir = .sparkREnv)
   con
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/628c7b51/R/pkg/R/deserialize.R
--
diff --git a/R/pkg/R/deserialize.R b/R/pkg/R/deserialize.R
index a90f7d3..cb03f16 100644
--- a/R/pkg/R/deserialize.R
+++ b/R/pkg/R/deserialize.R
@@ -60,14 +60,18 @@ readTypedObject <- function(con, type) {
 stop(paste("Unsupported type for deserialization", type)))
 }
 
-readString <- function(con) {
-  stringLen <- readInt(con)
-  raw <- readBin(con, raw(), stringLen, endian = "big")
+readStringData <- function(con, len) {
+  raw <- readBin(con, raw(), len, endian = "big")
   string <- rawToChar(raw)
   Encoding(string) <- "UTF-8"
   string
 }
 
+readString <- function(con) {
+  stringLen <- readInt(con)
+  readStringData(con, stringLen)
+}
+
 readInt <- function(con) {
   readBin(con, integer(), n = 1, endian = "big")
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/628c7b51/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index a480ac6..38ee794 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -158,6 +158,10 @@ sparkR.sparkContext <- function(
 " please use the --packages commandline instead", sep = 
","))
 }
 backendPort <- existingPort
+authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET")
+if (nchar(authSecret) == 0) {
+  stop("Auth secret not provided in environment.")
+}
   } else {
 path <- tempfile(pattern = "backend_port")
 submitOps <- getClientModeSparkSubmitOpts(
@@ -186,16 +190,27 @@ sparkR.sparkContext <- function(
 monitorPort <- readInt(f)
 rLibPath <- readString(f)
 connectionTimeout <- readInt(f)
+
+# Don't use readString() so that we can provide a useful
+# error message if the R and Java versions are mismatched.
+authSecretLen = readInt(f)
+if (length(authSecretLen) == 0 || authSecretLen == 0) {
+  stop("Unexpected EOF in JVM connection data. Mismatched versions?")
+}
+authSecret <- readStringData(f, authSecretLen)
 close(f)
 file.remove(path)
 if (length(backendPort) == 0 || backendPort == 0 ||
 length(monitorPort) == 0 || monitorPort == 0 ||
-length(rLibPath) != 1) {
+length(rLibPath) != 1 || length(authSecret) == 0) {
   stop("JVM failed to launch")
 }
-assign(".monitorConn",
-   socketConnection(port = monitorPort, timeout = connectionTimeout),
-

[2/2] spark git commit: [PYSPARK] Update py4j to version 0.10.7.

2018-05-09 Thread vanzin
[PYSPARK] Update py4j to version 0.10.7.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc613b55
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc613b55
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc613b55

Branch: refs/heads/master
Commit: cc613b552e753d03cb62661591de59e1c8d82c74
Parents: 94155d0
Author: Marcelo Vanzin 
Authored: Fri Apr 13 14:28:24 2018 -0700
Committer: Marcelo Vanzin 
Committed: Wed May 9 10:47:35 2018 -0700

--
 LICENSE |   2 +-
 bin/pyspark |   6 +-
 bin/pyspark2.cmd|   2 +-
 core/pom.xml|   2 +-
 .../org/apache/spark/SecurityManager.scala  |  12 +--
 .../spark/api/python/PythonGatewayServer.scala  |  50 ++---
 .../org/apache/spark/api/python/PythonRDD.scala |  29 --
 .../apache/spark/api/python/PythonUtils.scala   |   2 +-
 .../spark/api/python/PythonWorkerFactory.scala  |  20 ++--
 .../org/apache/spark/deploy/PythonRunner.scala  |  12 ++-
 .../apache/spark/internal/config/package.scala  |   5 +
 .../spark/security/SocketAuthHelper.scala   | 101 +++
 .../scala/org/apache/spark/util/Utils.scala |  13 ++-
 .../spark/security/SocketAuthHelperSuite.scala  |  97 ++
 dev/deps/spark-deps-hadoop-2.6  |   2 +-
 dev/deps/spark-deps-hadoop-2.7  |   2 +-
 dev/deps/spark-deps-hadoop-3.1  |   2 +-
 dev/run-pip-tests   |   2 +-
 python/README.md|   2 +-
 python/docs/Makefile|   2 +-
 python/lib/py4j-0.10.6-src.zip  | Bin 80352 -> 0 bytes
 python/lib/py4j-0.10.7-src.zip  | Bin 0 -> 42437 bytes
 python/pyspark/context.py   |   4 +-
 python/pyspark/daemon.py|  21 +++-
 python/pyspark/java_gateway.py  |  93 ++---
 python/pyspark/rdd.py   |  21 ++--
 python/pyspark/sql/dataframe.py |  12 +--
 python/pyspark/worker.py|   7 +-
 python/setup.py |   2 +-
 .../org/apache/spark/deploy/yarn/Client.scala   |   2 +-
 .../spark/deploy/yarn/YarnClusterSuite.scala|   2 +-
 sbin/spark-config.sh|   2 +-
 .../scala/org/apache/spark/sql/Dataset.scala|   6 +-
 33 files changed, 417 insertions(+), 120 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cc613b55/LICENSE
--
diff --git a/LICENSE b/LICENSE
index c2b0d72..820f14d 100644
--- a/LICENSE
+++ b/LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at 
licenses/LICENSE-[project].txt.
  (New BSD license) Protocol Buffer Java API 
(org.spark-project.protobuf:protobuf-java:2.4.1-shaded - 
http://code.google.com/p/protobuf)
  (The BSD License) Fortran to Java ARPACK 
(net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
  (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - 
http://xmlenc.sourceforge.net)
- (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - 
http://py4j.sourceforge.net/)
+ (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.7 - 
http://py4j.sourceforge.net/)
  (Two-clause BSD-style license) JUnit-Interface 
(com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
  (BSD licence) sbt and sbt-launch-lib.bash
  (BSD 3 Clause) d3.min.js 
(https://github.com/mbostock/d3/blob/master/LICENSE)

http://git-wip-us.apache.org/repos/asf/spark/blob/cc613b55/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index dd28627..5d5affb 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -25,14 +25,14 @@ source "${SPARK_HOME}"/bin/load-spark-env.sh
 export _SPARK_CMD_USAGE="Usage: ./bin/pyspark [options]"
 
 # In Spark 2.0, IPYTHON and IPYTHON_OPTS are removed and pyspark fails to 
launch if either option
-# is set in the user's environment. Instead, users should set 
PYSPARK_DRIVER_PYTHON=ipython 
+# is set in the user's environment. Instead, users should set 
PYSPARK_DRIVER_PYTHON=ipython
 # to use IPython and set PYSPARK_DRIVER_PYTHON_OPTS to pass options when 
starting the Python driver
 # (e.g. PYSPARK_DRIVER_PYTHON_OPTS='notebook').  This supports full 
customization of the IPython
 # and executor Python executables.
 
 # Fail noisily if removed options are set
 if [[ -n "$IPYTHON" || -n "$IPYTHON_OPTS" ]]; then
-  echo "Error in pyspark startup:" 
+  echo "Error in pyspark startup:"
   echo "IPYTHON 

spark git commit: [MINOR][ML][DOC] Improved Naive Bayes user guide explanation

2018-05-09 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 6ea582e36 -> 94155d039


[MINOR][ML][DOC] Improved Naive Bayes user guide explanation

## What changes were proposed in this pull request?

This copies the material from the spark.mllib user guide page for Naive Bayes 
to the spark.ml user guide page.  I also improved the wording and organization 
slightly.

## How was this patch tested?

Built docs locally.

Author: Joseph K. Bradley 

Closes #21272 from jkbradley/nb-doc-update.
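
A minimal Scala sketch matching the expanded guide text, assuming a DataFrame `train` with a `label` column and non-negative `features` vectors (names are placeholders):

```
import org.apache.spark.ml.classification.NaiveBayes

val nb = new NaiveBayes()
  .setModelType("multinomial")  // or "bernoulli" for 0/1 term-presence features
  .setSmoothing(1.0)            // additive (Lidstone) smoothing
val model = nb.fit(train)
val predictions = model.transform(train)
```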


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94155d03
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94155d03
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94155d03

Branch: refs/heads/master
Commit: 94155d0395324a012db2fc8a57edb3cd90b61e96
Parents: 6ea582e
Author: Joseph K. Bradley 
Authored: Wed May 9 10:34:57 2018 -0700
Committer: Joseph K. Bradley 
Committed: Wed May 9 10:34:57 2018 -0700

--
 docs/ml-classification-regression.md | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/94155d03/docs/ml-classification-regression.md
--
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index d660655..b3d1090 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -455,11 +455,29 @@ Refer to the [Python API 
docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 ## Naive Bayes
 
 [Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) 
are a family of simple 
-probabilistic classifiers based on applying Bayes' theorem with strong (naive) 
independence 
-assumptions between the features. The `spark.ml` implementation currently 
supports both [multinomial
-naive 
Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
+probabilistic, multiclass classifiers based on applying Bayes' theorem with 
strong (naive) independence 
+assumptions between every pair of features.
+
+Naive Bayes can be trained very efficiently. With a single pass over the 
training data,
+it computes the conditional probability distribution of each feature given 
each label.
+For prediction, it applies Bayes' theorem to compute the conditional 
probability distribution
+of each label given an observation.
+
+MLlib supports both [multinomial naive 
Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
 and [Bernoulli naive 
Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
-More information can be found in the section on [Naive Bayes in 
MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib).
+
+*Input data*:
+These models are typically used for [document 
classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
+Within that context, each observation is a document and each feature 
represents a term.
+A feature's value is the frequency of the term (in multinomial Naive Bayes) or
+a zero or one indicating whether the term was found in the document (in 
Bernoulli Naive Bayes).
+Feature values must be *non-negative*. The model type is selected with an 
optional parameter
+"multinomial" or "bernoulli" with "multinomial" as the default.
+For document classification, the input feature vectors should usually be 
sparse vectors.
+Since the training data is only used once, it is not necessary to cache it.
+
+[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be 
used by
+setting the parameter $\lambda$ (default to $1.0$). 
 
 **Examples**
 





spark git commit: [SPARK-24181][SQL] Better error message for writing sorted data

2018-05-09 Thread dbtsai
Repository: spark
Updated Branches:
  refs/heads/master cac9b1dea -> 6ea582e36


[SPARK-24181][SQL] Better error message for writing sorted data

## What changes were proposed in this pull request?

The exception message should clearly distinguish sorting and bucketing in 
`save` and `jdbc` writes.

When a user tries to write sorted data using save or insertInto, it throws an 
exception with the message `s"'$operation' does not support bucketing right now"`.

We should throw `s"'$operation' does not support sortBy right now"` instead.

## How was this patch tested?

More tests in `DataFrameReaderWriterSuite.scala`

Author: DB Tsai 

Closes #21235 from dbtsai/fixException.
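
A sketch of the improved error, mirroring the new test below (assumes an active SparkSession and its implicits):

```
import spark.implicits._

val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j")
// Bucketed/sorted output is only supported by saveAsTable; save()/parquet()
// now report exactly which unsupported clause was used:
df.write.bucketBy(2, "i").sortBy("i").parquet("/tmp/path")
// org.apache.spark.sql.AnalysisException: 'save' does not support bucketBy and sortBy right now;
```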


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ea582e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ea582e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ea582e3

Branch: refs/heads/master
Commit: 6ea582e36ab0a2e4e01340f6fc8cfb8d493d567d
Parents: cac9b1d
Author: DB Tsai 
Authored: Wed May 9 09:15:16 2018 -0700
Committer: DB Tsai 
Committed: Wed May 9 09:15:16 2018 -0700

--
 .../org/apache/spark/sql/DataFrameWriter.scala  | 12 ++---
 .../spark/sql/sources/BucketedWriteSuite.scala  | 27 +---
 .../sql/test/DataFrameReaderWriterSuite.scala   | 16 ++--
 3 files changed, 46 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6ea582e3/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index e183fa6..90bea2d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -330,8 +330,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) 
{
   }
 
   private def getBucketSpec: Option[BucketSpec] = {
-if (sortColumnNames.isDefined) {
-  require(numBuckets.isDefined, "sortBy must be used together with 
bucketBy")
+if (sortColumnNames.isDefined && numBuckets.isEmpty) {
+  throw new AnalysisException("sortBy must be used together with bucketBy")
 }
 
 numBuckets.map { n =>
@@ -340,8 +340,12 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
   }
 
   private def assertNotBucketed(operation: String): Unit = {
-if (numBuckets.isDefined || sortColumnNames.isDefined) {
-  throw new AnalysisException(s"'$operation' does not support bucketing 
right now")
+if (getBucketSpec.isDefined) {
+  if (sortColumnNames.isEmpty) {
+throw new AnalysisException(s"'$operation' does not support bucketBy 
right now")
+  } else {
+throw new AnalysisException(s"'$operation' does not support bucketBy 
and sortBy right now")
+  }
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6ea582e3/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala
index 93f3efe..5ff1ea8 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala
@@ -60,7 +60,10 @@ abstract class BucketedWriteSuite extends QueryTest with 
SQLTestUtils {
 
   test("specify sorting columns without bucketing columns") {
 val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j")
-intercept[IllegalArgumentException](df.write.sortBy("j").saveAsTable("tt"))
+val e = intercept[AnalysisException] {
+  df.write.sortBy("j").saveAsTable("tt")
+}
+assert(e.getMessage == "sortBy must be used together with bucketBy;")
   }
 
   test("sorting by non-orderable column") {
@@ -74,7 +77,16 @@ abstract class BucketedWriteSuite extends QueryTest with 
SQLTestUtils {
 val e = intercept[AnalysisException] {
   df.write.bucketBy(2, "i").parquet("/tmp/path")
 }
-assert(e.getMessage == "'save' does not support bucketing right now;")
+assert(e.getMessage == "'save' does not support bucketBy right now;")
+  }
+
+  test("write bucketed and sorted data using save()") {
+val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j")
+
+val e = intercept[AnalysisException] {
+  df.write.bucketBy(2, "i").sortBy("i").parquet("/tmp/path")
+}
+assert(e.getMessage == "'save' does not support bucketBy and sortBy right 
now;")
   }
 
   test("write bucketed data 

svn commit: r26781 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_09_00_01-cac9b1d-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s

2018-05-09 Thread pwendell
Author: pwendell
Date: Wed May  9 07:17:02 2018
New Revision: 26781

Log:
Apache Spark 2.4.0-SNAPSHOT-2018_05_09_00_01-cac9b1d docs


[This commit notification would consist of 1461 parts, 
which exceeds the limit of 50 ones, so it was shortened to the summary.]
