spark git commit: [SPARK-22279][SQL] Enable `convertMetastoreOrc` by default
Repository: spark Updated Branches: refs/heads/master 62d01391f -> e3d434994 [SPARK-22279][SQL] Enable `convertMetastoreOrc` by default ## What changes were proposed in this pull request? We reverted `spark.sql.hive.convertMetastoreOrc` at https://github.com/apache/spark/pull/20536 because we should not ignore the table-specific compression conf. Now, it's resolved via [SPARK-23355](https://github.com/apache/spark/commit/8aa1d7b0ede5115297541d29eab4ce5f4fe905cb). ## How was this patch tested? Pass the Jenkins. Author: Dongjoon HyunCloses #21186 from dongjoon-hyun/SPARK-24112. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3d43499 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3d43499 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3d43499 Branch: refs/heads/master Commit: e3d434994733ae16e7e1424fb6de2d22b1a13f99 Parents: 62d0139 Author: Dongjoon Hyun Authored: Thu May 10 13:36:52 2018 +0800 Committer: Wenchen Fan Committed: Thu May 10 13:36:52 2018 +0800 -- docs/sql-programming-guide.md | 3 ++- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala | 3 +-- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e3d43499/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 3e8946e..3f79ed6 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -1017,7 +1017,7 @@ the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also Property NameDefaultMeaning spark.sql.orc.impl -hive +native The name of ORC implementation. It can be one of native and hive. native means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1. @@ -1813,6 +1813,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, the type coercion rules can automatically promote the argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest common type, no matter how the input arguments order. In prior Spark versions, the promotion could fail in some specific orders (e.g., TimestampType, IntegerType and StringType) and throw an exception. - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these 2 functions can return unexpected results. In version 2.4 and later, this problem has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the input timestamp string contains timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care about this problem and want to retain the previous behaivor to keep their query unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary w orkaround. - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. 
This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files. + - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior. ## Upgrading From Spark SQL 2.2 to 2.3
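Restating the ORC migration note above as a runnable sketch (assuming a Hive-enabled `SparkSession` named `spark`; the table name `t` follows the example in the note):

```scala
// Spark 2.4 default: the ORC Hive table is converted to Spark's native ORC data source,
// so reads can use the vectorized ORC reader instead of Hive SerDe.
spark.sql("CREATE TABLE t(id int) STORED AS ORC")
spark.sql("INSERT INTO t VALUES (1)")
spark.table("t").show()

// Opt out and fall back to the Spark 2.3 behavior (Hive SerDe handling):
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```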
[2/2] spark git commit: [SPARK-24073][SQL] Rename DataReaderFactory to InputPartition.
[SPARK-24073][SQL] Rename DataReaderFactory to InputPartition. ## What changes were proposed in this pull request? Renames: * `DataReaderFactory` to `InputPartition` * `DataReader` to `InputPartitionReader` * `createDataReaderFactories` to `planInputPartitions` * `createUnsafeDataReaderFactories` to `planUnsafeInputPartitions` * `createBatchDataReaderFactories` to `planBatchInputPartitions` This fixes the changes in SPARK-23219, which renamed ReadTask to DataReaderFactory. The intent of that change was to make the read and write API match (write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent. ReadTask/DataReader function as Iterable/Iterator. One InputPartition is a specific partition of the data to be read, in contrast to DataWriterFactory where the same factory instance is used in all write tasks. InputPartition's purpose is to manage the lifecycle of the associated reader, which is now called InputPartitionReader, with an explicit create operation to mirror the close operation. This was no longer clear from the API because DataReaderFactory appeared to be more generic than it is and it isn't clear why a set of them is produced for a read. ## How was this patch tested? Existing tests, which have been updated to use the new name. Author: Ryan BlueCloses #21145 from rdblue/SPARK-24073-revert-data-reader-factory-rename. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/62d01391 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/62d01391 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/62d01391 Branch: refs/heads/master Commit: 62d01391fee77eedd75b4e3f475ede8b9f0df0c2 Parents: 9341c95 Author: Ryan Blue Authored: Wed May 9 21:48:54 2018 -0700 Committer: gatorsmile Committed: Wed May 9 21:48:54 2018 -0700 -- .../sql/kafka010/KafkaContinuousReader.scala| 20 ++--- .../sql/kafka010/KafkaMicroBatchReader.scala| 21 +++--- .../sql/kafka010/KafkaSourceProvider.scala | 3 +- .../kafka010/KafkaMicroBatchSourceSuite.scala | 2 +- .../sql/sources/v2/MicroBatchReadSupport.java | 2 +- .../v2/reader/ContinuousDataReaderFactory.java | 35 - .../v2/reader/ContinuousInputPartition.java | 35 + .../spark/sql/sources/v2/reader/DataReader.java | 53 - .../sources/v2/reader/DataReaderFactory.java| 61 --- .../sql/sources/v2/reader/DataSourceReader.java | 10 +-- .../sql/sources/v2/reader/InputPartition.java | 61 +++ .../sources/v2/reader/InputPartitionReader.java | 53 + .../v2/reader/SupportsReportPartitioning.java | 2 +- .../v2/reader/SupportsScanColumnarBatch.java| 10 +-- .../v2/reader/SupportsScanUnsafeRow.java| 8 +- .../partitioning/ClusteredDistribution.java | 4 +- .../v2/reader/partitioning/Distribution.java| 6 +- .../v2/reader/partitioning/Partitioning.java| 4 +- .../reader/streaming/ContinuousDataReader.java | 36 - .../ContinuousInputPartitionReader.java | 36 + .../v2/reader/streaming/ContinuousReader.java | 10 +-- .../v2/reader/streaming/MicroBatchReader.java | 2 +- .../datasources/v2/DataSourceRDD.scala | 11 +-- .../datasources/v2/DataSourceV2ScanExec.scala | 46 ++-- .../continuous/ContinuousDataSourceRDD.scala| 22 +++--- .../continuous/ContinuousQueuedDataReader.scala | 13 ++-- .../continuous/ContinuousRateStreamSource.scala | 20 ++--- .../spark/sql/execution/streaming/memory.scala | 12 +-- .../sources/ContinuousMemoryStream.scala| 18 ++--- .../sources/RateStreamMicroBatchReader.scala| 15 ++-- .../execution/streaming/sources/socket.scala| 29 
.../sources/v2/JavaAdvancedDataSourceV2.java| 22 +++--- .../sql/sources/v2/JavaBatchDataSourceV2.java | 12 +-- .../v2/JavaPartitionAwareDataSource.java| 12 +-- .../v2/JavaSchemaRequiredDataSource.java| 4 +- .../sql/sources/v2/JavaSimpleDataSourceV2.java | 18 ++--- .../sources/v2/JavaUnsafeRowDataSourceV2.java | 16 ++-- .../sources/RateStreamProviderSuite.scala | 12 +-- .../sql/sources/v2/DataSourceV2Suite.scala | 78 ++-- .../sources/v2/SimpleWritableDataSource.scala | 14 ++-- .../sql/streaming/StreamingQuerySuite.scala | 11 +-- .../ContinuousQueuedDataReaderSuite.scala | 8 +- .../sources/StreamingDataSourceV2Suite.scala| 4 +- 43 files changed, 440 insertions(+), 431 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala -- diff --git
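To make the renaming above concrete, a minimal Scala sketch of the read-side contract (the `RangeInputPartition` class and its integer range are illustrative, not part of this commit):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{InputPartition, InputPartitionReader}

// An InputPartition describes one split of the data and owns the lifecycle of its
// reader, mirroring Iterable/Iterator; a DataSourceReader returns a java.util.List
// of these from planInputPartitions().
class RangeInputPartition(start: Int, end: Int) extends InputPartition[Row] {
  override def createPartitionReader(): InputPartitionReader[Row] =
    new InputPartitionReader[Row] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): Row = Row(current)
      override def close(): Unit = ()
    }
}
```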
[1/2] spark git commit: [SPARK-24073][SQL] Rename DataReaderFactory to InputPartition.
Repository: spark Updated Branches: refs/heads/master 9341c951e -> 62d01391f http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java index 048d078..80eeffd 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSchemaRequiredDataSource.java @@ -24,7 +24,7 @@ import org.apache.spark.sql.sources.v2.DataSourceOptions; import org.apache.spark.sql.sources.v2.DataSourceV2; import org.apache.spark.sql.sources.v2.ReadSupportWithSchema; import org.apache.spark.sql.sources.v2.reader.DataSourceReader; -import org.apache.spark.sql.sources.v2.reader.DataReaderFactory; +import org.apache.spark.sql.sources.v2.reader.InputPartition; import org.apache.spark.sql.types.StructType; public class JavaSchemaRequiredDataSource implements DataSourceV2, ReadSupportWithSchema { @@ -42,7 +42,7 @@ public class JavaSchemaRequiredDataSource implements DataSourceV2, ReadSupportWi } @Override -public ListcreateDataReaderFactories() { +public List planInputPartitions() { return java.util.Collections.emptyList(); } } http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java index 96f55b8..8522a63 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaSimpleDataSourceV2.java @@ -25,8 +25,8 @@ import org.apache.spark.sql.catalyst.expressions.GenericRow; import org.apache.spark.sql.sources.v2.DataSourceV2; import org.apache.spark.sql.sources.v2.DataSourceOptions; import org.apache.spark.sql.sources.v2.ReadSupport; -import org.apache.spark.sql.sources.v2.reader.DataReader; -import org.apache.spark.sql.sources.v2.reader.DataReaderFactory; +import org.apache.spark.sql.sources.v2.reader.InputPartitionReader; +import org.apache.spark.sql.sources.v2.reader.InputPartition; import org.apache.spark.sql.sources.v2.reader.DataSourceReader; import org.apache.spark.sql.types.StructType; @@ -41,25 +41,25 @@ public class JavaSimpleDataSourceV2 implements DataSourceV2, ReadSupport { } @Override -public List createDataReaderFactories() { +public List planInputPartitions() { return java.util.Arrays.asList( -new JavaSimpleDataReaderFactory(0, 5), -new JavaSimpleDataReaderFactory(5, 10)); +new JavaSimpleInputPartition(0, 5), +new JavaSimpleInputPartition(5, 10)); } } - static class JavaSimpleDataReaderFactory implements DataReaderFactory, DataReader { + static class JavaSimpleInputPartition implements InputPartition, InputPartitionReader { private int start; private int end; -JavaSimpleDataReaderFactory(int start, int end) { +JavaSimpleInputPartition(int start, int end) { this.start = start; this.end = end; } @Override -public DataReader createDataReader() { - return new JavaSimpleDataReaderFactory(start - 1, end); +public InputPartitionReader createPartitionReader() { + return new JavaSimpleInputPartition(start - 1, end); } @Override 
http://git-wip-us.apache.org/repos/asf/spark/blob/62d01391/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java -- diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java index c3916e0..3ad8e7a 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaUnsafeRowDataSourceV2.java @@ -38,20 +38,20 @@ public class JavaUnsafeRowDataSourceV2 implements DataSourceV2, ReadSupport { } @Override -public List createUnsafeRowReaderFactories() { +public List planUnsafeInputPartitions() { return java.util.Arrays.asList( -new JavaUnsafeRowDataReaderFactory(0, 5), -new
spark git commit: [SPARK-23852][SQL] Add test that fails if PARQUET-1217 is not fixed
Repository: spark Updated Branches: refs/heads/master 9e3bb3136 -> 9341c951e [SPARK-23852][SQL] Add test that fails if PARQUET-1217 is not fixed ## What changes were proposed in this pull request? Add a new test that triggers if PARQUET-1217 - a predicate pushdown bug - is not fixed in Spark's Parquet dependency. ## How was this patch tested? New unit test passes. Author: Henry RobinsonCloses #21284 from henryr/spark-23852. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9341c951 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9341c951 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9341c951 Branch: refs/heads/master Commit: 9341c951e85ff29714cbee302053872a6a4223da Parents: 9e3bb31 Author: Henry Robinson Authored: Wed May 9 19:56:03 2018 -0700 Committer: gatorsmile Committed: Wed May 9 19:56:03 2018 -0700 -- .../test/resources/test-data/parquet-1217.parquet| Bin 0 -> 321 bytes .../datasources/parquet/ParquetFilterSuite.scala | 10 ++ 2 files changed, 10 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9341c951/sql/core/src/test/resources/test-data/parquet-1217.parquet -- diff --git a/sql/core/src/test/resources/test-data/parquet-1217.parquet b/sql/core/src/test/resources/test-data/parquet-1217.parquet new file mode 100644 index 000..eb2dc4f Binary files /dev/null and b/sql/core/src/test/resources/test-data/parquet-1217.parquet differ http://git-wip-us.apache.org/repos/asf/spark/blob/9341c951/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala index 667e0b1..4d0ecde 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala @@ -648,6 +648,16 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } } + + test("SPARK-23852: Broken Parquet push-down for partially-written stats") { +// parquet-1217.parquet contains a single column with values -1, 0, 1, 2 and null. +// The row-group statistics include null counts, but not min and max values, which +// triggers PARQUET-1217. +val df = readResourceParquetFile("test-data/parquet-1217.parquet") + +// Will return 0 rows if PARQUET-1217 is not fixed. +assert(df.where("col > 0").count() === 2) + } } class NumRowGroupsAcc extends AccumulatorV2[Integer, Integer] { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
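Outside the suite, the new check can be sketched directly against the committed resource file (the relative path is taken from the diff; `readResourceParquetFile` in the test resolves it from the classpath):

```scala
// The file's row-group statistics carry null counts but no min/max values, which is
// the shape that trips PARQUET-1217 in a broken predicate push-down implementation.
val df = spark.read.parquet("sql/core/src/test/resources/test-data/parquet-1217.parquet")
println(df.where("col > 0").count())  // expected 2; an unfixed reader returns 0
```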
svn commit: r26820 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_09_16_01-9e3bb31-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed May 9 23:15:57 2018 New Revision: 26820 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_09_16_01-9e3bb31 docs [This commit notification would consist of 1461 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
spark git commit: [SPARK-24141][CORE] Fix bug in CoarseGrainedSchedulerBackend.killExecutors
Repository: spark Updated Branches: refs/heads/master fd1179c17 -> 9e3bb3136 [SPARK-24141][CORE] Fix bug in CoarseGrainedSchedulerBackend.killExecutors ## What changes were proposed in this pull request? In method *CoarseGrainedSchedulerBackend.killExecutors()*, `numPendingExecutors` should add `executorsToKill.size` rather than `knownExecutors.size` if we do not adjust target number of executors. ## How was this patch tested? N/A Author: wuyiCloses #21209 from Ngone51/SPARK-24141. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e3bb313 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e3bb313 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e3bb313 Branch: refs/heads/master Commit: 9e3bb313682bf88d0c81427167ee341698d29b02 Parents: fd1179c Author: wuyi Authored: Wed May 9 15:44:36 2018 -0700 Committer: Marcelo Vanzin Committed: Wed May 9 15:44:36 2018 -0700 -- .../spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9e3bb313/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala index 5627a55..d8794e8 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala @@ -633,7 +633,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp } doRequestTotalExecutors(requestedTotalExecutors) } else { - numPendingExecutors += knownExecutors.size + numPendingExecutors += executorsToKill.size Future.successful(true) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r26819 - in /dev/spark/2.3.1-SNAPSHOT-2018_05_09_14_01-8889d78-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed May 9 21:15:26 2018 New Revision: 26819 Log: Apache Spark 2.3.1-SNAPSHOT-2018_05_09_14_01-8889d78 docs [This commit notification would consist of 1443 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
svn commit: r26817 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_09_12_01-fd1179c-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed May 9 19:16:51 2018 New Revision: 26817 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_09_12_01-fd1179c docs [This commit notification would consist of 1461 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]
spark git commit: [SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation
Repository: spark Updated Branches: refs/heads/branch-2.3 aba52f449 -> 8889d7864 [SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation ## What changes were proposed in this pull request? We should overwrite "otherCopyArgs" to provide the SparkSession parameter otherwise TreeNode.toJSON cannot get the full constructor parameter list. ## How was this patch tested? The new unit test. Author: Shixiong ZhuCloses #21275 from zsxwing/SPARK-24214. (cherry picked from commit fd1179c17273283d32f275d5cd5f97aaa2aca1f7) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8889d786 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8889d786 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8889d786 Branch: refs/heads/branch-2.3 Commit: 8889d78643154a0eb5ce81363ba471a80a1e64f1 Parents: aba52f4 Author: Shixiong Zhu Authored: Wed May 9 11:32:17 2018 -0700 Committer: Shixiong Zhu Committed: Wed May 9 11:32:27 2018 -0700 -- .../sql/execution/streaming/StreamingRelation.scala | 3 +++ .../spark/sql/streaming/StreamingQuerySuite.scala| 15 +++ 2 files changed, 18 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8889d786/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala index f02d3a2..24195b5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala @@ -66,6 +66,7 @@ case class StreamingExecutionRelation( output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation { + override def otherCopyArgs: Seq[AnyRef] = session :: Nil override def isStreaming: Boolean = true override def toString: String = source.toString @@ -97,6 +98,7 @@ case class StreamingRelationV2( output: Seq[Attribute], v1Relation: Option[StreamingRelation])(session: SparkSession) extends LeafNode with MultiInstanceRelation { + override def otherCopyArgs: Seq[AnyRef] = session :: Nil override def isStreaming: Boolean = true override def toString: String = sourceName @@ -116,6 +118,7 @@ case class ContinuousExecutionRelation( output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation { + override def otherCopyArgs: Seq[AnyRef] = session :: Nil override def isStreaming: Boolean = true override def toString: String = source.toString http://git-wip-us.apache.org/repos/asf/spark/blob/8889d786/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala index 2b0ab33..e3429b5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala @@ -687,6 +687,21 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging wi CheckLastBatch(("A", 1))) } + test("StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation.toJSON " + +"should not fail") { +val df = spark.readStream.format("rate").load() 
+assert(df.logicalPlan.toJSON.contains("StreamingRelationV2")) + +testStream(df)( + AssertOnQuery(_.logicalPlan.toJSON.contains("StreamingExecutionRelation")) +) + +testStream(df, useV2Sink = true)( + StartStream(trigger = Trigger.Continuous(100)), + AssertOnQuery(_.logicalPlan.toJSON.contains("ContinuousExecutionRelation")) +) + } + /** Create a streaming DF that only execute one batch in which it returns the given static DF */ private def createSingleTriggerStreamingDF(triggerDF: DataFrame): DataFrame = { require(!triggerDF.isStreaming)
spark git commit: [SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation
Repository: spark Updated Branches: refs/heads/master 7aaa148f5 -> fd1179c17 [SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation ## What changes were proposed in this pull request? We should overwrite "otherCopyArgs" to provide the SparkSession parameter otherwise TreeNode.toJSON cannot get the full constructor parameter list. ## How was this patch tested? The new unit test. Author: Shixiong ZhuCloses #21275 from zsxwing/SPARK-24214. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fd1179c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fd1179c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fd1179c1 Branch: refs/heads/master Commit: fd1179c17273283d32f275d5cd5f97aaa2aca1f7 Parents: 7aaa148 Author: Shixiong Zhu Authored: Wed May 9 11:32:17 2018 -0700 Committer: Shixiong Zhu Committed: Wed May 9 11:32:17 2018 -0700 -- .../sql/execution/streaming/StreamingRelation.scala | 3 +++ .../spark/sql/streaming/StreamingQuerySuite.scala| 15 +++ 2 files changed, 18 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fd1179c1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala index f02d3a2..24195b5 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala @@ -66,6 +66,7 @@ case class StreamingExecutionRelation( output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation { + override def otherCopyArgs: Seq[AnyRef] = session :: Nil override def isStreaming: Boolean = true override def toString: String = source.toString @@ -97,6 +98,7 @@ case class StreamingRelationV2( output: Seq[Attribute], v1Relation: Option[StreamingRelation])(session: SparkSession) extends LeafNode with MultiInstanceRelation { + override def otherCopyArgs: Seq[AnyRef] = session :: Nil override def isStreaming: Boolean = true override def toString: String = sourceName @@ -116,6 +118,7 @@ case class ContinuousExecutionRelation( output: Seq[Attribute])(session: SparkSession) extends LeafNode with MultiInstanceRelation { + override def otherCopyArgs: Seq[AnyRef] = session :: Nil override def isStreaming: Boolean = true override def toString: String = source.toString http://git-wip-us.apache.org/repos/asf/spark/blob/fd1179c1/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala index 0cb2375..5798699 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala @@ -831,6 +831,21 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging wi CheckLastBatch(("A", 1))) } + test("StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation.toJSON " + +"should not fail") { +val df = spark.readStream.format("rate").load() +assert(df.logicalPlan.toJSON.contains("StreamingRelationV2")) + +testStream(df)( + 
AssertOnQuery(_.logicalPlan.toJSON.contains("StreamingExecutionRelation")) +) + +testStream(df, useV2Sink = true)( + StartStream(trigger = Trigger.Continuous(100)), + AssertOnQuery(_.logicalPlan.toJSON.contains("ContinuousExecutionRelation")) +) + } + /** Create a streaming DF that only execute one batch in which it returns the given static DF */ private def createSingleTriggerStreamingDF(triggerDF: DataFrame): DataFrame = { require(!triggerDF.isStreaming)
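As a quick illustration of what the fix enables, mirroring the new test from a user's perspective (using the public `queryExecution.logical` handle instead of the suite's internal `df.logicalPlan`):

```scala
// Before this patch, serializing a streaming plan to JSON threw because the curried
// SparkSession constructor argument was not exposed through otherCopyArgs.
val df = spark.readStream.format("rate").load()
val json = df.queryExecution.logical.toJSON
assert(json.contains("StreamingRelationV2"))
```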
spark git commit: [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs
Repository: spark Updated Branches: refs/heads/master 628c7b517 -> 7aaa148f5 [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs ## What changes were proposed in this pull request? Provide evaluateEachIteration method or equivalent for spark.ml GBTs. ## How was this patch tested? UT. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: WeichenXuCloses #21097 from WeichenXu123/GBTeval. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7aaa148f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7aaa148f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7aaa148f Branch: refs/heads/master Commit: 7aaa148f593470b2c32221b69097b8b54524eb74 Parents: 628c7b5 Author: WeichenXu Authored: Wed May 9 11:09:19 2018 -0700 Committer: Joseph K. Bradley Committed: Wed May 9 11:09:19 2018 -0700 -- .../spark/ml/classification/GBTClassifier.scala | 15 + .../spark/ml/regression/GBTRegressor.scala | 17 ++- .../org/apache/spark/ml/tree/treeParams.scala | 6 +++- .../ml/classification/GBTClassifierSuite.scala | 29 +- .../spark/ml/regression/GBTRegressorSuite.scala | 32 ++-- 5 files changed, 94 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 0aa24f0..3fb6d1e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala @@ -334,6 +334,21 @@ class GBTClassificationModel private[ml]( // hard coded loss, which is not meant to be changed in the model private val loss = getOldLossType + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. 
+ */ + @Since("2.4.0") + def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = { +val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { + case Row(label: Double, features: Vector) => LabeledPoint(label, features) +} +GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, loss, + OldAlgo.Classification +) + } + @Since("2.0.0") override def write: MLWriter = new GBTClassificationModel.GBTClassificationModelWriter(this) } http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala index 8598e80..d7e054b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala @@ -34,7 +34,7 @@ import org.apache.spark.ml.util.DefaultParamsReader.Metadata import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} import org.apache.spark.rdd.RDD -import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ /** @@ -269,6 +269,21 @@ class GBTRegressionModel private[ml]( new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights) } + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. + * @param loss The loss function used to compute error. Supported options: squared, absolute + */ + @Since("2.4.0") + def evaluateEachIteration(dataset: Dataset[_], loss: String): Array[Double] = { +val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { + case Row(label: Double, features: Vector) => LabeledPoint(label, features) +} +GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, + convertToOldLossType(loss), OldAlgo.Regression) + } + @Since("2.0.0") override def write: MLWriter = new GBTRegressionModel.GBTRegressionModelWriter(this) } http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
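A short usage sketch of the new method (the `train`/`validation` DataFrames with `label` and `features` columns are assumed):

```scala
import org.apache.spark.ml.regression.GBTRegressor

// Fit a boosted ensemble, then compute the validation loss after each boosting iteration;
// the index of the smallest loss suggests how many trees are worth keeping.
val model = new GBTRegressor().setMaxIter(50).fit(train)
val lossPerIteration: Array[Double] = model.evaluateEachIteration(validation, "squared")
val bestNumTrees = lossPerIteration.indexOf(lossPerIteration.min) + 1
```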
spark git commit: [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
Repository: spark Updated Branches: refs/heads/branch-2.2 866270ea5 -> f9d6a16ce [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 This PR aims to bump Py4J in order to fix the following float/double bug. Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6. **BEFORE** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() ++ |(id + 17.1335742042)| ++ | 17.1335742042| ++ ``` **AFTER** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +-+ |(id + 17.133574204226083)| +-+ | 17.133574204226083| +-+ ``` Manual. Author: Dongjoon HyunCloses #18546 from dongjoon-hyun/SPARK-21278. (cherry picked from commit c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9d6a16c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9d6a16c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9d6a16c Branch: refs/heads/branch-2.2 Commit: f9d6a16cebd55f4dcd1af102ad2fe7ebedee2e74 Parents: 866270e Author: Dongjoon Hyun Authored: Wed Jul 5 16:33:23 2017 -0700 Committer: Marcelo Vanzin Committed: Tue May 8 11:21:22 2018 -0700 -- LICENSE| 2 +- bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- python/README.md | 2 +- python/docs/Makefile | 2 +- python/lib/py4j-0.10.4-src.zip | Bin 74096 -> 0 bytes python/lib/py4j-0.10.6-src.zip | Bin 0 -> 80352 bytes python/setup.py| 2 +- .../org/apache/spark/deploy/yarn/Client.scala | 2 +- .../spark/deploy/yarn/YarnClusterSuite.scala | 2 +- sbin/spark-config.sh | 2 +- 15 files changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f9d6a16c/LICENSE -- diff --git a/LICENSE b/LICENSE index 66a2e8f..39fe0dc 100644 --- a/LICENSE +++ b/LICENSE @@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt. 
(New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf) (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net) (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net) - (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.4 - http://py4j.sourceforge.net/) + (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - http://py4j.sourceforge.net/) (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/) (BSD licence) sbt and sbt-launch-lib.bash (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE) http://git-wip-us.apache.org/repos/asf/spark/blob/f9d6a16c/bin/pyspark -- diff --git a/bin/pyspark b/bin/pyspark index 98387c2..d3b512e 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -57,7 +57,7 @@ export PYSPARK_PYTHON # Add the PySpark classes to the Python path: export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH" -export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH" +export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH" # Load the PySpark shell.py script when ./pyspark is used interactively: export OLD_PYTHONSTARTUP="$PYTHONSTARTUP" http://git-wip-us.apache.org/repos/asf/spark/blob/f9d6a16c/bin/pyspark2.cmd -- diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd index d1ce9da..663670f 100644 --- a/bin/pyspark2.cmd +++ b/bin/pyspark2.cmd @@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% -set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip;%PYTHONPATH% +set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.6-src.zip;%PYTHONPATH% set
spark git commit: [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
Repository: spark Updated Branches: refs/heads/branch-2.1 c4ecc04c6 -> 8177b2148 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 This PR aims to bump Py4J in order to fix the following float/double bug. Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6. **BEFORE** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() ++ |(id + 17.1335742042)| ++ | 17.1335742042| ++ ``` **AFTER** ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +-+ |(id + 17.133574204226083)| +-+ | 17.133574204226083| +-+ ``` Manual. Author: Dongjoon HyunCloses #18546 from dongjoon-hyun/SPARK-21278. (cherry picked from commit c8d0aba198c0f593c2b6b656c23b3d0fb7ea98a2) Signed-off-by: Marcelo Vanzin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8177b214 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8177b214 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8177b214 Branch: refs/heads/branch-2.1 Commit: 8177b214899320ea8cf18666f31e4653a42b Parents: c4ecc04 Author: Dongjoon Hyun Authored: Wed Jul 5 16:33:23 2017 -0700 Committer: Marcelo Vanzin Committed: Tue May 8 12:15:34 2018 -0700 -- LICENSE| 2 +- bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- python/README.md | 2 +- python/docs/Makefile | 2 +- python/lib/py4j-0.10.4-src.zip | Bin 74096 -> 0 bytes python/lib/py4j-0.10.6-src.zip | Bin 0 -> 80352 bytes python/setup.py| 2 +- sbin/spark-config.sh | 2 +- .../org/apache/spark/deploy/yarn/Client.scala | 2 +- .../spark/deploy/yarn/YarnClusterSuite.scala | 2 +- 15 files changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8177b214/LICENSE -- diff --git a/LICENSE b/LICENSE index 119ecbe..02aaef2 100644 --- a/LICENSE +++ b/LICENSE @@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt. 
(New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf) (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net) (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net) - (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.4 - http://py4j.sourceforge.net/) + (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - http://py4j.sourceforge.net/) (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/) (BSD licence) sbt and sbt-launch-lib.bash (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE) http://git-wip-us.apache.org/repos/asf/spark/blob/8177b214/bin/pyspark -- diff --git a/bin/pyspark b/bin/pyspark index 98387c2..d3b512e 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -57,7 +57,7 @@ export PYSPARK_PYTHON # Add the PySpark classes to the Python path: export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH" -export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH" +export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH" # Load the PySpark shell.py script when ./pyspark is used interactively: export OLD_PYTHONSTARTUP="$PYTHONSTARTUP" http://git-wip-us.apache.org/repos/asf/spark/blob/8177b214/bin/pyspark2.cmd -- diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd index f211c08..46d4d5c 100644 --- a/bin/pyspark2.cmd +++ b/bin/pyspark2.cmd @@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% -set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.4-src.zip;%PYTHONPATH% +set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.6-src.zip;%PYTHONPATH% set
[1/2] spark git commit: [SPARKR] Match pyspark features in SparkR communication protocol.
Repository: spark Updated Branches: refs/heads/master 94155d039 -> 628c7b517 [SPARKR] Match pyspark features in SparkR communication protocol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/628c7b51 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/628c7b51 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/628c7b51 Branch: refs/heads/master Commit: 628c7b517969c4a7ccb26ea67ab3dd61266073ca Parents: cc613b5 Author: Marcelo VanzinAuthored: Tue Apr 17 13:29:43 2018 -0700 Committer: Marcelo Vanzin Committed: Wed May 9 10:47:35 2018 -0700 -- R/pkg/R/client.R| 4 +- R/pkg/R/deserialize.R | 10 ++-- R/pkg/R/sparkR.R| 39 -- R/pkg/inst/worker/daemon.R | 4 +- R/pkg/inst/worker/worker.R | 5 +- .../org/apache/spark/api/r/RAuthHelper.scala| 38 ++ .../scala/org/apache/spark/api/r/RBackend.scala | 43 --- .../spark/api/r/RBackendAuthHandler.scala | 55 .../scala/org/apache/spark/api/r/RRunner.scala | 35 + .../scala/org/apache/spark/deploy/RRunner.scala | 6 ++- 10 files changed, 210 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/628c7b51/R/pkg/R/client.R -- diff --git a/R/pkg/R/client.R b/R/pkg/R/client.R index 9d82814..7244cc9 100644 --- a/R/pkg/R/client.R +++ b/R/pkg/R/client.R @@ -19,7 +19,7 @@ # Creates a SparkR client connection object # if one doesn't already exist -connectBackend <- function(hostname, port, timeout) { +connectBackend <- function(hostname, port, timeout, authSecret) { if (exists(".sparkRcon", envir = .sparkREnv)) { if (isOpen(.sparkREnv[[".sparkRCon"]])) { cat("SparkRBackend client connection already exists\n") @@ -29,7 +29,7 @@ connectBackend <- function(hostname, port, timeout) { con <- socketConnection(host = hostname, port = port, server = FALSE, blocking = TRUE, open = "wb", timeout = timeout) - + doServerAuth(con, authSecret) assign(".sparkRCon", con, envir = .sparkREnv) con } http://git-wip-us.apache.org/repos/asf/spark/blob/628c7b51/R/pkg/R/deserialize.R -- diff --git a/R/pkg/R/deserialize.R b/R/pkg/R/deserialize.R index a90f7d3..cb03f16 100644 --- a/R/pkg/R/deserialize.R +++ b/R/pkg/R/deserialize.R @@ -60,14 +60,18 @@ readTypedObject <- function(con, type) { stop(paste("Unsupported type for deserialization", type))) } -readString <- function(con) { - stringLen <- readInt(con) - raw <- readBin(con, raw(), stringLen, endian = "big") +readStringData <- function(con, len) { + raw <- readBin(con, raw(), len, endian = "big") string <- rawToChar(raw) Encoding(string) <- "UTF-8" string } +readString <- function(con) { + stringLen <- readInt(con) + readStringData(con, stringLen) +} + readInt <- function(con) { readBin(con, integer(), n = 1, endian = "big") } http://git-wip-us.apache.org/repos/asf/spark/blob/628c7b51/R/pkg/R/sparkR.R -- diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R index a480ac6..38ee794 100644 --- a/R/pkg/R/sparkR.R +++ b/R/pkg/R/sparkR.R @@ -158,6 +158,10 @@ sparkR.sparkContext <- function( " please use the --packages commandline instead", sep = ",")) } backendPort <- existingPort +authSecret <- Sys.getenv("SPARKR_BACKEND_AUTH_SECRET") +if (nchar(authSecret) == 0) { + stop("Auth secret not provided in environment.") +} } else { path <- tempfile(pattern = "backend_port") submitOps <- getClientModeSparkSubmitOpts( @@ -186,16 +190,27 @@ sparkR.sparkContext <- function( monitorPort <- readInt(f) rLibPath <- readString(f) connectionTimeout <- readInt(f) + +# Don't use readString() so that we can provide a useful +# error message if the R and Java 
versions are mismatched. +authSecretLen = readInt(f) +if (length(authSecretLen) == 0 || authSecretLen == 0) { + stop("Unexpected EOF in JVM connection data. Mismatched versions?") +} +authSecret <- readStringData(f, authSecretLen) close(f) file.remove(path) if (length(backendPort) == 0 || backendPort == 0 || length(monitorPort) == 0 || monitorPort == 0 || -length(rLibPath) != 1) { +length(rLibPath) != 1 || length(authSecret) == 0) { stop("JVM failed to launch") } -assign(".monitorConn", - socketConnection(port = monitorPort, timeout = connectionTimeout), -
[2/2] spark git commit: [PYSPARK] Update py4j to version 0.10.7.
[PYSPARK] Update py4j to version 0.10.7. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc613b55 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc613b55 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc613b55 Branch: refs/heads/master Commit: cc613b552e753d03cb62661591de59e1c8d82c74 Parents: 94155d0 Author: Marcelo VanzinAuthored: Fri Apr 13 14:28:24 2018 -0700 Committer: Marcelo Vanzin Committed: Wed May 9 10:47:35 2018 -0700 -- LICENSE | 2 +- bin/pyspark | 6 +- bin/pyspark2.cmd| 2 +- core/pom.xml| 2 +- .../org/apache/spark/SecurityManager.scala | 12 +-- .../spark/api/python/PythonGatewayServer.scala | 50 ++--- .../org/apache/spark/api/python/PythonRDD.scala | 29 -- .../apache/spark/api/python/PythonUtils.scala | 2 +- .../spark/api/python/PythonWorkerFactory.scala | 20 ++-- .../org/apache/spark/deploy/PythonRunner.scala | 12 ++- .../apache/spark/internal/config/package.scala | 5 + .../spark/security/SocketAuthHelper.scala | 101 +++ .../scala/org/apache/spark/util/Utils.scala | 13 ++- .../spark/security/SocketAuthHelperSuite.scala | 97 ++ dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- dev/deps/spark-deps-hadoop-3.1 | 2 +- dev/run-pip-tests | 2 +- python/README.md| 2 +- python/docs/Makefile| 2 +- python/lib/py4j-0.10.6-src.zip | Bin 80352 -> 0 bytes python/lib/py4j-0.10.7-src.zip | Bin 0 -> 42437 bytes python/pyspark/context.py | 4 +- python/pyspark/daemon.py| 21 +++- python/pyspark/java_gateway.py | 93 ++--- python/pyspark/rdd.py | 21 ++-- python/pyspark/sql/dataframe.py | 12 +-- python/pyspark/worker.py| 7 +- python/setup.py | 2 +- .../org/apache/spark/deploy/yarn/Client.scala | 2 +- .../spark/deploy/yarn/YarnClusterSuite.scala| 2 +- sbin/spark-config.sh| 2 +- .../scala/org/apache/spark/sql/Dataset.scala| 6 +- 33 files changed, 417 insertions(+), 120 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cc613b55/LICENSE -- diff --git a/LICENSE b/LICENSE index c2b0d72..820f14d 100644 --- a/LICENSE +++ b/LICENSE @@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt. (New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf) (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net) (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net) - (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.6 - http://py4j.sourceforge.net/) + (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.7 - http://py4j.sourceforge.net/) (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/) (BSD licence) sbt and sbt-launch-lib.bash (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE) http://git-wip-us.apache.org/repos/asf/spark/blob/cc613b55/bin/pyspark -- diff --git a/bin/pyspark b/bin/pyspark index dd28627..5d5affb 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -25,14 +25,14 @@ source "${SPARK_HOME}"/bin/load-spark-env.sh export _SPARK_CMD_USAGE="Usage: ./bin/pyspark [options]" # In Spark 2.0, IPYTHON and IPYTHON_OPTS are removed and pyspark fails to launch if either option -# is set in the user's environment. Instead, users should set PYSPARK_DRIVER_PYTHON=ipython +# is set in the user's environment. 
Instead, users should set PYSPARK_DRIVER_PYTHON=ipython # to use IPython and set PYSPARK_DRIVER_PYTHON_OPTS to pass options when starting the Python driver # (e.g. PYSPARK_DRIVER_PYTHON_OPTS='notebook'). This supports full customization of the IPython # and executor Python executables. # Fail noisily if removed options are set if [[ -n "$IPYTHON" || -n "$IPYTHON_OPTS" ]]; then - echo "Error in pyspark startup:" + echo "Error in pyspark startup:" echo "IPYTHON
spark git commit: [MINOR][ML][DOC] Improved Naive Bayes user guide explanation
Repository: spark Updated Branches: refs/heads/master 6ea582e36 -> 94155d039 [MINOR][ML][DOC] Improved Naive Bayes user guide explanation ## What changes were proposed in this pull request? This copies the material from the spark.mllib user guide page for Naive Bayes to the spark.ml user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. Author: Joseph K. BradleyCloses #21272 from jkbradley/nb-doc-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94155d03 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94155d03 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94155d03 Branch: refs/heads/master Commit: 94155d0395324a012db2fc8a57edb3cd90b61e96 Parents: 6ea582e Author: Joseph K. Bradley Authored: Wed May 9 10:34:57 2018 -0700 Committer: Joseph K. Bradley Committed: Wed May 9 10:34:57 2018 -0700 -- docs/ml-classification-regression.md | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/94155d03/docs/ml-classification-regression.md -- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index d660655..b3d1090 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -455,11 +455,29 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat ## Naive Bayes [Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple -probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence -assumptions between the features. The `spark.ml` implementation currently supports both [multinomial -naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html) +probabilistic, multiclass classifiers based on applying Bayes' theorem with strong (naive) independence +assumptions between every pair of features. + +Naive Bayes can be trained very efficiently. With a single pass over the training data, +it computes the conditional probability distribution of each feature given each label. +For prediction, it applies Bayes' theorem to compute the conditional probability distribution +of each label given an observation. + +MLlib supports both [multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes) and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html). -More information can be found in the section on [Naive Bayes in MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib). + +*Input data*: +These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html). +Within that context, each observation is a document and each feature represents a term. +A feature's value is the frequency of the term (in multinomial Naive Bayes) or +a zero or one indicating whether the term was found in the document (in Bernoulli Naive Bayes). +Feature values must be *non-negative*. The model type is selected with an optional parameter +"multinomial" or "bernoulli" with "multinomial" as the default. +For document classification, the input feature vectors should usually be sparse vectors. +Since the training data is only used once, it is not necessary to cache it. 
+ +[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by +setting the parameter $\lambda$ (default to $1.0$). **Examples**
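A minimal spark.ml sketch of the options described in the updated guide text, assuming a `training` DataFrame with `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.NaiveBayes

// modelType selects multinomial (term frequencies) or bernoulli (0/1 term presence);
// smoothing is the additive (Lidstone) smoothing parameter lambda, default 1.0.
val nb = new NaiveBayes()
  .setModelType("multinomial")
  .setSmoothing(1.0)
val model = nb.fit(training)
```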
spark git commit: [SPARK-24181][SQL] Better error message for writing sorted data
Repository: spark Updated Branches: refs/heads/master cac9b1dea -> 6ea582e36 [SPARK-24181][SQL] Better error message for writing sorted data ## What changes were proposed in this pull request? The exception message should clearly distinguish sorting and bucketing in `save` and `jdbc` write. When a user tries to write a sorted data using save or insertInto, it will throw an exception with message that `s"'$operation' does not support bucketing right now""`. We should throw `s"'$operation' does not support sortBy right now""` instead. ## How was this patch tested? More tests in `DataFrameReaderWriterSuite.scala` Author: DB TsaiCloses #21235 from dbtsai/fixException. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ea582e3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ea582e3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ea582e3 Branch: refs/heads/master Commit: 6ea582e36ab0a2e4e01340f6fc8cfb8d493d567d Parents: cac9b1d Author: DB Tsai Authored: Wed May 9 09:15:16 2018 -0700 Committer: DB Tsai Committed: Wed May 9 09:15:16 2018 -0700 -- .../org/apache/spark/sql/DataFrameWriter.scala | 12 ++--- .../spark/sql/sources/BucketedWriteSuite.scala | 27 +--- .../sql/test/DataFrameReaderWriterSuite.scala | 16 ++-- 3 files changed, 46 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6ea582e3/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index e183fa6..90bea2d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -330,8 +330,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } private def getBucketSpec: Option[BucketSpec] = { -if (sortColumnNames.isDefined) { - require(numBuckets.isDefined, "sortBy must be used together with bucketBy") +if (sortColumnNames.isDefined && numBuckets.isEmpty) { + throw new AnalysisException("sortBy must be used together with bucketBy") } numBuckets.map { n => @@ -340,8 +340,12 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } private def assertNotBucketed(operation: String): Unit = { -if (numBuckets.isDefined || sortColumnNames.isDefined) { - throw new AnalysisException(s"'$operation' does not support bucketing right now") +if (getBucketSpec.isDefined) { + if (sortColumnNames.isEmpty) { +throw new AnalysisException(s"'$operation' does not support bucketBy right now") + } else { +throw new AnalysisException(s"'$operation' does not support bucketBy and sortBy right now") + } } } http://git-wip-us.apache.org/repos/asf/spark/blob/6ea582e3/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala index 93f3efe..5ff1ea8 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedWriteSuite.scala @@ -60,7 +60,10 @@ abstract class BucketedWriteSuite extends QueryTest with SQLTestUtils { test("specify sorting columns without bucketing columns") { val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j") -intercept[IllegalArgumentException](df.write.sortBy("j").saveAsTable("tt")) 
+val e = intercept[AnalysisException] { + df.write.sortBy("j").saveAsTable("tt") +} +assert(e.getMessage == "sortBy must be used together with bucketBy;") } test("sorting by non-orderable column") { @@ -74,7 +77,16 @@ abstract class BucketedWriteSuite extends QueryTest with SQLTestUtils { val e = intercept[AnalysisException] { df.write.bucketBy(2, "i").parquet("/tmp/path") } -assert(e.getMessage == "'save' does not support bucketing right now;") +assert(e.getMessage == "'save' does not support bucketBy right now;") + } + + test("write bucketed and sorted data using save()") { +val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j") + +val e = intercept[AnalysisException] { + df.write.bucketBy(2, "i").sortBy("i").parquet("/tmp/path") +} +assert(e.getMessage == "'save' does not support bucketBy and sortBy right now;") } test("write bucketed data
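Seen from the user side, the sharpened messages look like this (column names and path follow the tests above; each write call throws an `AnalysisException`):

```scala
// assumes: import spark.implicits._
val df = Seq(1 -> "a", 2 -> "b").toDF("i", "j")

// sortBy without bucketBy is now rejected with a dedicated message:
df.write.sortBy("j").saveAsTable("tt")
//   AnalysisException: sortBy must be used together with bucketBy;

// save() still rejects bucketed/sorted output, but names the exact feature:
df.write.bucketBy(2, "i").sortBy("i").parquet("/tmp/path")
//   AnalysisException: 'save' does not support bucketBy and sortBy right now;
```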
svn commit: r26781 - in /dev/spark/2.4.0-SNAPSHOT-2018_05_09_00_01-cac9b1d-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Wed May 9 07:17:02 2018 New Revision: 26781 Log: Apache Spark 2.4.0-SNAPSHOT-2018_05_09_00_01-cac9b1d docs [This commit notification would consist of 1461 parts, which exceeds the limit of 50 ones, so it was shortened to the summary.]