spark git commit: [SPARK-15840][SQL] Add two missing options in documentation and some option related changes
Repository: spark Updated Branches: refs/heads/branch-2.0 ffbc6b796 -> d494a483a [SPARK-15840][SQL] Add two missing options in documentation and some option related changes ## What changes were proposed in this pull request? This PR 1. Adds the documentation for some missing options, `inferSchema` and `mergeSchema` for Python and Scala. 2. Fixes `[[DataFrame]]` to ```:class:`DataFrame` ``` so that this can be shown - from ![2016-06-09 9 31 16](https://cloud.githubusercontent.com/assets/6477701/15929721/8b864734-2e89-11e6-83f6-207527de4ac9.png) - to (with class link) ![2016-06-09 9 31 00](https://cloud.githubusercontent.com/assets/6477701/15929717/8a03d728-2e89-11e6-8a3f-08294964db22.png) (Please refer to [the latest documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html)) 3. Moves the `mergeSchema` option to `ParquetOptions`, removing the unused options `metastoreSchema` and `metastoreTableName`. They are not used anymore. They were removed in https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf and there are no remaining use cases, as shown below: ```bash grep -r -e METASTORE_SCHEMA -e \"metastoreSchema\" -e \"metastoreTableName\" -e METASTORE_TABLE_NAME . ``` ``` ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: private[sql] val METASTORE_SCHEMA = "metastoreSchema" ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: private[sql] val METASTORE_TABLE_NAME = "metastoreTableName" ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala: ParquetFileFormat.METASTORE_TABLE_NAME -> TableIdentifier( ``` It only sets `metastoreTableName` in the last case but does not use the table name. 4. Sets the correct default values (in the documentation) for the `compression` option for ORC (`snappy`, see [OrcOptions.scala#L33-L42](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala#L33-L42)) and Parquet (`the value specified in SQLConf`, see [ParquetOptions.scala#L38-L47](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala#L38-L47)) and `columnNameOfCorruptRecord` for JSON (`the value specified in SQLConf`, see [JsonFileFormat.scala#L53-L55](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L53-L55) and [JsonFileFormat.scala#L105-L106](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L105-L106)). ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon Author: Hyukjin Kwon Closes #13576 from HyukjinKwon/SPARK-15840.
(cherry picked from commit 9e204c62c6800e03759e04ef68268105d4b86bf2) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d494a483 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d494a483 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d494a483 Branch: refs/heads/branch-2.0 Commit: d494a483aef49766edf9c148dadb5e0c7351ca0d Parents: ffbc6b7 Author: hyukjinkwon Authored: Sat Jun 11 23:20:40 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 23:20:45 2016 -0700 -- python/pyspark/sql/readwriter.py| 40 +--- .../org/apache/spark/sql/DataFrameReader.scala | 18 ++--- .../org/apache/spark/sql/DataFrameWriter.scala | 11 +++--- .../datasources/parquet/ParquetFileFormat.scala | 19 ++ .../datasources/parquet/ParquetOptions.scala| 15 +++- .../spark/sql/hive/HiveMetastoreCatalog.scala | 12 ++ 6 files changed, 65 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d494a483/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 7d1f186..f3182b2 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -209,7 +209,8 @@ class DataFrameReader(object): :param columnNameOfCorruptRecord: allows renaming the new field having malformed string created by ``PERMISSIVE`` mode. This overrides ``spark.sql.columnNameOfCorruptRecord``. If None is set, -
spark git commit: [SPARK-15840][SQL] Add two missing options in documentation and some option related changes
Repository: spark Updated Branches: refs/heads/master e1f986c7a -> 9e204c62c [SPARK-15840][SQL] Add two missing options in documentation and some option related changes ## What changes were proposed in this pull request? This PR 1. Adds the documentation for some missing options, `inferSchema` and `mergeSchema` for Python and Scala. 2. Fixes `[[DataFrame]]` to ```:class:`DataFrame` ``` so that this can be shown - from ![2016-06-09 9 31 16](https://cloud.githubusercontent.com/assets/6477701/15929721/8b864734-2e89-11e6-83f6-207527de4ac9.png) - to (with class link) ![2016-06-09 9 31 00](https://cloud.githubusercontent.com/assets/6477701/15929717/8a03d728-2e89-11e6-8a3f-08294964db22.png) (Please refer to [the latest documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html)) 3. Moves the `mergeSchema` option to `ParquetOptions`, removing the unused options `metastoreSchema` and `metastoreTableName`. They are not used anymore. They were removed in https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf and there are no remaining use cases, as shown below: ```bash grep -r -e METASTORE_SCHEMA -e \"metastoreSchema\" -e \"metastoreTableName\" -e METASTORE_TABLE_NAME . ``` ``` ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: private[sql] val METASTORE_SCHEMA = "metastoreSchema" ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: private[sql] val METASTORE_TABLE_NAME = "metastoreTableName" ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala: ParquetFileFormat.METASTORE_TABLE_NAME -> TableIdentifier( ``` It only sets `metastoreTableName` in the last case but does not use the table name. 4. Sets the correct default values (in the documentation) for the `compression` option for ORC (`snappy`, see [OrcOptions.scala#L33-L42](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala#L33-L42)) and Parquet (`the value specified in SQLConf`, see [ParquetOptions.scala#L38-L47](https://github.com/apache/spark/blob/3ded5bc4db2badc9ff49554e73421021d854306b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala#L38-L47)) and `columnNameOfCorruptRecord` for JSON (`the value specified in SQLConf`, see [JsonFileFormat.scala#L53-L55](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L53-L55) and [JsonFileFormat.scala#L105-L106](https://github.com/apache/spark/blob/4538443e276597530a27c6922e48503677b13956/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala#L105-L106)). ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon Author: Hyukjin Kwon Closes #13576 from HyukjinKwon/SPARK-15840.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9e204c62 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9e204c62 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9e204c62 Branch: refs/heads/master Commit: 9e204c62c6800e03759e04ef68268105d4b86bf2 Parents: e1f986c Author: hyukjinkwon Authored: Sat Jun 11 23:20:40 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 23:20:40 2016 -0700 -- python/pyspark/sql/readwriter.py| 40 +--- .../org/apache/spark/sql/DataFrameReader.scala | 18 ++--- .../org/apache/spark/sql/DataFrameWriter.scala | 11 +++--- .../datasources/parquet/ParquetFileFormat.scala | 19 ++ .../datasources/parquet/ParquetOptions.scala| 15 +++- .../spark/sql/hive/HiveMetastoreCatalog.scala | 12 ++ 6 files changed, 65 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9e204c62/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 7d1f186..f3182b2 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -209,7 +209,8 @@ class DataFrameReader(object): :param columnNameOfCorruptRecord: allows renaming the new field having malformed string created by ``PERMISSIVE`` mode. This overrides ``spark.sql.columnNameOfCorruptRecord``. If None is set, - it uses the default value ``_corrupt_record``. + it u
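As a quick illustration, the reader/writer options documented by this patch (`inferSchema`, `mergeSchema`, `compression`) can be exercised directly from the DataFrame API. The following is a minimal, hedged Scala sketch: the application name and file paths are hypothetical placeholders, and writing ORC assumes a Hive-enabled build.

```scala
import org.apache.spark.sql.SparkSession

object ReaderOptionsExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical application; all paths below are placeholders.
    val spark = SparkSession.builder().appName("reader-options").getOrCreate()

    // `inferSchema` asks the CSV reader to derive column types from the data.
    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/people.csv")

    // `mergeSchema` merges Parquet schemas collected from all part-files;
    // when unset it falls back to spark.sql.parquet.mergeSchema in SQLConf.
    val events = spark.read
      .option("mergeSchema", "true")
      .parquet("/tmp/events.parquet")

    // `compression` defaults to snappy for ORC and to the SQLConf value for Parquet.
    events.write.option("compression", "snappy").orc("/tmp/events.orc")

    spark.stop()
  }
}
```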
spark git commit: [SPARK-15860] Metrics for codegen size and perf
Repository: spark Updated Branches: refs/heads/branch-2.0 796dd1514 -> ffbc6b796 [SPARK-15860] Metrics for codegen size and perf ## What changes were proposed in this pull request? Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get. To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv. ## How was this patch tested? Unit tests Author: Eric Liang Closes #13586 from ericl/spark-15860. (cherry picked from commit e1f986c7a3fcc3864d53ef99ef7f14fa4d262ac3) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ffbc6b79 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ffbc6b79 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ffbc6b79 Branch: refs/heads/branch-2.0 Commit: ffbc6b796591d3e1f3dcb950335871b7826e6b3b Parents: 796dd15 Author: Eric Liang Authored: Sat Jun 11 23:16:21 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 23:16:28 2016 -0700 -- .../apache/spark/metrics/MetricsSystem.scala| 3 +- .../spark/metrics/source/StaticSources.scala| 50 .../spark/metrics/MetricsSystemSuite.scala | 8 ++-- .../expressions/codegen/CodeGenerator.scala | 3 ++ .../expressions/CodeGenerationSuite.scala | 9 5 files changed, 68 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ffbc6b79/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala -- diff --git a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala index 0fed991..9b16c11 100644 --- a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala +++ b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala @@ -28,7 +28,7 @@ import org.eclipse.jetty.servlet.ServletContextHandler import org.apache.spark.{SecurityManager, SparkConf} import org.apache.spark.internal.Logging import org.apache.spark.metrics.sink.{MetricsServlet, Sink} -import org.apache.spark.metrics.source.Source +import org.apache.spark.metrics.source.{Source, StaticSources} import org.apache.spark.util.Utils /** @@ -96,6 +96,7 @@ private[spark] class MetricsSystem private ( def start() { require(!running, "Attempting to start a MetricsSystem that is already running") running = true +StaticSources.allSources.foreach(registerSource) registerSources() registerSinks() sinks.foreach(_.start) http://git-wip-us.apache.org/repos/asf/spark/blob/ffbc6b79/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala -- diff --git a/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala b/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala new file mode 100644 index 000..6819222 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.source + +import com.codahale.metrics.MetricRegistry + +import org.apache.spark.annotation.Experimental + +private[spark] object StaticSources { + /** + * The set of all static sources. These sources may be reported to from any class, including + * static classes, without requiring reference to a SparkEnv. + */ + val allSources = Seq(CodegenMetrics) +} + +/** + * :: Experimental :: + * Metrics for code generation. + */ +@Experimental +object CodegenMetrics extends Source { + override val sourceName: String = "CodeGenerator" + override val metricRegistry: MetricRegistry = new MetricRegistry() + + /** + * Histogram of the length of source code text compiled by CodeGenerator (in characters). + */ + val METRIC_S
spark git commit: [SPARK-15860] Metrics for codegen size and perf
Repository: spark Updated Branches: refs/heads/master 3fd2ff4dd -> e1f986c7a [SPARK-15860] Metrics for codegen size and perf ## What changes were proposed in this pull request? Adds codahale metrics for the codegen source text size and how long it takes to compile. The size is particularly interesting, since the JVM does have hard limits on how large methods can get. To simplify, I added the metrics under a statically-initialized source that is always registered with SparkEnv. ## How was this patch tested? Unit tests Author: Eric Liang Closes #13586 from ericl/spark-15860. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e1f986c7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e1f986c7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e1f986c7 Branch: refs/heads/master Commit: e1f986c7a3fcc3864d53ef99ef7f14fa4d262ac3 Parents: 3fd2ff4 Author: Eric Liang Authored: Sat Jun 11 23:16:21 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 23:16:21 2016 -0700 -- .../apache/spark/metrics/MetricsSystem.scala| 3 +- .../spark/metrics/source/StaticSources.scala| 50 .../spark/metrics/MetricsSystemSuite.scala | 8 ++-- .../expressions/codegen/CodeGenerator.scala | 3 ++ .../expressions/CodeGenerationSuite.scala | 9 5 files changed, 68 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e1f986c7/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala -- diff --git a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala index 0fed991..9b16c11 100644 --- a/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala +++ b/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala @@ -28,7 +28,7 @@ import org.eclipse.jetty.servlet.ServletContextHandler import org.apache.spark.{SecurityManager, SparkConf} import org.apache.spark.internal.Logging import org.apache.spark.metrics.sink.{MetricsServlet, Sink} -import org.apache.spark.metrics.source.Source +import org.apache.spark.metrics.source.{Source, StaticSources} import org.apache.spark.util.Utils /** @@ -96,6 +96,7 @@ private[spark] class MetricsSystem private ( def start() { require(!running, "Attempting to start a MetricsSystem that is already running") running = true +StaticSources.allSources.foreach(registerSource) registerSources() registerSinks() sinks.foreach(_.start) http://git-wip-us.apache.org/repos/asf/spark/blob/e1f986c7/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala -- diff --git a/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala b/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala new file mode 100644 index 000..6819222 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.metrics.source + +import com.codahale.metrics.MetricRegistry + +import org.apache.spark.annotation.Experimental + +private[spark] object StaticSources { + /** + * The set of all static sources. These sources may be reported to from any class, including + * static classes, without requiring reference to a SparkEnv. + */ + val allSources = Seq(CodegenMetrics) +} + +/** + * :: Experimental :: + * Metrics for code generation. + */ +@Experimental +object CodegenMetrics extends Source { + override val sourceName: String = "CodeGenerator" + override val metricRegistry: MetricRegistry = new MetricRegistry() + + /** + * Histogram of the length of source code text compiled by CodeGenerator (in characters). + */ + val METRIC_SOURCE_CODE_SIZE = metricRegistry.histogram(MetricRegistry.name("sourceCodeSize")) + + /** + * Histogra
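The pattern this patch introduces, a statically initialized Codahale source that any code path can update without holding a SparkEnv reference, can be sketched in a few lines of Scala. The `sourceCodeSize` metric name comes from the diff above; the `compilationTime` name and the `recordCompilation` call site are assumptions added for the sketch.

```scala
import com.codahale.metrics.MetricRegistry

object CodegenMetricsSketch {
  val metricRegistry = new MetricRegistry()

  // Histogram of generated source size, in characters (name taken from the patch).
  val sourceCodeSize = metricRegistry.histogram(MetricRegistry.name("sourceCodeSize"))

  // Histogram of time spent compiling generated source, in milliseconds (assumed name).
  val compilationTime = metricRegistry.histogram(MetricRegistry.name("compilationTime"))

  // Hypothetical call site: wrap a compile step and record both metrics.
  def recordCompilation(code: String)(compile: String => Unit): Unit = {
    val start = System.nanoTime()
    compile(code)
    sourceCodeSize.update(code.length)
    compilationTime.update((System.nanoTime() - start) / 1000000L)
  }
}
```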
spark git commit: Revert "[SPARK-14851][CORE] Support radix sort with nullable longs"
Repository: spark Updated Branches: refs/heads/branch-2.0 7e2bfff20 -> 796dd1514 Revert "[SPARK-14851][CORE] Support radix sort with nullable longs" This reverts commit beb75300455a4f92000b69e740256102d9f2d472. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/796dd151 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/796dd151 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/796dd151 Branch: refs/heads/branch-2.0 Commit: 796dd15142c00e96d2d7180f7909055a3eb1dfdf Parents: 7e2bfff Author: Reynold Xin Authored: Sat Jun 11 15:49:39 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:49:39 2016 -0700 -- .../util/collection/unsafe/sort/RadixSort.java | 24 - .../unsafe/sort/UnsafeExternalSorter.java | 11 ++-- .../unsafe/sort/UnsafeInMemorySorter.java | 56 .../unsafe/sort/UnsafeExternalSorterSuite.java | 26 - .../unsafe/sort/UnsafeInMemorySorterSuite.java | 2 +- .../collection/unsafe/sort/RadixSortSuite.scala | 4 +- .../sql/execution/UnsafeExternalRowSorter.java | 20 ++- .../sql/catalyst/expressions/SortOrder.scala| 40 ++ .../sql/execution/UnsafeKVExternalSorter.java | 11 ++-- .../apache/spark/sql/execution/SortExec.scala | 12 ++--- .../spark/sql/execution/SortPrefixUtils.scala | 32 --- .../apache/spark/sql/execution/WindowExec.scala | 4 +- .../execution/joins/CartesianProductExec.scala | 2 +- .../apache/spark/sql/execution/SortSuite.scala | 11 .../sql/execution/benchmark/SortBenchmark.scala | 2 +- 15 files changed, 79 insertions(+), 178 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/796dd151/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java index 4043617..4f3f0de 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java @@ -170,13 +170,9 @@ public class RadixSort { /** * Specialization of sort() for key-prefix arrays. In this type of array, each record consists * of two longs, only the second of which is sorted on. - * - * @param startIndex starting index in the array to sort from. This parameter is not supported - *in the plain sort() implementation. */ public static int sortKeyPrefixArray( LongArray array, - int startIndex, int numRecords, int startByteIndex, int endByteIndex, @@ -186,11 +182,10 @@ public class RadixSort { assert endByteIndex <= 7 : "endByteIndex (" + endByteIndex + ") should <= 7"; assert endByteIndex > startByteIndex; assert numRecords * 4 <= array.size(); -int inIndex = startIndex; -int outIndex = startIndex + numRecords * 2; +int inIndex = 0; +int outIndex = numRecords * 2; if (numRecords > 0) { - long[][] counts = getKeyPrefixArrayCounts( -array, startIndex, numRecords, startByteIndex, endByteIndex); + long[][] counts = getKeyPrefixArrayCounts(array, numRecords, startByteIndex, endByteIndex); for (int i = startByteIndex; i <= endByteIndex; i++) { if (counts[i] != null) { sortKeyPrefixArrayAtByte( @@ -210,14 +205,13 @@ public class RadixSort { * getCounts with some added parameters but that seems to hurt in benchmarks. 
*/ private static long[][] getKeyPrefixArrayCounts( - LongArray array, int startIndex, int numRecords, int startByteIndex, int endByteIndex) { + LongArray array, int numRecords, int startByteIndex, int endByteIndex) { long[][] counts = new long[8][]; long bitwiseMax = 0; long bitwiseMin = -1L; -long baseOffset = array.getBaseOffset() + startIndex * 8L; -long limit = baseOffset + numRecords * 16L; +long limit = array.getBaseOffset() + numRecords * 16; Object baseObject = array.getBaseObject(); -for (long offset = baseOffset; offset < limit; offset += 16) { +for (long offset = array.getBaseOffset(); offset < limit; offset += 16) { long value = Platform.getLong(baseObject, offset + 8); bitwiseMax |= value; bitwiseMin &= value; @@ -226,7 +220,7 @@ public class RadixSort { for (int i = startByteIndex; i <= endByteIndex; i++) { if (((bitsChanged >>> (i * 8)) & 0xff) != 0) { counts[i] = new long[256]; -for (long offset = baseOffset; offset < limit; offset += 16) { +for (long offset = array.getBaseOffset(); offset < limit; offset += 16) { counts[i][(int)((Pl
spark git commit: [SPARK-15807][SQL] Support varargs for dropDuplicates in Dataset/DataFrame
Repository: spark Updated Branches: refs/heads/branch-2.0 beb753004 -> 7e2bfff20 [SPARK-15807][SQL] Support varargs for dropDuplicates in Dataset/DataFrame ## What changes were proposed in this pull request? This PR adds `varargs`-types `dropDuplicates` functions in `Dataset/DataFrame`. Currently, `dropDuplicates` supports only `Seq` or `Array`. **Before** ```scala scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates(Seq("_1", "_2")) res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") :26: error: overloaded method value dropDuplicates with alternatives: (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] cannot be applied to (String, String) ds.dropDuplicates("_1", "_2") ^ ``` **After** ```scala scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] ``` ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun Closes #13545 from dongjoon-hyun/SPARK-15807. (cherry picked from commit 3fd2ff4dd85633af49865456a52bf0c09c99708b) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7e2bfff2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7e2bfff2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7e2bfff2 Branch: refs/heads/branch-2.0 Commit: 7e2bfff20c7278a20dca857cfd452b96d4d97c1a Parents: beb7530 Author: Dongjoon Hyun Authored: Sat Jun 11 15:47:51 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:47:57 2016 -0700 -- .../src/main/scala/org/apache/spark/sql/Dataset.scala | 13 + .../scala/org/apache/spark/sql/DataFrameSuite.scala| 4 .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 13 + 3 files changed, 30 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7e2bfff2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 16bbf30..5a67fc7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1834,6 +1834,19 @@ class Dataset[T] private[sql]( def dropDuplicates(colNames: Array[String]): Dataset[T] = dropDuplicates(colNames.toSeq) /** + * Returns a new [[Dataset]] with duplicate rows removed, considering only + * the subset of columns. + * + * @group typedrel + * @since 2.0.0 + */ + @scala.annotation.varargs + def dropDuplicates(col1: String, cols: String*): Dataset[T] = { +val colNames: Seq[String] = col1 +: cols +dropDuplicates(colNames) + } + + /** * Computes statistics for numeric columns, including count, mean, stddev, min, and max. * If no columns are given, this function computes statistics for all numerical columns. 
* http://git-wip-us.apache.org/repos/asf/spark/blob/7e2bfff2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index a02e48d..6bb0ce9 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -906,6 +906,10 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { checkAnswer( testData.dropDuplicates(Seq("value2")), Seq(Row(2, 1, 2), Row(1, 1, 1))) + +checkAnswer( + testData.dropDuplicates("key", "value1"), + Seq(Row(2, 1, 2), Row(1, 2, 1), Row(1, 1, 1), Row(2, 2, 2))) } test("SPARK-7150 range api") { http://git-wip-us.apache.org/repos/asf/spark/blob/7e2bfff2/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index 11b52bd..4536a73 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/sca
spark git commit: [SPARK-15807][SQL] Support varargs for dropDuplicates in Dataset/DataFrame
Repository: spark Updated Branches: refs/heads/master c06c58bbb -> 3fd2ff4dd [SPARK-15807][SQL] Support varargs for dropDuplicates in Dataset/DataFrame ## What changes were proposed in this pull request? This PR adds `varargs`-types `dropDuplicates` functions in `Dataset/DataFrame`. Currently, `dropDuplicates` supports only `Seq` or `Array`. **Before** ```scala scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates(Seq("_1", "_2")) res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") :26: error: overloaded method value dropDuplicates with alternatives: (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] cannot be applied to (String, String) ds.dropDuplicates("_1", "_2") ^ ``` **After** ```scala scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] ``` ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun Closes #13545 from dongjoon-hyun/SPARK-15807. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3fd2ff4d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3fd2ff4d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3fd2ff4d Branch: refs/heads/master Commit: 3fd2ff4dd85633af49865456a52bf0c09c99708b Parents: c06c58b Author: Dongjoon Hyun Authored: Sat Jun 11 15:47:51 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:47:51 2016 -0700 -- .../src/main/scala/org/apache/spark/sql/Dataset.scala | 13 + .../scala/org/apache/spark/sql/DataFrameSuite.scala| 4 .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 13 + 3 files changed, 30 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3fd2ff4d/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 16bbf30..5a67fc7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1834,6 +1834,19 @@ class Dataset[T] private[sql]( def dropDuplicates(colNames: Array[String]): Dataset[T] = dropDuplicates(colNames.toSeq) /** + * Returns a new [[Dataset]] with duplicate rows removed, considering only + * the subset of columns. + * + * @group typedrel + * @since 2.0.0 + */ + @scala.annotation.varargs + def dropDuplicates(col1: String, cols: String*): Dataset[T] = { +val colNames: Seq[String] = col1 +: cols +dropDuplicates(colNames) + } + + /** * Computes statistics for numeric columns, including count, mean, stddev, min, and max. * If no columns are given, this function computes statistics for all numerical columns. 
* http://git-wip-us.apache.org/repos/asf/spark/blob/3fd2ff4d/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index a02e48d..6bb0ce9 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -906,6 +906,10 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { checkAnswer( testData.dropDuplicates(Seq("value2")), Seq(Row(2, 1, 2), Row(1, 1, 1))) + +checkAnswer( + testData.dropDuplicates("key", "value1"), + Seq(Row(2, 1, 2), Row(1, 2, 1), Row(1, 1, 1), Row(2, 2, 2))) } test("SPARK-7150 range api") { http://git-wip-us.apache.org/repos/asf/spark/blob/3fd2ff4d/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala index 11b52bd..4536a73 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala @@ -806,6 +806,19 @@ class DatasetSuite extends QueryTest with
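The varargs overload added here follows a common Scala pattern: the `(col1, cols*)` method prepends its first argument and forwards to the existing Seq-based overload, while `@scala.annotation.varargs` also exposes a Java-friendly array signature. Below is a hedged sketch of that pattern on a hypothetical `Table` class, not Spark's `Dataset`.

```scala
// Hypothetical stand-in for Dataset, used only to illustrate the overload pattern.
class Table(val rows: Seq[Map[String, Any]]) {
  // Existing Seq-based overload: keep one row per distinct combination of the given columns.
  def dropDuplicates(colNames: Seq[String]): Table =
    new Table(rows.map(r => colNames.map(r.get) -> r).toMap.values.toSeq)

  // New varargs overload, mirroring the patch: prepend the first column and delegate.
  @scala.annotation.varargs
  def dropDuplicates(col1: String, cols: String*): Table =
    dropDuplicates(col1 +: cols)
}
```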
spark git commit: [SPARK-14851][CORE] Support radix sort with nullable longs
Repository: spark Updated Branches: refs/heads/branch-2.0 0cf31f0c8 -> beb753004 [SPARK-14851][CORE] Support radix sort with nullable longs ## What changes were proposed in this pull request? This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort. This strategy for nulls does mean the sort is no longer stable. cc davies ## How was this patch tested? Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts. Some test queries (best of 5 runs each). Before change: scala> val start = System.nanoTime; spark.range(500).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190437233227987 res3: Double = 4716.471091 After change: scala> val start = System.nanoTime; spark.range(500).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190367870952791 res4: Double = 2981.143045 Author: Eric Liang Closes #13161 from ericl/sc-2998. (cherry picked from commit c06c582de0c22cfc70c486d23a94c3079ba4) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/beb75300 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/beb75300 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/beb75300 Branch: refs/heads/branch-2.0 Commit: beb75300455a4f92000b69e740256102d9f2d472 Parents: 0cf31f0 Author: Eric Liang Authored: Sat Jun 11 15:42:58 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:43:03 2016 -0700 -- .../util/collection/unsafe/sort/RadixSort.java | 24 + .../unsafe/sort/UnsafeExternalSorter.java | 11 ++-- .../unsafe/sort/UnsafeInMemorySorter.java | 56 .../unsafe/sort/UnsafeExternalSorterSuite.java | 26 - .../unsafe/sort/UnsafeInMemorySorterSuite.java | 2 +- .../collection/unsafe/sort/RadixSortSuite.scala | 4 +- .../sql/execution/UnsafeExternalRowSorter.java | 20 +-- .../sql/catalyst/expressions/SortOrder.scala| 40 -- .../sql/execution/UnsafeKVExternalSorter.java | 11 ++-- .../apache/spark/sql/execution/SortExec.scala | 12 +++-- .../spark/sql/execution/SortPrefixUtils.scala | 32 +++ .../apache/spark/sql/execution/WindowExec.scala | 4 +- .../execution/joins/CartesianProductExec.scala | 2 +- .../apache/spark/sql/execution/SortSuite.scala | 11 .../sql/execution/benchmark/SortBenchmark.scala | 2 +- 15 files changed, 178 insertions(+), 79 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/beb75300/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java index 4f3f0de..4043617 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java @@ -170,9 +170,13 @@ public class RadixSort { /** * Specialization of sort() for key-prefix arrays. 
In this type of array, each record consists * of two longs, only the second of which is sorted on. + * + * @param startIndex starting index in the array to sort from. This parameter is not supported + *in the plain sort() implementation. */ public static int sortKeyPrefixArray( LongArray array, + int startIndex, int numRecords, int startByteIndex, int endByteIndex, @@ -182,10 +186,11 @@ public class RadixSort { assert endByteIndex <= 7 : "endByteIndex (" + endByteIndex + ") should <= 7"; assert endByteIndex > startByteIndex; assert numRecords * 4 <= array.size(); -int inIndex = 0; -int outIndex = numRecords * 2; +int inIndex = startIndex; +int outIndex = startIndex + numRecords * 2; if (numRecords > 0) { - long[][] counts = getKeyPrefixArrayCounts(array, numRecords, startByteIndex, endByteIndex); + long[][] counts = getKeyPrefixArrayCounts( +array, startIndex, numRecords, startByteIndex, endByteIndex); for (int i = startByteIndex; i <= endByteIndex; i++) { if (co
spark git commit: [SPARK-14851][CORE] Support radix sort with nullable longs
Repository: spark Updated Branches: refs/heads/master 75705e8db -> c06c58bbb [SPARK-14851][CORE] Support radix sort with nullable longs ## What changes were proposed in this pull request? This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort. This strategy for nulls does mean the sort is no longer stable. cc davies ## How was this patch tested? Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts. Some test queries (best of 5 runs each). Before change: scala> val start = System.nanoTime; spark.range(500).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190437233227987 res3: Double = 4716.471091 After change: scala> val start = System.nanoTime; spark.range(500).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190367870952791 res4: Double = 2981.143045 Author: Eric Liang Closes #13161 from ericl/sc-2998. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c06c58bb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c06c58bb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c06c58bb Branch: refs/heads/master Commit: c06c582de0c22cfc70c486d23a94c3079ba4 Parents: 75705e8 Author: Eric Liang Authored: Sat Jun 11 15:42:58 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:42:58 2016 -0700 -- .../util/collection/unsafe/sort/RadixSort.java | 24 + .../unsafe/sort/UnsafeExternalSorter.java | 11 ++-- .../unsafe/sort/UnsafeInMemorySorter.java | 56 .../unsafe/sort/UnsafeExternalSorterSuite.java | 26 - .../unsafe/sort/UnsafeInMemorySorterSuite.java | 2 +- .../collection/unsafe/sort/RadixSortSuite.scala | 4 +- .../sql/execution/UnsafeExternalRowSorter.java | 20 +-- .../sql/catalyst/expressions/SortOrder.scala| 40 -- .../sql/execution/UnsafeKVExternalSorter.java | 11 ++-- .../apache/spark/sql/execution/SortExec.scala | 12 +++-- .../spark/sql/execution/SortPrefixUtils.scala | 32 +++ .../apache/spark/sql/execution/WindowExec.scala | 4 +- .../execution/joins/CartesianProductExec.scala | 2 +- .../apache/spark/sql/execution/SortSuite.scala | 11 .../sql/execution/benchmark/SortBenchmark.scala | 2 +- 15 files changed, 178 insertions(+), 79 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c06c58bb/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java index 4f3f0de..4043617 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java @@ -170,9 +170,13 @@ public class RadixSort { /** * Specialization of sort() for key-prefix arrays. In this type of array, each record consists * of two longs, only the second of which is sorted on. 
+ * + * @param startIndex starting index in the array to sort from. This parameter is not supported + *in the plain sort() implementation. */ public static int sortKeyPrefixArray( LongArray array, + int startIndex, int numRecords, int startByteIndex, int endByteIndex, @@ -182,10 +186,11 @@ public class RadixSort { assert endByteIndex <= 7 : "endByteIndex (" + endByteIndex + ") should <= 7"; assert endByteIndex > startByteIndex; assert numRecords * 4 <= array.size(); -int inIndex = 0; -int outIndex = numRecords * 2; +int inIndex = startIndex; +int outIndex = startIndex + numRecords * 2; if (numRecords > 0) { - long[][] counts = getKeyPrefixArrayCounts(array, numRecords, startByteIndex, endByteIndex); + long[][] counts = getKeyPrefixArrayCounts( +array, startIndex, numRecords, startByteIndex, endByteIndex); for (int i = startByteIndex; i <= endByteIndex; i++) { if (counts[i] != null) { sortKeyPrefixArrayAtByte( @@ -205,13 +210,14 @@ public class RadixSort {
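The null-handling strategy described above (keep null keys in a separate region of the buffer so the key sort never has to encode them) can be sketched at a high level. This is a conceptual sketch only: a plain comparison sort stands in for the off-heap radix sort, and the real patch works on (prefix, pointer) pairs in a `LongArray` rather than boxed values.

```scala
object NullableSortSketch {
  // Sort nullable boxed longs: nulls are split into their own region first,
  // so the key sort (radix sort in the actual patch) only ever sees real values.
  def sortNullableLongs(keys: Array[java.lang.Long], nullsFirst: Boolean): Array[java.lang.Long] = {
    val (nulls, nonNulls) = keys.partition(_ == null)
    val sortedNonNulls = nonNulls.sortBy(_.longValue())
    if (nullsFirst) nulls ++ sortedNonNulls else sortedNonNulls ++ nulls
  }
}
```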
spark git commit: [SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range
Repository: spark Updated Branches: refs/heads/branch-2.0 304ec5de3 -> 0cf31f0c8 [SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range ## What changes were proposed in this pull request? It's easy for users to call `range(...).as[Long]` to get typed Dataset, and don't worth an API breaking change. This PR reverts it. ## How was this patch tested? N/A Author: Wenchen Fan Closes #13605 from cloud-fan/range. (cherry picked from commit 75705e8dbb51ac91ffc7012fa67f072494c13832) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0cf31f0c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0cf31f0c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0cf31f0c Branch: refs/heads/branch-2.0 Commit: 0cf31f0c8486ac3f8efca84bcfec75c2d0dd738a Parents: 304ec5d Author: Wenchen Fan Authored: Sat Jun 11 15:28:40 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:28:45 2016 -0700 -- .../scala/org/apache/spark/sql/SQLContext.scala | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0cf31f0c/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index 23f2b6e..6fcc9bb 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -609,51 +609,51 @@ class SQLContext private[sql](val sparkSession: SparkSession) /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements * in a range from 0 to `end` (exclusive) with step value 1. * - * @since 2.0.0 - * @group dataset + * @since 1.4.1 + * @group dataframe */ @Experimental - def range(end: Long): Dataset[java.lang.Long] = sparkSession.range(end) + def range(end: Long): DataFrame = sparkSession.range(end).toDF() /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements * in a range from `start` to `end` (exclusive) with step value 1. * - * @since 2.0.0 - * @group dataset + * @since 1.4.0 + * @group dataframe */ @Experimental - def range(start: Long, end: Long): Dataset[java.lang.Long] = sparkSession.range(start, end) + def range(start: Long, end: Long): DataFrame = sparkSession.range(start, end).toDF() /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements * in a range from `start` to `end` (exclusive) with a step value. 
* * @since 2.0.0 - * @group dataset + * @group dataframe */ @Experimental - def range(start: Long, end: Long, step: Long): Dataset[java.lang.Long] = { -sparkSession.range(start, end, step) + def range(start: Long, end: Long, step: Long): DataFrame = { +sparkSession.range(start, end, step).toDF() } /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements - * in a range from `start` to `end` (exclusive) with a step value, with partition number + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements + * in an range from `start` to `end` (exclusive) with an step value, with partition number * specified. * - * @since 2.0.0 - * @group dataset + * @since 1.4.0 + * @group dataframe */ @Experimental - def range(start: Long, end: Long, step: Long, numPartitions: Int): Dataset[java.lang.Long] = { -sparkSession.range(start, end, step, numPartitions) + def range(start: Long, end: Long, step: Long, numPartitions: Int): DataFrame = { +sparkSession.range(start, end, step, numPartitions).toDF() } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range
Repository: spark Updated Branches: refs/heads/master 5bb4564cd -> 75705e8db [SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range ## What changes were proposed in this pull request? It's easy for users to call `range(...).as[Long]` to get typed Dataset, and don't worth an API breaking change. This PR reverts it. ## How was this patch tested? N/A Author: Wenchen Fan Closes #13605 from cloud-fan/range. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75705e8d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75705e8d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75705e8d Branch: refs/heads/master Commit: 75705e8dbb51ac91ffc7012fa67f072494c13832 Parents: 5bb4564 Author: Wenchen Fan Authored: Sat Jun 11 15:28:40 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:28:40 2016 -0700 -- .../scala/org/apache/spark/sql/SQLContext.scala | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/75705e8d/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index 23f2b6e..6fcc9bb 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -609,51 +609,51 @@ class SQLContext private[sql](val sparkSession: SparkSession) /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements * in a range from 0 to `end` (exclusive) with step value 1. * - * @since 2.0.0 - * @group dataset + * @since 1.4.1 + * @group dataframe */ @Experimental - def range(end: Long): Dataset[java.lang.Long] = sparkSession.range(end) + def range(end: Long): DataFrame = sparkSession.range(end).toDF() /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements * in a range from `start` to `end` (exclusive) with step value 1. * - * @since 2.0.0 - * @group dataset + * @since 1.4.0 + * @group dataframe */ @Experimental - def range(start: Long, end: Long): Dataset[java.lang.Long] = sparkSession.range(start, end) + def range(start: Long, end: Long): DataFrame = sparkSession.range(start, end).toDF() /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements * in a range from `start` to `end` (exclusive) with a step value. 
* * @since 2.0.0 - * @group dataset + * @group dataframe */ @Experimental - def range(start: Long, end: Long, step: Long): Dataset[java.lang.Long] = { -sparkSession.range(start, end, step) + def range(start: Long, end: Long, step: Long): DataFrame = { +sparkSession.range(start, end, step).toDF() } /** * :: Experimental :: - * Creates a [[Dataset]] with a single [[LongType]] column named `id`, containing elements - * in a range from `start` to `end` (exclusive) with a step value, with partition number + * Creates a [[DataFrame]] with a single [[LongType]] column named `id`, containing elements + * in an range from `start` to `end` (exclusive) with an step value, with partition number * specified. * - * @since 2.0.0 - * @group dataset + * @since 1.4.0 + * @group dataframe */ @Experimental - def range(start: Long, end: Long, step: Long, numPartitions: Int): Dataset[java.lang.Long] = { -sparkSession.range(start, end, step, numPartitions) + def range(start: Long, end: Long, step: Long, numPartitions: Int): DataFrame = { +sparkSession.range(start, end, step, numPartitions).toDF() } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
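The usage the revert preserves is easy to show from spark-shell (Spark 2.0), where `spark` and `spark.implicits._` are already in scope: `SQLContext.range` is untyped again, and a typed `Dataset[Long]` remains one `.as[Long]` away, as the PR description notes. A minimal sketch:

```scala
// spark-shell sketch; `spark` is the predefined SparkSession.
val df = spark.sqlContext.range(0, 500)   // org.apache.spark.sql.DataFrame after this revert
val ds = spark.range(0, 500).as[Long]     // typed Dataset[Long] via SparkSession
val total = ds.reduce(_ + _)              // typed operations still available
```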
spark git commit: [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark
Repository: spark Updated Branches: refs/heads/branch-2.0 4c7b208ab -> 304ec5de3 [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark ## What changes were proposed in this pull request? These were not updated after performance improvements. To make updating them easier, I also moved the results from inline comments out into a file, which is auto-generated when the benchmark is re-run. Author: Eric Liang Closes #13607 from ericl/sc-3538. (cherry picked from commit 5bb4564cd47c8bf06409287e0de4ec45609970b2) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/304ec5de Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/304ec5de Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/304ec5de Branch: refs/heads/branch-2.0 Commit: 304ec5de34a998f83db5e565b80622184d68e7f7 Parents: 4c7b208 Author: Eric Liang Authored: Sat Jun 11 15:26:08 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:26:13 2016 -0700 -- project/SparkBuild.scala| 2 +- .../benchmarks/WideSchemaBenchmark-results.txt | 93 +++ sql/core/src/test/resources/log4j.properties| 2 +- .../benchmark/WideSchemaBenchmark.scala | 260 ++- 4 files changed, 123 insertions(+), 234 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/304ec5de/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 304288a..2f7da31 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -833,7 +833,7 @@ object TestSettings { javaOptions in Test += "-Dspark.ui.enabled=false", javaOptions in Test += "-Dspark.ui.showConsoleProgress=false", javaOptions in Test += "-Dspark.unsafe.exceptionOnMemoryLeak=true", -javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true", +javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=false", javaOptions in Test += "-Dderby.system.durability=test", javaOptions in Test ++= System.getProperties.asScala.filter(_._1.startsWith("spark")) .map { case (k,v) => s"-D$k=$v" }.toSeq, http://git-wip-us.apache.org/repos/asf/spark/blob/304ec5de/sql/core/benchmarks/WideSchemaBenchmark-results.txt -- diff --git a/sql/core/benchmarks/WideSchemaBenchmark-results.txt b/sql/core/benchmarks/WideSchemaBenchmark-results.txt new file mode 100644 index 000..ea6a661 --- /dev/null +++ b/sql/core/benchmarks/WideSchemaBenchmark-results.txt @@ -0,0 +1,93 @@ +OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on Linux 4.2.0-36-generic +Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz +parsing large select:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +1 select expressions 3 /5 0.0 2967064.0 1.0X +100 select expressions 11 / 12 0.0 11369518.0 0.3X +2500 select expressions243 / 250 0.0 242561004.0 0.0X + +OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on Linux 4.2.0-36-generic +Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz +many column field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +1 cols x 10 rows (read in-mem) 28 / 40 3.6 278.8 1.0X +1 cols x 10 rows (exec in-mem) 28 / 42 3.5 284.0 1.0X +1 cols x 10 rows (read parquet) 23 / 35 4.4 228.8 1.2X +1 cols x 10 rows (write parquet) 163 / 182 0.6 1633.0 0.2X +100 cols x 1000 rows (read in-mem) 27 / 39 3.7 266.9 1.0X +100 cols x 1000 rows (exec in-mem) 48 / 79 2.1 481.7 0.6X +100 cols x 1000 rows (read parquet) 25 / 36 3.9 254.3 1.1X +100 cols x 1000 rows (write parquet) 182 / 196 0.5 1819.5 0.2X +2500 cols x 40 rows (read in-mem) 280 / 315 0.4 2797.1 0.1X +2500 cols x 40 rows (exec in-mem) 606 / 638 0.2 
6064.3 0.0X +2500 cols x 40 rows (read parquet) 836 / 843 0.1 8356.4 0.0X +2500 cols x 40 rows (write parquet)490 / 522 0.2 4900.6 0.1X + +OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on
spark git commit: [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark
Repository: spark Updated Branches: refs/heads/master cb5d933d8 -> 5bb4564cd [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark ## What changes were proposed in this pull request? These were not updated after performance improvements. To make updating them easier, I also moved the results from inline comments out into a file, which is auto-generated when the benchmark is re-run. Author: Eric Liang Closes #13607 from ericl/sc-3538. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5bb4564c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5bb4564c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5bb4564c Branch: refs/heads/master Commit: 5bb4564cd47c8bf06409287e0de4ec45609970b2 Parents: cb5d933 Author: Eric Liang Authored: Sat Jun 11 15:26:08 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:26:08 2016 -0700 -- project/SparkBuild.scala| 2 +- .../benchmarks/WideSchemaBenchmark-results.txt | 93 +++ sql/core/src/test/resources/log4j.properties| 2 +- .../benchmark/WideSchemaBenchmark.scala | 260 ++- 4 files changed, 123 insertions(+), 234 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5bb4564c/project/SparkBuild.scala -- diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index 304288a..2f7da31 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -833,7 +833,7 @@ object TestSettings { javaOptions in Test += "-Dspark.ui.enabled=false", javaOptions in Test += "-Dspark.ui.showConsoleProgress=false", javaOptions in Test += "-Dspark.unsafe.exceptionOnMemoryLeak=true", -javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true", +javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=false", javaOptions in Test += "-Dderby.system.durability=test", javaOptions in Test ++= System.getProperties.asScala.filter(_._1.startsWith("spark")) .map { case (k,v) => s"-D$k=$v" }.toSeq, http://git-wip-us.apache.org/repos/asf/spark/blob/5bb4564c/sql/core/benchmarks/WideSchemaBenchmark-results.txt -- diff --git a/sql/core/benchmarks/WideSchemaBenchmark-results.txt b/sql/core/benchmarks/WideSchemaBenchmark-results.txt new file mode 100644 index 000..ea6a661 --- /dev/null +++ b/sql/core/benchmarks/WideSchemaBenchmark-results.txt @@ -0,0 +1,93 @@ +OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on Linux 4.2.0-36-generic +Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz +parsing large select:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +1 select expressions 3 /5 0.0 2967064.0 1.0X +100 select expressions 11 / 12 0.0 11369518.0 0.3X +2500 select expressions243 / 250 0.0 242561004.0 0.0X + +OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on Linux 4.2.0-36-generic +Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz +many column field r/w: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +1 cols x 10 rows (read in-mem) 28 / 40 3.6 278.8 1.0X +1 cols x 10 rows (exec in-mem) 28 / 42 3.5 284.0 1.0X +1 cols x 10 rows (read parquet) 23 / 35 4.4 228.8 1.2X +1 cols x 10 rows (write parquet) 163 / 182 0.6 1633.0 0.2X +100 cols x 1000 rows (read in-mem) 27 / 39 3.7 266.9 1.0X +100 cols x 1000 rows (exec in-mem) 48 / 79 2.1 481.7 0.6X +100 cols x 1000 rows (read parquet) 25 / 36 3.9 254.3 1.1X +100 cols x 1000 rows (write parquet) 182 / 196 0.5 1819.5 0.2X +2500 cols x 40 rows (read in-mem) 280 / 315 0.4 2797.1 0.1X +2500 cols x 40 rows (exec in-mem) 606 / 638 0.2 6064.3 0.0X +2500 cols x 40 rows (read parquet) 836 / 843 0.1 8356.4 0.0X +2500 cols x 40 rows (write 
parquet)      490 / 522           0.2        4900.6       0.1X
+
+OpenJDK 64-Bit Server VM 1.8.0_66-internal-b17 on Linux 4.2.0-36-generic
+Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
+wide shallowly nested struct field r/w:
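The table above comes from the checked-in WideSchemaBenchmark-results.txt, which, per the commit message, is regenerated by re-running the benchmark rather than edited by hand. Below is a minimal, hypothetical sketch (not the actual WideSchemaBenchmark source) of how such a table can be produced with Spark's org.apache.spark.util.Benchmark utility; the SparkSession setup, case names, and the selectExpr workload are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Benchmark

object ParsingLargeSelectBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("parsing large select")
      .getOrCreate()

    // One logical "value" per iteration; only the relative timings matter here.
    val benchmark = new Benchmark("parsing large select", 1)

    Seq(1, 100, 2500).foreach { n =>
      benchmark.addCase(s"$n select expressions") { _ =>
        // Building the DataFrame parses and analyzes a SELECT with n column expressions.
        spark.range(1).toDF("a").selectExpr((1 to n).map(i => s"a as c$i"): _*)
      }
    }

    // Prints the Best/Avg Time(ms), Rate(M/s), Per Row(ns) and Relative columns,
    // which, per the commit above, are captured into the results file on re-run.
    benchmark.run()
  }
}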
spark git commit: [SPARK-15585][SQL] Add doc for turning off quotations
Repository: spark Updated Branches: refs/heads/master ad102af16 -> cb5d933d8 [SPARK-15585][SQL] Add doc for turning off quotations ## What changes were proposed in this pull request? This pr is to add doc for turning off quotations because this behavior is different from `com.databricks.spark.csv`. ## How was this patch tested? Check behavior to put an empty string in csv options. Author: Takeshi YAMAMURO Closes #13616 from maropu/SPARK-15585-2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb5d933d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb5d933d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb5d933d Branch: refs/heads/master Commit: cb5d933d86ac4afd947874f1f1c31c7154cb8249 Parents: ad102af Author: Takeshi YAMAMURO Authored: Sat Jun 11 15:12:21 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:12:21 2016 -0700 -- python/pyspark/sql/readwriter.py | 6 -- .../main/scala/org/apache/spark/sql/DataFrameReader.scala | 4 +++- .../spark/sql/execution/datasources/csv/CSVSuite.scala| 10 ++ 3 files changed, 17 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cb5d933d/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 9208a52..7d1f186 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -320,7 +320,8 @@ class DataFrameReader(object): it uses the default value, ``UTF-8``. :param quote: sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default - value, ``"``. + value, ``"``. If you would like to turn off quotations, you need to set an + empty string. :param escape: sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, ``\``. :param comment: sets the single character used for skipping lines beginning with this @@ -804,7 +805,8 @@ class DataFrameWriter(object): set, it uses the default value, ``,``. :param quote: sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default - value, ``"``. + value, ``"``. If you would like to turn off quotations, you need to set an + empty string. :param escape: sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, ``\`` :param escapeQuotes: A flag indicating whether values containing quotes should always http://git-wip-us.apache.org/repos/asf/spark/blob/cb5d933d/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index b248583..bb5fa2b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -370,7 +370,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * `encoding` (default `UTF-8`): decodes the CSV files by the given encoding * type. * `quote` (default `"`): sets the single character used for escaping quoted values where - * the separator can be part of the value. + * the separator can be part of the value. If you would like to turn off quotations, you need to + * set not `null` but an empty string. 
This behaviour is different from + * `com.databricks.spark.csv`. * `escape` (default `\`): sets the single character used for escaping quotes inside * an already quoted value. * `comment` (default empty string): sets the single character used for skipping lines http://git-wip-us.apache.org/repos/asf/spark/blob/cb5d933d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala index bc95446..f170065 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datas
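A minimal usage sketch of the option documented above: setting `quote` to an empty string (rather than null) turns quote handling off, which differs from com.databricks.spark.csv. The file paths and the `header` option here are illustrative assumptions, not part of the commit.

import org.apache.spark.sql.SparkSession

object CsvQuoteOffExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv quote example").getOrCreate()

    // Default behaviour: `"` is the quote character, so separators inside quotes stay
    // part of the value.
    val quoted = spark.read
      .option("header", "true")
      .csv("/tmp/people.csv")
    quoted.show()

    // Quotations turned off: an empty string (not null) disables quote handling,
    // so `"` is read back as a literal character.
    val unquoted = spark.read
      .option("header", "true")
      .option("quote", "")
      .csv("/tmp/people.csv")
    unquoted.show()

    // The writer side accepts the same option.
    unquoted.write
      .option("quote", "")
      .csv("/tmp/people-unquoted")
  }
}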
spark git commit: [SPARK-15585][SQL] Add doc for turning off quotations
Repository: spark Updated Branches: refs/heads/branch-2.0 8cf33fb8a -> 4c7b208ab [SPARK-15585][SQL] Add doc for turning off quotations ## What changes were proposed in this pull request? This pr is to add doc for turning off quotations because this behavior is different from `com.databricks.spark.csv`. ## How was this patch tested? Check behavior to put an empty string in csv options. Author: Takeshi YAMAMURO Closes #13616 from maropu/SPARK-15585-2. (cherry picked from commit cb5d933d86ac4afd947874f1f1c31c7154cb8249) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c7b208a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c7b208a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c7b208a Branch: refs/heads/branch-2.0 Commit: 4c7b208ab6a6ae17fa137627c90256d757ad433f Parents: 8cf33fb Author: Takeshi YAMAMURO Authored: Sat Jun 11 15:12:21 2016 -0700 Committer: Reynold Xin Committed: Sat Jun 11 15:12:27 2016 -0700 -- python/pyspark/sql/readwriter.py | 6 -- .../main/scala/org/apache/spark/sql/DataFrameReader.scala | 4 +++- .../spark/sql/execution/datasources/csv/CSVSuite.scala| 10 ++ 3 files changed, 17 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4c7b208a/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 9208a52..7d1f186 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -320,7 +320,8 @@ class DataFrameReader(object): it uses the default value, ``UTF-8``. :param quote: sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default - value, ``"``. + value, ``"``. If you would like to turn off quotations, you need to set an + empty string. :param escape: sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, ``\``. :param comment: sets the single character used for skipping lines beginning with this @@ -804,7 +805,8 @@ class DataFrameWriter(object): set, it uses the default value, ``,``. :param quote: sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default - value, ``"``. + value, ``"``. If you would like to turn off quotations, you need to set an + empty string. :param escape: sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, ``\`` :param escapeQuotes: A flag indicating whether values containing quotes should always http://git-wip-us.apache.org/repos/asf/spark/blob/4c7b208a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index b248583..bb5fa2b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -370,7 +370,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * `encoding` (default `UTF-8`): decodes the CSV files by the given encoding * type. * `quote` (default `"`): sets the single character used for escaping quoted values where - * the separator can be part of the value. + * the separator can be part of the value. 
If you would like to turn off quotations, you need to + * set not `null` but an empty string. This behaviour is different from + * `com.databricks.spark.csv`. * `escape` (default `\`): sets the single character used for escaping quotes inside * an already quoted value. * `comment` (default empty string): sets the single character used for skipping lines http://git-wip-us.apache.org/repos/asf/spark/blob/4c7b208a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSui
spark git commit: [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents
Repository: spark Updated Branches: refs/heads/branch-2.0 4c29c55f2 -> 8cf33fb8a [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents ## What changes were proposed in this pull request? This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, this contains some editorial change. **Fix broken links** * mllib-data-types.md * mllib-decision-tree.md * mllib-ensembles.md * mllib-feature-extraction.md * mllib-pmml-model-export.md * mllib-statistics.md **Fix malformed section header and scala coding style** * mllib-linear-methods.md **Replace indirect forward links with direct one** * ml-classification-regression.md ## How was this patch tested? Manual tests (with `cd docs; jekyll build`.) Author: Dongjoon Hyun Closes #13608 from dongjoon-hyun/SPARK-15883. (cherry picked from commit ad102af169c7344b30d3b84aa16452fcdc22542c) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8cf33fb8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8cf33fb8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8cf33fb8 Branch: refs/heads/branch-2.0 Commit: 8cf33fb8a945e8f76833f68fc99b1ad5dee13641 Parents: 4c29c55 Author: Dongjoon Hyun Authored: Sat Jun 11 12:55:38 2016 +0100 Committer: Sean Owen Committed: Sat Jun 11 12:55:48 2016 +0100 -- docs/ml-classification-regression.md | 4 ++-- docs/mllib-data-types.md | 16 ++-- docs/mllib-decision-tree.md | 6 +++--- docs/mllib-ensembles.md | 6 +++--- docs/mllib-feature-extraction.md | 2 +- docs/mllib-linear-methods.md | 10 +- docs/mllib-pmml-model-export.md | 2 +- docs/mllib-statistics.md | 8 8 files changed, 25 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8cf33fb8/docs/ml-classification-regression.md -- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index 88457d4..d7e5521 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -815,7 +815,7 @@ The main differences between this API and the [original MLlib ensembles API](mll ## Random Forests [Random forests](http://en.wikipedia.org/wiki/Random_forest) -are ensembles of [decision trees](ml-decision-tree.html). +are ensembles of [decision trees](ml-classification-regression.html#decision-trees). Random forests combine many decision trees in order to reduce the risk of overfitting. The `spark.ml` implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. @@ -896,7 +896,7 @@ All output columns are optional; to exclude an output column, set its correspond ## Gradient-Boosted Trees (GBTs) [Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting) -are ensembles of [decision trees](ml-decision-tree.html). +are ensembles of [decision trees](ml-classification-regression.html#decision-trees). GBTs iteratively train decision trees in order to minimize a loss function. The `spark.ml` implementation supports GBTs for binary classification and for regression, using both continuous and categorical features. 
http://git-wip-us.apache.org/repos/asf/spark/blob/8cf33fb8/docs/mllib-data-types.md -- diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index 2ffe0f1..ef56aeb 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -33,7 +33,7 @@ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin using the factory methods implemented in [`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors. -Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API. +Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API. {% highlight scala %} import org.apache.spark.mllib.linalg.{Vector, Vectors} @@ -199,7 +199,7 @@ After loading, the feature indices are converted to zero-based. [`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training examples stored in LIBSVM format. -Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API. +Refer to the [`MLUtils` Scala docs](api/scala
spark git commit: [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents
Repository: spark Updated Branches: refs/heads/master 3761330dd -> ad102af16 [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents ## What changes were proposed in this pull request? This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, this contains some editorial change. **Fix broken links** * mllib-data-types.md * mllib-decision-tree.md * mllib-ensembles.md * mllib-feature-extraction.md * mllib-pmml-model-export.md * mllib-statistics.md **Fix malformed section header and scala coding style** * mllib-linear-methods.md **Replace indirect forward links with direct one** * ml-classification-regression.md ## How was this patch tested? Manual tests (with `cd docs; jekyll build`.) Author: Dongjoon Hyun Closes #13608 from dongjoon-hyun/SPARK-15883. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad102af1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad102af1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad102af1 Branch: refs/heads/master Commit: ad102af169c7344b30d3b84aa16452fcdc22542c Parents: 3761330 Author: Dongjoon Hyun Authored: Sat Jun 11 12:55:38 2016 +0100 Committer: Sean Owen Committed: Sat Jun 11 12:55:38 2016 +0100 -- docs/ml-classification-regression.md | 4 ++-- docs/mllib-data-types.md | 16 ++-- docs/mllib-decision-tree.md | 6 +++--- docs/mllib-ensembles.md | 6 +++--- docs/mllib-feature-extraction.md | 2 +- docs/mllib-linear-methods.md | 10 +- docs/mllib-pmml-model-export.md | 2 +- docs/mllib-statistics.md | 8 8 files changed, 25 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/ml-classification-regression.md -- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index 88457d4..d7e5521 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -815,7 +815,7 @@ The main differences between this API and the [original MLlib ensembles API](mll ## Random Forests [Random forests](http://en.wikipedia.org/wiki/Random_forest) -are ensembles of [decision trees](ml-decision-tree.html). +are ensembles of [decision trees](ml-classification-regression.html#decision-trees). Random forests combine many decision trees in order to reduce the risk of overfitting. The `spark.ml` implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. @@ -896,7 +896,7 @@ All output columns are optional; to exclude an output column, set its correspond ## Gradient-Boosted Trees (GBTs) [Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting) -are ensembles of [decision trees](ml-decision-tree.html). +are ensembles of [decision trees](ml-classification-regression.html#decision-trees). GBTs iteratively train decision trees in order to minimize a loss function. The `spark.ml` implementation supports GBTs for binary classification and for regression, using both continuous and categorical features. http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-data-types.md -- diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index 2ffe0f1..ef56aeb 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -33,7 +33,7 @@ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin using the factory methods implemented in [`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors. 
-Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API. +Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API. {% highlight scala %} import org.apache.spark.mllib.linalg.{Vector, Vectors} @@ -199,7 +199,7 @@ After loading, the feature indices are converted to zero-based. [`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training examples stored in LIBSVM format. -Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API. +Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API. {% highlight scala %} imp
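The link fixes above mostly add a trailing `$` to the Scaladoc anchors because `Vectors` and `MLUtils` are Scala objects, which Scaladoc indexes as `Vectors$` and `MLUtils$`, while the `Vector` trait keeps its plain name. A small illustrative sketch of the factory methods those corrected links document (the values are arbitrary):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object VectorsExample {
  def main(args: Array[String]): Unit = {
    // Dense vector created through the Vectors factory object (Scaladoc page Vectors$).
    val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

    // Sparse vector of size 3 with nonzero entries at indices 0 and 2.
    val sparse: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

    println(dense)
    println(sparse)
  }
}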
spark git commit: [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache"
Repository: spark Updated Branches: refs/heads/branch-2.0 f0fa0a894 -> 4c29c55f2 [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache" ## What changes were proposed in this pull request? Use new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files. ## How was this patch tested? Manual check of generated HTML site and Spark UI. I searched for references to the deleted files to make sure they were not used. Author: Sean Owen Closes #13609 from srowen/SPARK-15879. (cherry picked from commit 3761330dd0151d7369d7fba4d4c344e9863990ef) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c29c55f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c29c55f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c29c55f Branch: refs/heads/branch-2.0 Commit: 4c29c55f22d57c5fbadd0b759155fbab4b07a70a Parents: f0fa0a8 Author: Sean Owen Authored: Sat Jun 11 12:46:07 2016 +0100 Committer: Sean Owen Committed: Sat Jun 11 12:46:21 2016 +0100 -- .../spark/ui/static/spark-logo-77x50px-hd.png | Bin 3536 -> 4182 bytes .../org/apache/spark/ui/static/spark_logo.png | Bin 14233 -> 0 bytes docs/img/incubator-logo.png | Bin 11651 -> 0 bytes docs/img/spark-logo-100x40px.png| Bin 3635 -> 0 bytes docs/img/spark-logo-77x40px-hd.png | Bin 1904 -> 0 bytes docs/img/spark-logo-77x50px-hd.png | Bin 3536 -> 0 bytes docs/img/spark-logo-hd.png | Bin 13512 -> 16418 bytes 7 files changed, 0 insertions(+), 0 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png -- diff --git a/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png b/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png index 6c5f099..ffe2550 100644 Binary files a/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png and b/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png -- diff --git a/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png b/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png deleted file mode 100644 index 4b18734..000 Binary files a/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/docs/img/incubator-logo.png -- diff --git a/docs/img/incubator-logo.png b/docs/img/incubator-logo.png deleted file mode 100644 index 33ca7f6..000 Binary files a/docs/img/incubator-logo.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/docs/img/spark-logo-100x40px.png -- diff --git a/docs/img/spark-logo-100x40px.png b/docs/img/spark-logo-100x40px.png deleted file mode 100644 index 54c3187..000 Binary files a/docs/img/spark-logo-100x40px.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/docs/img/spark-logo-77x40px-hd.png -- diff --git a/docs/img/spark-logo-77x40px-hd.png b/docs/img/spark-logo-77x40px-hd.png deleted file mode 100644 index 270402f..000 Binary files a/docs/img/spark-logo-77x40px-hd.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/docs/img/spark-logo-77x50px-hd.png -- diff --git a/docs/img/spark-logo-77x50px-hd.png b/docs/img/spark-logo-77x50px-hd.png deleted file mode 
100644 index 6c5f099..000 Binary files a/docs/img/spark-logo-77x50px-hd.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/4c29c55f/docs/img/spark-logo-hd.png -- diff --git a/docs/img/spark-logo-hd.png b/docs/img/spark-logo-hd.png index 1381e30..e4508e7 100644 Binary files a/docs/img/spark-logo-hd.png and b/docs/img/spark-logo-hd.png differ
spark git commit: [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache"
Repository: spark Updated Branches: refs/heads/master 7504bc73f -> 3761330dd [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache" ## What changes were proposed in this pull request? Use new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files. ## How was this patch tested? Manual check of generated HTML site and Spark UI. I searched for references to the deleted files to make sure they were not used. Author: Sean Owen Closes #13609 from srowen/SPARK-15879. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3761330d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3761330d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3761330d Branch: refs/heads/master Commit: 3761330dd0151d7369d7fba4d4c344e9863990ef Parents: 7504bc7 Author: Sean Owen Authored: Sat Jun 11 12:46:07 2016 +0100 Committer: Sean Owen Committed: Sat Jun 11 12:46:07 2016 +0100 -- .../spark/ui/static/spark-logo-77x50px-hd.png | Bin 3536 -> 4182 bytes .../org/apache/spark/ui/static/spark_logo.png | Bin 14233 -> 0 bytes docs/img/incubator-logo.png | Bin 11651 -> 0 bytes docs/img/spark-logo-100x40px.png| Bin 3635 -> 0 bytes docs/img/spark-logo-77x40px-hd.png | Bin 1904 -> 0 bytes docs/img/spark-logo-77x50px-hd.png | Bin 3536 -> 0 bytes docs/img/spark-logo-hd.png | Bin 13512 -> 16418 bytes 7 files changed, 0 insertions(+), 0 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png -- diff --git a/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png b/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png index 6c5f099..ffe2550 100644 Binary files a/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png and b/core/src/main/resources/org/apache/spark/ui/static/spark-logo-77x50px-hd.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png -- diff --git a/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png b/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png deleted file mode 100644 index 4b18734..000 Binary files a/core/src/main/resources/org/apache/spark/ui/static/spark_logo.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/docs/img/incubator-logo.png -- diff --git a/docs/img/incubator-logo.png b/docs/img/incubator-logo.png deleted file mode 100644 index 33ca7f6..000 Binary files a/docs/img/incubator-logo.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/docs/img/spark-logo-100x40px.png -- diff --git a/docs/img/spark-logo-100x40px.png b/docs/img/spark-logo-100x40px.png deleted file mode 100644 index 54c3187..000 Binary files a/docs/img/spark-logo-100x40px.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/docs/img/spark-logo-77x40px-hd.png -- diff --git a/docs/img/spark-logo-77x40px-hd.png b/docs/img/spark-logo-77x40px-hd.png deleted file mode 100644 index 270402f..000 Binary files a/docs/img/spark-logo-77x40px-hd.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/docs/img/spark-logo-77x50px-hd.png -- diff --git a/docs/img/spark-logo-77x50px-hd.png b/docs/img/spark-logo-77x50px-hd.png deleted file mode 100644 index 6c5f099..000 Binary files a/docs/img/spark-logo-77x50px-hd.png and /dev/null differ 
http://git-wip-us.apache.org/repos/asf/spark/blob/3761330d/docs/img/spark-logo-hd.png -- diff --git a/docs/img/spark-logo-hd.png b/docs/img/spark-logo-hd.png index 1381e30..e4508e7 100644 Binary files a/docs/img/spark-logo-hd.png and b/docs/img/spark-logo-hd.png differ