spark git commit: [SPARK-17190][SQL] Removal of HiveSharedState
Repository: spark Updated Branches: refs/heads/master ac27557eb -> 4d0706d61 [SPARK-17190][SQL] Removal of HiveSharedState ### What changes were proposed in this pull request? Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is applied directly to the choice of `ExternalCatalog` type, based on the configuration of `CATALOG_IMPLEMENTATION`. ~~`HiveClient` is also used/invoked by other entities besides HiveExternalCatalog, so we define the following two APIs: getClient and getNewClient~~ ### How was this patch tested? The existing test cases. Author: gatorsmile. Closes #14757 from gatorsmile/removeHiveClient. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d0706d6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d0706d6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d0706d6 Branch: refs/heads/master Commit: 4d0706d616176dc29ff3562e40cb00dd4eb9c302 Parents: ac27557 Author: gatorsmile Authored: Thu Aug 25 12:50:03 2016 +0800 Committer: Wenchen Fan Committed: Thu Aug 25 12:50:03 2016 +0800 -- .../sql/catalyst/catalog/InMemoryCatalog.scala | 8 +++- .../org/apache/spark/sql/SparkSession.scala | 14 +- .../apache/spark/sql/internal/SharedState.scala | 47 +++- .../hive/thriftserver/HiveThriftServer2.scala | 2 +- .../org/apache/spark/sql/hive/HiveContext.scala | 4 -- .../spark/sql/hive/HiveExternalCatalog.scala| 10 - .../spark/sql/hive/HiveMetastoreCatalog.scala | 3 +- .../spark/sql/hive/HiveSessionState.scala | 9 ++-- .../apache/spark/sql/hive/HiveSharedState.scala | 47 .../apache/spark/sql/hive/test/TestHive.scala | 15 +++ .../spark/sql/hive/HiveDataFrameSuite.scala | 2 +- .../sql/hive/HiveExternalCatalogSuite.scala | 16 +++ .../spark/sql/hive/HiveSparkSubmitSuite.scala | 5 +-- .../sql/hive/MetastoreDataSourcesSuite.scala| 3 +- .../spark/sql/hive/ShowCreateTableSuite.scala | 2 +- 15 files changed, 88 insertions(+), 99 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4d0706d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala index 9ebf7de..b55ddcb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala @@ -24,7 +24,7 @@ import scala.collection.mutable import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.Path -import org.apache.spark.SparkException +import org.apache.spark.{SparkConf, SparkException} import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} import org.apache.spark.sql.catalyst.analysis._ @@ -39,7 +39,11 @@ import org.apache.spark.sql.catalyst.util.StringUtils * * All public methods should be synchronized for thread-safety. 
*/ -class InMemoryCatalog(hadoopConfig: Configuration = new Configuration) extends ExternalCatalog { +class InMemoryCatalog( +conf: SparkConf = new SparkConf, +hadoopConfig: Configuration = new Configuration) + extends ExternalCatalog { + import CatalogTypes.TablePartitionSpec private class TableDesc(var table: CatalogTable) { http://git-wip-us.apache.org/repos/asf/spark/blob/4d0706d6/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala index 362bf45..0f6292d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -96,10 +96,7 @@ class SparkSession private( */ @transient private[sql] lazy val sharedState: SharedState = { -existingSharedState.getOrElse( - SparkSession.reflect[SharedState, SparkContext]( -SparkSession.sharedStateClassName(sparkContext.conf), -sparkContext)) +existingSharedState.getOrElse(new SharedState(sparkContext)) } /** @@
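The reflection step described above lands in `SharedState` (visible in the `SparkSession.scala` hunk, which now constructs `SharedState` directly). Below is a minimal sketch of that pattern, assuming the Spark 2.x class and configuration names; it is illustrative, not the exact post-patch code.
```
// Sketch: pick the ExternalCatalog implementation from the
// spark.sql.catalogImplementation setting and instantiate it reflectively.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

def instantiateCatalog(conf: SparkConf, hadoopConf: Configuration): AnyRef = {
  val className = conf.get("spark.sql.catalogImplementation", "in-memory") match {
    case "hive" => "org.apache.spark.sql.hive.HiveExternalCatalog"
    case _      => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
  }
  // Both implementations take (SparkConf, Configuration), which is why the
  // InMemoryCatalog constructor gains a SparkConf parameter in this diff.
  Class.forName(className)
    .getDeclaredConstructor(classOf[SparkConf], classOf[Configuration])
    .newInstance(conf, hadoopConf)
}
```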
spark git commit: [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints
Repository: spark Updated Branches: refs/heads/branch-2.0 3258f27a8 -> aa57083af [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints ## What changes were proposed in this pull request? Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inference logic by simply ignoring them. ## How was this patch tested? Added a new test in `ConstraintPropagationSuite`. Author: Sameer Agarwal. Closes #14795 from sameeragarwal/deterministic-constraints. (cherry picked from commit ac27557eb622a257abeb3e8551f06ebc72f87133) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa57083a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa57083a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa57083a Branch: refs/heads/branch-2.0 Commit: aa57083af4cecb595bac09e437607d7142b54913 Parents: 3258f27 Author: Sameer Agarwal Authored: Wed Aug 24 21:24:24 2016 -0700 Committer: Reynold Xin Committed: Wed Aug 24 21:24:31 2016 -0700 -- .../spark/sql/catalyst/plans/QueryPlan.scala | 3 ++- .../plans/ConstraintPropagationSuite.scala | 17 + 2 files changed, 19 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aa57083a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index cf34f4b..9c60590 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -35,7 +35,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT .union(inferAdditionalConstraints(constraints)) .union(constructIsNotNullConstraints(constraints)) .filter(constraint => -constraint.references.nonEmpty && constraint.references.subsetOf(outputSet)) +constraint.references.nonEmpty && constraint.references.subsetOf(outputSet) && + constraint.deterministic) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/aa57083a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala index 5a76969..8d6a49a 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala @@ -352,4 +352,21 @@ class ConstraintPropagationSuite extends SparkFunSuite { verifyConstraints(tr.analyze.constraints, ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "b")), IsNotNull(resolveColumn(tr, "c") } + + test("not infer non-deterministic constraints") { +val tr = LocalRelation('a.int, 'b.string, 'c.int) + +verifyConstraints(tr + .where('a.attr === Rand(0)) + .analyze.constraints, + ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "a") + +verifyConstraints(tr + .where('a.attr === InputFileName()) + .where('a.attr =!= 'c.attr) + .analyze.constraints, + ExpressionSet(Seq(resolveColumn(tr, "a") =!= resolveColumn(tr, "c"), +IsNotNull(resolveColumn(tr, "a")), 
+IsNotNull(resolveColumn(tr, "c") + } }
spark git commit: [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints
Repository: spark Updated Branches: refs/heads/master 3a60be4b1 -> ac27557eb [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints ## What changes were proposed in this pull request? Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inference logic by simply ignoring them. ## How was this patch tested? Added a new test in `ConstraintPropagationSuite`. Author: Sameer Agarwal. Closes #14795 from sameeragarwal/deterministic-constraints. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac27557e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac27557e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac27557e Branch: refs/heads/master Commit: ac27557eb622a257abeb3e8551f06ebc72f87133 Parents: 3a60be4 Author: Sameer Agarwal Authored: Wed Aug 24 21:24:24 2016 -0700 Committer: Reynold Xin Committed: Wed Aug 24 21:24:24 2016 -0700 -- .../spark/sql/catalyst/plans/QueryPlan.scala | 3 ++- .../plans/ConstraintPropagationSuite.scala | 17 + 2 files changed, 19 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ac27557e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala index 8ee31f4..0fb6e7d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala @@ -35,7 +35,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT .union(inferAdditionalConstraints(constraints)) .union(constructIsNotNullConstraints(constraints)) .filter(constraint => -constraint.references.nonEmpty && constraint.references.subsetOf(outputSet)) +constraint.references.nonEmpty && constraint.references.subsetOf(outputSet) && + constraint.deterministic) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/ac27557e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala index 5a76969..8d6a49a 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala @@ -352,4 +352,21 @@ class ConstraintPropagationSuite extends SparkFunSuite { verifyConstraints(tr.analyze.constraints, ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "b")), IsNotNull(resolveColumn(tr, "c") } + + test("not infer non-deterministic constraints") { +val tr = LocalRelation('a.int, 'b.string, 'c.int) + +verifyConstraints(tr + .where('a.attr === Rand(0)) + .analyze.constraints, + ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "a") + +verifyConstraints(tr + .where('a.attr === InputFileName()) + .where('a.attr =!= 'c.attr) + .analyze.constraints, + ExpressionSet(Seq(resolveColumn(tr, "a") =!= resolveColumn(tr, "c"), +IsNotNull(resolveColumn(tr, "a")), +IsNotNull(resolveColumn(tr, "c") + } }
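To make the behavior change concrete, here is a small sketch against the public API (assuming a local SparkSession; the plan's `constraints` set is what the modified filter in `QueryPlan.scala` produces):
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
val df = spark.range(100).filter(col("id") === rand(0) * 100)
// With this patch the analyzed plan keeps IsNotNull(id) but drops the
// rand-based equality, so the non-deterministic predicate can no longer be
// propagated or used to infer further (incorrect) constraints.
println(df.queryExecution.analyzed.constraints)
```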
spark git commit: [SPARK-16216][SQL][BRANCH-2.0] Backport Read/write dateFormat/timestampFormat options for CSV and JSON
Repository: spark Updated Branches: refs/heads/branch-2.0 9f363a690 -> 3258f27a8 [SPARK-16216][SQL][BRANCH-2.0] Backport Read/write dateFormat/timestampFormat options for CSV and JSON ## What changes were proposed in this pull request? This PR backports https://github.com/apache/spark/pull/14279 to 2.0. ## How was this patch tested? Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases. Author: hyukjinkwon. Closes #14799 from HyukjinKwon/SPARK-16216-json-csv-backport. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3258f27a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3258f27a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3258f27a Branch: refs/heads/branch-2.0 Commit: 3258f27a881dfeb5ab8bae90c338603fa4b6f9d8 Parents: 9f363a6 Author: hyukjinkwon Authored: Wed Aug 24 21:19:35 2016 -0700 Committer: Reynold Xin Committed: Wed Aug 24 21:19:35 2016 -0700 -- python/pyspark/sql/readwriter.py| 56 +-- python/pyspark/sql/streaming.py | 30 +++- .../org/apache/spark/sql/DataFrameReader.scala | 17 +- .../org/apache/spark/sql/DataFrameWriter.scala | 12 ++ .../datasources/csv/CSVInferSchema.scala| 42 ++--- .../execution/datasources/csv/CSVOptions.scala | 15 +- .../execution/datasources/csv/CSVRelation.scala | 43 - .../datasources/json/JSONOptions.scala | 9 ++ .../datasources/json/JacksonGenerator.scala | 14 +- .../datasources/json/JacksonParser.scala| 68 .../datasources/json/JsonFileFormat.scala | 5 +- .../spark/sql/streaming/DataStreamReader.scala | 16 +- .../datasources/csv/CSVInferSchemaSuite.scala | 4 +- .../execution/datasources/csv/CSVSuite.scala| 156 ++- .../datasources/csv/CSVTypeCastSuite.scala | 17 +- .../execution/datasources/json/JsonSuite.scala | 74 - .../datasources/json/TestJsonData.scala | 6 + .../sql/sources/JsonHadoopFsRelationSuite.scala | 4 + 18 files changed, 478 insertions(+), 110 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3258f27a/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 64de33e..3da6f49 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -156,7 +156,7 @@ class DataFrameReader(OptionUtils): def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, - mode=None, columnNameOfCorruptRecord=None): + mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None): """ Loads a JSON file (one object per line) or an RDD of Strings storing JSON objects (one object per record) and returns the result as a :class`DataFrame`. @@ -198,6 +198,14 @@ class DataFrameReader(OptionUtils): ``spark.sql.columnNameOfCorruptRecord``. If None is set, it uses the value specified in ``spark.sql.columnNameOfCorruptRecord``. +:param dateFormat: sets the string that indicates a date format. Custom date formats + follow the formats at ``java.text.SimpleDateFormat``. This + applies to date type. If None is set, it uses the + default value, ``yyyy-MM-dd``. +:param timestampFormat: sets the string that indicates a timestamp format. Custom date +formats follow the formats at ``java.text.SimpleDateFormat``. +This applies to timestamp type. If None is set, it uses the +default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSZZ``. 
>>> df1 = spark.read.json('python/test_support/sql/people.json') >>> df1.dtypes @@ -213,7 +221,8 @@ class DataFrameReader(OptionUtils): allowComments=allowComments, allowUnquotedFieldNames=allowUnquotedFieldNames, allowSingleQuotes=allowSingleQuotes, allowNumericLeadingZero=allowNumericLeadingZero, allowBackslashEscapingAnyCharacter=allowBackslashEscapingAnyCharacter, -mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord) +mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord,
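A usage sketch of the backported reader options (option names as in the diff above; the path and patterns are illustrative):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
// Read JSON with explicit date/timestamp patterns via the new options.
val df = spark.read
  .option("dateFormat", "yyyy/MM/dd")
  .option("timestampFormat", "dd/MM/yyyy HH:mm")
  .json("/tmp/dates.json") // illustrative path
```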
spark git commit: [SPARKR][MINOR] Add more examples to window function docs
Repository: spark Updated Branches: refs/heads/branch-2.0 9f924a01b -> 43273377a [SPARKR][MINOR] Add more examples to window function docs ## What changes were proposed in this pull request? This PR adds more examples to window function docs to make them more accessible to the users. It also fixes default value issues for `lag` and `lead`. ## How was this patch tested? Manual test, R unit test. Author: Junyang Qian. Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs. (cherry picked from commit 18708f76c366c6e01b5865981666e40d8642ac20) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43273377 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43273377 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43273377 Branch: refs/heads/branch-2.0 Commit: 43273377a38a9136ff5e56929630930f076af5af Parents: 9f924a0 Author: Junyang Qian Authored: Wed Aug 24 16:00:04 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 16:00:18 2016 -0700 -- R/pkg/R/WindowSpec.R | 12 R/pkg/R/functions.R | 78 --- 2 files changed, 72 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/43273377/R/pkg/R/WindowSpec.R -- diff --git a/R/pkg/R/WindowSpec.R b/R/pkg/R/WindowSpec.R index ddd2ef2..4ac83c2 100644 --- a/R/pkg/R/WindowSpec.R +++ b/R/pkg/R/WindowSpec.R @@ -203,6 +203,18 @@ setMethod("rangeBetween", #' @aliases over,Column,WindowSpec-method #' @family colum_func #' @export +#' @examples \dontrun{ +#' df <- createDataFrame(mtcars) +#' +#' # Partition by am (transmission) and order by hp (horsepower) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' +#' # Rank on hp within each partition +#' out <- select(df, over(rank(), ws), df$hp, df$am) +#' +#' # Lag mpg values by 1 row on the partition-and-ordered table +#' out <- select(df, over(lead(df$mpg), ws), df$mpg, df$hp, df$am) +#' } #' @note over since 2.0.0 setMethod("over", signature(x = "Column", window = "WindowSpec"), http://git-wip-us.apache.org/repos/asf/spark/blob/43273377/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index f042add..dbf8dd8 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -3121,9 +3121,9 @@ setMethod("ifelse", #' @aliases cume_dist,missing-method #' @export #' @examples \dontrun{ -#' df <- createDataFrame(iris) -#' ws <- orderBy(windowPartitionBy("Species"), "Sepal_Length") -#' out <- select(df, over(cume_dist(), ws), df$Sepal_Length, df$Species) +#' df <- createDataFrame(mtcars) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' out <- select(df, over(cume_dist(), ws), df$hp, df$am) #' } #' @note cume_dist since 1.6.0 setMethod("cume_dist", @@ -3148,7 +3148,11 @@ setMethod("cume_dist", #' @family window_funcs #' @aliases dense_rank,missing-method #' @export -#' @examples \dontrun{dense_rank()} +#' @examples \dontrun{ +#' df <- createDataFrame(mtcars) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' out <- select(df, over(dense_rank(), ws), df$hp, df$am) +#' } #' @note dense_rank since 1.6.0 setMethod("dense_rank", signature("missing"), @@ -3168,18 +3172,26 @@ setMethod("dense_rank", #' @param x the column as a character string or a Column to compute on. #' @param offset the number of rows back from the current row from which to obtain a value. #' If not specified, the default is 1. -#' @param defaultValue default to use when the offset row does not exist. +#' @param defaultValue (optional) default to use when the offset row does not exist. 
#' @param ... further arguments to be passed to or from other methods. #' @rdname lag #' @name lag #' @aliases lag,characterOrColumn-method #' @family window_funcs #' @export -#' @examples \dontrun{lag(df$c)} +#' @examples \dontrun{ +#' df <- createDataFrame(mtcars) +#' +#' # Partition by am (transmission) and order by hp (horsepower) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' +#' # Lag mpg values by 1 row on the partition-and-ordered table +#' out <- select(df, over(lag(df$mpg), ws), df$mpg, df$hp, df$am) +#' } #' @note lag since 1.6.0 setMethod("lag", signature(x = "characterOrColumn"), - function(x, offset, defaultValue = NULL) { + function(x, offset = 1, defaultValue = NULL) { col <- if (class(x) == "Column") { x@jc } else { @@ -3194,25 +3206,35 @@ setMethod("lag", #' lead #' #' Window function:
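For readers more familiar with the Scala API, the new R examples correspond roughly to the following sketch (assuming `df` is a DataFrame with the mtcars columns; this is not part of the patch):
```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, rank}

def rankAndLag(df: DataFrame): DataFrame = {
  // Partition by am (transmission) and order by hp (horsepower).
  val ws = Window.partitionBy("am").orderBy("hp")
  // Rank on hp within each partition, and lag mpg values by 1 row.
  df.select(rank().over(ws), lag("mpg", 1).over(ws), col("hp"), col("am"))
}
```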
spark git commit: [SPARKR][MINOR] Add installation message for remote master mode and improve other messages
Repository: spark Updated Branches: refs/heads/master 18708f76c -> 3a60be4b1 [SPARKR][MINOR] Add installation message for remote master mode and improve other messages ## What changes were proposed in this pull request? This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package in their local machine. As a clarification, for now, automatic installation will only happen if they start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In the remote master mode, the local Spark package is still needed, but we will not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we here try to provide a detailed message that may help the users. Some of the other messages have also been slightly changed. ## How was this patch tested? Manual test. Author: Junyang Qian. Closes #14761 from junyangq/SPARK-16579-V1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a60be4b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a60be4b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a60be4b Branch: refs/heads/master Commit: 3a60be4b15a5ab9b6e0c4839df99dac7738aa7fe Parents: 18708f7 Author: Junyang Qian Authored: Wed Aug 24 16:04:14 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 16:04:14 2016 -0700 -- R/pkg/R/install.R | 64 ++ R/pkg/R/sparkR.R | 51 ++-- R/pkg/R/utils.R | 4 ++-- 3 files changed, 80 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3a60be4b/R/pkg/R/install.R -- diff --git a/R/pkg/R/install.R b/R/pkg/R/install.R index c6ed88e..69b0a52 100644 --- a/R/pkg/R/install.R +++ b/R/pkg/R/install.R @@ -70,9 +70,9 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, localDir = NULL, overwrite = FALSE) { version <- paste0("spark-", packageVersion("SparkR")) hadoopVersion <- tolower(hadoopVersion) - hadoopVersionName <- hadoop_version_name(hadoopVersion) + hadoopVersionName <- hadoopVersionName(hadoopVersion) packageName <- paste(version, "bin", hadoopVersionName, sep = "-") - localDir <- ifelse(is.null(localDir), spark_cache_path(), + localDir <- ifelse(is.null(localDir), sparkCachePath(), normalizePath(localDir, mustWork = FALSE)) if (is.na(file.info(localDir)$isdir)) { @@ -88,12 +88,14 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, # can use dir.exists(packageLocalDir) under R 3.2.0 or later if (!is.na(file.info(packageLocalDir)$isdir) && !overwrite) { -fmt <- "Spark %s for Hadoop %s is found, and SPARK_HOME set to %s" +fmt <- "%s for Hadoop %s found, with SPARK_HOME set to %s" msg <- sprintf(fmt, version, ifelse(hadoopVersion == "without", "Free build", hadoopVersion), packageLocalDir) message(msg) Sys.setenv(SPARK_HOME = packageLocalDir) return(invisible(packageLocalDir)) + } else { +message("Spark not found in the cache directory. 
Installation will start.") } packageLocalPath <- paste0(packageLocalDir, ".tgz") @@ -102,7 +104,7 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, if (tarExists && !overwrite) { message("tar file found.") } else { -robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) +robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) } message(sprintf("Installing to %s", localDir)) @@ -116,33 +118,37 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, invisible(packageLocalDir) } -robust_download_tar <- function(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) { +robustDownloadTar <- function(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) { # step 1: use user-provided url if (!is.null(mirrorUrl)) { msg <- sprintf("Use user-provided mirror site: %s.", mirrorUrl) message(msg) -success <- direct_download_tar(mirrorUrl, version, hadoopVersion, +success <- directDownloadTar(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) -if (success) return() +if (success) { + return() +} else { + message(paste0("Unable to download from mirrorUrl: ", mirrorUrl)) +} } else { -message("Mirror site not provided.") +message("MirrorUrl not provided.") } # step 2: use url suggested from
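The renamed `robustDownloadTar` tries download sources in order: the user-provided mirror, then a suggested Apache mirror, then the default archive. A language-neutral sketch of that fallback pattern in Scala (names and messages are illustrative, not the R implementation):
```
// Try each candidate URL in turn until one download succeeds.
def downloadWithFallback(userMirror: Option[String], candidates: Seq[String])
                        (download: String => Boolean): Boolean = {
  val sources = userMirror.toSeq ++ candidates
  sources.exists { url =>
    val ok = download(url)
    if (!ok) println(s"Unable to download from: $url")
    ok
  }
}
```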
spark git commit: [SPARKR][MINOR] Add installation message for remote master mode and improve other messages
Repository: spark Updated Branches: refs/heads/branch-2.0 43273377a -> 9f363a690 [SPARKR][MINOR] Add installation message for remote master mode and improve other messages ## What changes were proposed in this pull request? This PR gives an informative message to users when they try to connect to a remote master but don't have the Spark package in their local machine. As a clarification, for now, automatic installation will only happen if they start SparkR in the R console (rather than from sparkr-shell) and connect to a local master. In the remote master mode, the local Spark package is still needed, but we will not trigger the install.spark function because the versions have to match those on the cluster, which involves more user input. Instead, we here try to provide a detailed message that may help the users. Some of the other messages have also been slightly changed. ## How was this patch tested? Manual test. Author: Junyang Qian. Closes #14761 from junyangq/SPARK-16579-V1. (cherry picked from commit 3a60be4b15a5ab9b6e0c4839df99dac7738aa7fe) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9f363a69 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9f363a69 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9f363a69 Branch: refs/heads/branch-2.0 Commit: 9f363a690102f04a2a486853c1b89134455518bc Parents: 4327337 Author: Junyang Qian Authored: Wed Aug 24 16:04:14 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 16:04:26 2016 -0700 -- R/pkg/R/install.R | 64 ++ R/pkg/R/sparkR.R | 51 ++-- R/pkg/R/utils.R | 4 ++-- 3 files changed, 80 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9f363a69/R/pkg/R/install.R -- diff --git a/R/pkg/R/install.R b/R/pkg/R/install.R index c6ed88e..69b0a52 100644 --- a/R/pkg/R/install.R +++ b/R/pkg/R/install.R @@ -70,9 +70,9 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, localDir = NULL, overwrite = FALSE) { version <- paste0("spark-", packageVersion("SparkR")) hadoopVersion <- tolower(hadoopVersion) - hadoopVersionName <- hadoop_version_name(hadoopVersion) + hadoopVersionName <- hadoopVersionName(hadoopVersion) packageName <- paste(version, "bin", hadoopVersionName, sep = "-") - localDir <- ifelse(is.null(localDir), spark_cache_path(), + localDir <- ifelse(is.null(localDir), sparkCachePath(), normalizePath(localDir, mustWork = FALSE)) if (is.na(file.info(localDir)$isdir)) { @@ -88,12 +88,14 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, # can use dir.exists(packageLocalDir) under R 3.2.0 or later if (!is.na(file.info(packageLocalDir)$isdir) && !overwrite) { -fmt <- "Spark %s for Hadoop %s is found, and SPARK_HOME set to %s" +fmt <- "%s for Hadoop %s found, with SPARK_HOME set to %s" msg <- sprintf(fmt, version, ifelse(hadoopVersion == "without", "Free build", hadoopVersion), packageLocalDir) message(msg) Sys.setenv(SPARK_HOME = packageLocalDir) return(invisible(packageLocalDir)) + } else { +message("Spark not found in the cache directory. 
Installation will start.") } packageLocalPath <- paste0(packageLocalDir, ".tgz") @@ -102,7 +104,7 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, if (tarExists && !overwrite) { message("tar file found.") } else { -robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) +robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) } message(sprintf("Installing to %s", localDir)) @@ -116,33 +118,37 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = NULL, invisible(packageLocalDir) } -robust_download_tar <- function(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) { +robustDownloadTar <- function(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) { # step 1: use user-provided url if (!is.null(mirrorUrl)) { msg <- sprintf("Use user-provided mirror site: %s.", mirrorUrl) message(msg) -success <- direct_download_tar(mirrorUrl, version, hadoopVersion, +success <- directDownloadTar(mirrorUrl, version, hadoopVersion, packageName, packageLocalPath) -if (success) return() +if (success) { + return() +} else { + message(paste0("Unable to download from mirrorUrl: ", mirrorUrl)) +} } else {
spark git commit: [SPARKR][MINOR] Add more examples to window function docs
Repository: spark Updated Branches: refs/heads/master 945c04bcd -> 18708f76c [SPARKR][MINOR] Add more examples to window function docs ## What changes were proposed in this pull request? This PR adds more examples to window function docs to make them more accessible to the users. It also fixes default value issues for `lag` and `lead`. ## How was this patch tested? Manual test, R unit test. Author: Junyang Qian. Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18708f76 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18708f76 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18708f76 Branch: refs/heads/master Commit: 18708f76c366c6e01b5865981666e40d8642ac20 Parents: 945c04b Author: Junyang Qian Authored: Wed Aug 24 16:00:04 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 16:00:04 2016 -0700 -- R/pkg/R/WindowSpec.R | 12 R/pkg/R/functions.R | 78 --- 2 files changed, 72 insertions(+), 18 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/18708f76/R/pkg/R/WindowSpec.R -- diff --git a/R/pkg/R/WindowSpec.R b/R/pkg/R/WindowSpec.R index ddd2ef2..4ac83c2 100644 --- a/R/pkg/R/WindowSpec.R +++ b/R/pkg/R/WindowSpec.R @@ -203,6 +203,18 @@ setMethod("rangeBetween", #' @aliases over,Column,WindowSpec-method #' @family colum_func #' @export +#' @examples \dontrun{ +#' df <- createDataFrame(mtcars) +#' +#' # Partition by am (transmission) and order by hp (horsepower) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' +#' # Rank on hp within each partition +#' out <- select(df, over(rank(), ws), df$hp, df$am) +#' +#' # Lag mpg values by 1 row on the partition-and-ordered table +#' out <- select(df, over(lead(df$mpg), ws), df$mpg, df$hp, df$am) +#' } #' @note over since 2.0.0 setMethod("over", signature(x = "Column", window = "WindowSpec"), http://git-wip-us.apache.org/repos/asf/spark/blob/18708f76/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index f042add..dbf8dd8 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -3121,9 +3121,9 @@ setMethod("ifelse", #' @aliases cume_dist,missing-method #' @export #' @examples \dontrun{ -#' df <- createDataFrame(iris) -#' ws <- orderBy(windowPartitionBy("Species"), "Sepal_Length") -#' out <- select(df, over(cume_dist(), ws), df$Sepal_Length, df$Species) +#' df <- createDataFrame(mtcars) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' out <- select(df, over(cume_dist(), ws), df$hp, df$am) #' } #' @note cume_dist since 1.6.0 setMethod("cume_dist", @@ -3148,7 +3148,11 @@ setMethod("cume_dist", #' @family window_funcs #' @aliases dense_rank,missing-method #' @export -#' @examples \dontrun{dense_rank()} +#' @examples \dontrun{ +#' df <- createDataFrame(mtcars) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' out <- select(df, over(dense_rank(), ws), df$hp, df$am) +#' } #' @note dense_rank since 1.6.0 setMethod("dense_rank", signature("missing"), @@ -3168,18 +3172,26 @@ setMethod("dense_rank", #' @param x the column as a character string or a Column to compute on. #' @param offset the number of rows back from the current row from which to obtain a value. #' If not specified, the default is 1. -#' @param defaultValue default to use when the offset row does not exist. +#' @param defaultValue (optional) default to use when the offset row does not exist. #' @param ... further arguments to be passed to or from other methods. 
#' @rdname lag #' @name lag #' @aliases lag,characterOrColumn-method #' @family window_funcs #' @export -#' @examples \dontrun{lag(df$c)} +#' @examples \dontrun{ +#' df <- createDataFrame(mtcars) +#' +#' # Partition by am (transmission) and order by hp (horsepower) +#' ws <- orderBy(windowPartitionBy("am"), "hp") +#' +#' # Lag mpg values by 1 row on the partition-and-ordered table +#' out <- select(df, over(lag(df$mpg), ws), df$mpg, df$hp, df$am) +#' } #' @note lag since 1.6.0 setMethod("lag", signature(x = "characterOrColumn"), - function(x, offset, defaultValue = NULL) { + function(x, offset = 1, defaultValue = NULL) { col <- if (class(x) == "Column") { x@jc } else { @@ -3194,25 +3206,35 @@ setMethod("lag", #' lead #' #' Window function: returns the value that is \code{offset} rows after the current row, and -#' NULL if there is less than \code{offset} rows after the
spark git commit: [MINOR][SPARKR] fix R MLlib parameter documentation
Repository: spark Updated Branches: refs/heads/master 29952ed09 -> 945c04bcd [MINOR][SPARKR] fix R MLlib parameter documentation ## What changes were proposed in this pull request? Fixed several misplaced param tags - they should be on the spark.* method generics ## How was this patch tested? run knitr. junyangq Author: Felix Cheung. Closes #14792 from felixcheung/rdocmllib. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/945c04bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/945c04bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/945c04bc Branch: refs/heads/master Commit: 945c04bcd439e0624232c040df529f12bcc05e13 Parents: 29952ed Author: Felix Cheung Authored: Wed Aug 24 15:59:09 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 15:59:09 2016 -0700 -- R/pkg/R/mllib.R | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/945c04bc/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index a670600..dfc5a1c 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -444,6 +444,7 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"), #' @param featureIndex The index of the feature if \code{featuresCol} is a vector column #' (default: 0), no effect otherwise #' @param weightCol The weight column name. +#' @param ... additional arguments passed to the method. #' @return \code{spark.isoreg} returns a fitted Isotonic Regression model #' @rdname spark.isoreg #' @aliases spark.isoreg,SparkDataFrame,formula-method @@ -504,7 +505,6 @@ setMethod("predict", signature(object = "IsotonicRegressionModel"), # Get the summary of an IsotonicRegressionModel model -#' @param ... Other optional arguments to summary of an IsotonicRegressionModel #' @return \code{summary} returns the model's boundaries and prediction as lists #' @rdname spark.isoreg #' @aliases summary,IsotonicRegressionModel-method @@ -1074,6 +1074,7 @@ setMethod("predict", signature(object = "AFTSurvivalRegressionModel"), #' @param k number of independent Gaussians in the mixture model. #' @param maxIter maximum iteration number. #' @param tol the convergence tolerance. +#' @param ... additional arguments passed to the method. #' @aliases spark.gaussianMixture,SparkDataFrame,formula-method #' @return \code{spark.gaussianMixture} returns a fitted multivariate gaussian mixture model. #' @rdname spark.gaussianMixture @@ -1117,7 +1118,6 @@ setMethod("spark.gaussianMixture", signature(data = "SparkDataFrame", formula = # Get the summary of a multivariate gaussian mixture model #' @param object a fitted gaussian mixture model. -#' @param ... currently not used argument(s) passed to the method. #' @return \code{summary} returns the model's lambda, mu, sigma and posterior. #' @aliases spark.gaussianMixture,SparkDataFrame,formula-method #' @rdname spark.gaussianMixture
spark git commit: [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON
Repository: spark Updated Branches: refs/heads/master 891ac2b91 -> 29952ed09 [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON ## What changes were proposed in this pull request? ### Default - ISO 8601 Currently, the CSV data source writes `Timestamp` and `Date` in numeric form and the JSON data source writes both as below:
- CSV
```
// TimestampType
1414459800000000
// DateType
16673
```
- Json
```
// TimestampType
1970-01-01 11:46:40.0
// DateType
1970-01-01
```
So, for CSV we can't read back what we write, and for JSON it becomes ambiguous because the timezone information is missing. So, this PR makes both **write** `Timestamp` and `Date` as ISO 8601 formatted strings (please refer to the [ISO 8601 specification](https://www.w3.org/TR/NOTE-datetime)).
- For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)
```
1970-01-01T02:00:01.000-01:00
```
- For `Date` it becomes as below (`yyyy-MM-dd`)
```
1970-01-01
```
### Custom date format option - `dateFormat` This PR also adds the support to write and read dates and timestamps in a formatted string as below:
- **DateType** - With `dateFormat` option (e.g. `yyyy/MM/dd`)
```
+----------+
|      date|
+----------+
|2015/08/26|
|2014/10/27|
|2016/01/28|
+----------+
```
### Custom date format option - `timestampFormat`
- **TimestampType** - With `timestampFormat` option (e.g. `dd/MM/yyyy HH:mm`)
```
+----------------+
|            date|
+----------------+
|2015/08/26 18:00|
|2014/10/27 18:30|
|2016/01/28 20:00|
+----------------+
```
## How was this patch tested? Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests cover the default cases. Author: hyukjinkwon. Closes #14279 from HyukjinKwon/SPARK-16216-json-csv. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/29952ed0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/29952ed0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/29952ed0 Branch: refs/heads/master Commit: 29952ed096fd2a0a19079933ff691671d6f00835 Parents: 891ac2b Author: hyukjinkwon Authored: Wed Aug 24 22:16:20 2016 +0200 Committer: Herman van Hovell Committed: Wed Aug 24 22:16:20 2016 +0200 -- python/pyspark/sql/readwriter.py| 56 +-- python/pyspark/sql/streaming.py | 30 +++- .../org/apache/spark/sql/DataFrameReader.scala | 18 ++- .../org/apache/spark/sql/DataFrameWriter.scala | 12 ++ .../datasources/csv/CSVInferSchema.scala| 42 ++--- .../execution/datasources/csv/CSVOptions.scala | 15 +- .../execution/datasources/csv/CSVRelation.scala | 43 - .../datasources/json/JSONOptions.scala | 9 ++ .../datasources/json/JacksonGenerator.scala | 13 +- .../datasources/json/JacksonParser.scala| 27 +++- .../datasources/json/JsonFileFormat.scala | 5 +- .../spark/sql/streaming/DataStreamReader.scala | 19 ++- .../datasources/csv/CSVInferSchemaSuite.scala | 4 +- .../execution/datasources/csv/CSVSuite.scala| 157 ++- .../datasources/csv/CSVTypeCastSuite.scala | 17 +- .../execution/datasources/json/JsonSuite.scala | 67 +++- .../datasources/json/TestJsonData.scala | 6 + .../sql/sources/JsonHadoopFsRelationSuite.scala | 4 + 18 files changed, 454 insertions(+), 90 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/29952ed0/python/pyspark/sql/readwriter.py -- diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 64de33e..3da6f49 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -156,7 +156,7 @@ class DataFrameReader(OptionUtils): def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, 
allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, - mode=None, columnNameOfCorruptRecord=None): + mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None): """ Loads a JSON file (one object per line) or an RDD of Strings storing JSON objects (one object per record) and returns the result as a :class`DataFrame`. @@ -198,6 +198,14 @@ class DataFrameReader(OptionUtils): ``spark.sql.columnNameOfCorruptRecord``. If None is set, it uses the value specified in
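A writer-side sketch of the new default behavior (a minimal example, assuming a local SparkSession; output paths are illustrative):
```
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val df = Seq(Timestamp.valueOf("1970-01-01 02:00:01")).toDF("ts")
// Default after this patch: an ISO 8601 string such as
// 1970-01-01T02:00:01.000+00:00 (the offset depends on the JVM timezone),
// instead of the raw numeric value the CSV writer used to emit.
df.write.json("/tmp/ts-default")
// The new option overrides the pattern:
df.write.option("timestampFormat", "dd/MM/yyyy HH:mm").csv("/tmp/ts-custom")
```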
spark git commit: [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData
Repository: spark Updated Branches: refs/heads/master 40b30fcf4 -> 891ac2b91 [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData ## What changes were proposed in this pull request? Based on #12990 by tankkyo. Since the History Server currently loads all application's data it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 100000) (This is a "quick fix" to help those running into the problem until an update of how the history server loads app data can be done) ## How was this patch tested? Manual testing and dev/run-tests ![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png) Author: Alex Bozarth. Closes #14673 from ajbozarth/spark15083. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/891ac2b9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/891ac2b9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/891ac2b9 Branch: refs/heads/master Commit: 891ac2b914fb6f90a62c6fbc0a3960a89d1c1d92 Parents: 40b30fc Author: Alex Bozarth Authored: Wed Aug 24 14:39:41 2016 -0500 Committer: Tom Graves Committed: Wed Aug 24 14:39:41 2016 -0500 -- .../apache/spark/internal/config/package.scala | 5 + .../spark/ui/jobs/JobProgressListener.scala | 9 +- .../org/apache/spark/ui/jobs/StagePage.scala| 12 +- .../scala/org/apache/spark/ui/jobs/UIData.scala | 4 +- .../stage_task_list_w__sortBy_expectation.json | 130 ++--- ...ortBy_short_names___runtime_expectation.json | 130 ++--- ...sortBy_short_names__runtime_expectation.json | 182 +-- .../status/api/v1/AllStagesResourceSuite.scala | 4 +- docs/configuration.md | 8 + 9 files changed, 256 insertions(+), 228 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/891ac2b9/core/src/main/scala/org/apache/spark/internal/config/package.scala -- diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala b/core/src/main/scala/org/apache/spark/internal/config/package.scala index be3dac4..47174e4 100644 --- a/core/src/main/scala/org/apache/spark/internal/config/package.scala +++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala @@ -114,4 +114,9 @@ package object config { private[spark] val PYSPARK_PYTHON = ConfigBuilder("spark.pyspark.python") .stringConf .createOptional + + // To limit memory usage, we only track information for a fixed number of tasks + private[spark] val UI_RETAINED_TASKS = ConfigBuilder("spark.ui.retainedTasks") +.intConf +.createWithDefault(100000) } http://git-wip-us.apache.org/repos/asf/spark/blob/891ac2b9/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala b/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala index 491f716..d3a4f9d 100644 --- a/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala +++ b/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala @@ -19,12 +19,13 @@ package org.apache.spark.ui.jobs import java.util.concurrent.TimeoutException -import scala.collection.mutable.{HashMap, HashSet, ListBuffer} +import scala.collection.mutable.{HashMap, HashSet, LinkedHashMap, ListBuffer} import org.apache.spark._ import org.apache.spark.annotation.DeveloperApi import org.apache.spark.executor.TaskMetrics import org.apache.spark.internal.Logging +import 
org.apache.spark.internal.config._ import org.apache.spark.scheduler._ import org.apache.spark.scheduler.SchedulingMode.SchedulingMode import org.apache.spark.storage.BlockManagerId @@ -93,6 +94,7 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging { val retainedStages = conf.getInt("spark.ui.retainedStages", SparkUI.DEFAULT_RETAINED_STAGES) val retainedJobs = conf.getInt("spark.ui.retainedJobs", SparkUI.DEFAULT_RETAINED_JOBS) + val retainedTasks = conf.get(UI_RETAINED_TASKS) // We can test for memory leaks by ensuring that collections that track non-active jobs and // stages do not grow without bound and that collections for active jobs/stages eventually become @@ -405,6 +407,11 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging { taskData.updateTaskMetrics(taskMetrics) taskData.errorMessage = errorMessage + // If Tasks is too large, remove
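The listener bounds memory by keeping task rows in a `LinkedHashMap` (insertion-ordered) and evicting the oldest entries once the count exceeds `spark.ui.retainedTasks`. A sketch of that trimming step (names are illustrative; the real logic lives in `JobProgressListener`):
```
import scala.collection.mutable.LinkedHashMap

def trimTasksIfNecessary[K, V](taskData: LinkedHashMap[K, V], retainedTasks: Int): Unit = {
  if (taskData.size > retainedTasks) {
    // Evict a batch of the oldest entries rather than one per update.
    val toRemove = math.max(retainedTasks / 10, 1)
    taskData.keys.take(toRemove).toList.foreach(taskData.remove)
  }
}
```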
spark git commit: [SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, percent_rank, cume_dist
Repository: spark Updated Branches: refs/heads/master 0b3a4be92 -> 40b30fcf4 [SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, percent_rank, cume_dist ## What changes were proposed in this pull request? Currently, two-word window functions like `row_number`, `dense_rank`, `percent_rank`, and `cume_dist` are expressed without `_` in error messages. We had better show the correct names. **Before** ```scala scala> sql("select row_number()").show java.lang.UnsupportedOperationException: Cannot evaluate expression: rownumber() ``` **After** ```scala scala> sql("select row_number()").show java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number() ``` ## How was this patch tested? Pass the Jenkins and manual. Author: Dongjoon Hyun. Closes #14571 from dongjoon-hyun/SPARK-16983. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40b30fcf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40b30fcf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40b30fcf Branch: refs/heads/master Commit: 40b30fcf453169534cb53d01cd22236210b13005 Parents: 0b3a4be Author: Dongjoon Hyun Authored: Wed Aug 24 21:14:40 2016 +0200 Committer: Herman van Hovell Committed: Wed Aug 24 21:14:40 2016 +0200 -- .../sql/catalyst/expressions/windowExpressions.scala | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/40b30fcf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index 6806591..b47486f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -477,7 +477,7 @@ object SizeBasedWindowFunction { the window partition.""") case class RowNumber() extends RowNumberLike { override val evaluateExpression = rowNumber - override def sql: String = "ROW_NUMBER()" + override def prettyName: String = "row_number" } /** @@ -497,7 +497,7 @@ case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction { // return the same value for equal values in the partition. 
override val frame = SpecifiedWindowFrame(RangeFrame, UnboundedPreceding, CurrentRow) override val evaluateExpression = Divide(Cast(rowNumber, DoubleType), Cast(n, DoubleType)) - override def sql: String = "CUME_DIST()" + override def prettyName: String = "cume_dist" } /** @@ -628,6 +628,8 @@ abstract class RankLike extends AggregateWindowFunction { override val updateExpressions = increaseRank +: increaseRowNumber +: children override val evaluateExpression: Expression = rank + override def sql: String = s"${prettyName.toUpperCase}()" + def withOrder(order: Seq[Expression]): RankLike } @@ -649,7 +651,6 @@ abstract class RankLike extends AggregateWindowFunction { case class Rank(children: Seq[Expression]) extends RankLike { def this() = this(Nil) override def withOrder(order: Seq[Expression]): Rank = Rank(order) - override def sql: String = "RANK()" } /** @@ -674,7 +675,7 @@ case class DenseRank(children: Seq[Expression]) extends RankLike { override val updateExpressions = increaseRank +: children override val aggBufferAttributes = rank +: orderAttrs override val initialValues = zero +: orderInit - override def sql: String = "DENSE_RANK()" + override def prettyName: String = "dense_rank" } /** @@ -701,5 +702,5 @@ case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBase override val evaluateExpression = If(GreaterThan(n, one), Divide(Cast(Subtract(rank, one), DoubleType), Cast(Subtract(n, one), DoubleType)), Literal(0.0d)) - override def sql: String = "PERCENT_RANK()" + override def prettyName: String = "percent_rank" }
spark git commit: [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
Repository: spark Updated Branches: refs/heads/branch-2.0 29091d7cd -> 9f924a01b [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment ## What changes were proposed in this pull request? Update to py4j 0.10.3 to enable JAVA_HOME support ## How was this patch tested? Pyspark tests Author: Sean Owen. Closes #14748 from srowen/SPARK-16781. (cherry picked from commit 0b3a4be92ca6b38eef32ea5ca240d9f91f68aa65) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9f924a01 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9f924a01 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9f924a01 Branch: refs/heads/branch-2.0 Commit: 9f924a01b27ebba56080c9ad01b84fff026d5dcd Parents: 29091d7 Author: Sean Owen Authored: Wed Aug 24 20:04:09 2016 +0100 Committer: Sean Owen Committed: Wed Aug 24 20:04:20 2016 +0100 -- LICENSE| 2 +- bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2.2 | 2 +- dev/deps/spark-deps-hadoop-2.3 | 2 +- dev/deps/spark-deps-hadoop-2.4 | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- python/docs/Makefile | 2 +- python/lib/py4j-0.10.1-src.zip | Bin 61356 -> 0 bytes python/lib/py4j-0.10.3-src.zip | Bin 0 -> 91275 bytes sbin/spark-config.sh | 2 +- .../org/apache/spark/deploy/yarn/Client.scala | 6 +++--- .../spark/deploy/yarn/YarnClusterSuite.scala | 2 +- 16 files changed, 16 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/LICENSE -- diff --git a/LICENSE b/LICENSE index 94fd46f..d68609c 100644 --- a/LICENSE +++ b/LICENSE @@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt. 
(New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf) (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net) (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net) - (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.1 - http://py4j.sourceforge.net/) + (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.3 - http://py4j.sourceforge.net/) (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/) (BSD licence) sbt and sbt-launch-lib.bash (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE) http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/bin/pyspark -- diff --git a/bin/pyspark b/bin/pyspark index ac8aa04..037645d 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -65,7 +65,7 @@ export PYSPARK_PYTHON # Add the PySpark classes to the Python path: export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH" -export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH" +export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH" # Load the PySpark shell.py script when ./pyspark is used interactively: export OLD_PYTHONSTARTUP="$PYTHONSTARTUP" http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/bin/pyspark2.cmd -- diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd index 3e2ff10..1217a4f 100644 --- a/bin/pyspark2.cmd +++ b/bin/pyspark2.cmd @@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% -set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.1-src.zip;%PYTHONPATH% +set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.3-src.zip;%PYTHONPATH% set OLD_PYTHONSTARTUP=%PYTHONSTARTUP% set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index bb27ec9..208659b 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -327,7 +327,7 @@ net.sf.py4j
spark git commit: [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
Repository: spark Updated Branches: refs/heads/master 2fbdb6063 -> 0b3a4be92 [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment ## What changes were proposed in this pull request? Update to py4j 0.10.3 to enable JAVA_HOME support ## How was this patch tested? Pyspark tests Author: Sean Owen. Closes #14748 from srowen/SPARK-16781. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b3a4be9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b3a4be9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b3a4be9 Branch: refs/heads/master Commit: 0b3a4be92ca6b38eef32ea5ca240d9f91f68aa65 Parents: 2fbdb60 Author: Sean Owen Authored: Wed Aug 24 20:04:09 2016 +0100 Committer: Sean Owen Committed: Wed Aug 24 20:04:09 2016 +0100 -- LICENSE| 2 +- bin/pyspark| 2 +- bin/pyspark2.cmd | 2 +- core/pom.xml | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 2 +- dev/deps/spark-deps-hadoop-2.2 | 2 +- dev/deps/spark-deps-hadoop-2.3 | 2 +- dev/deps/spark-deps-hadoop-2.4 | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- python/docs/Makefile | 2 +- python/lib/py4j-0.10.1-src.zip | Bin 61356 -> 0 bytes python/lib/py4j-0.10.3-src.zip | Bin 0 -> 91275 bytes sbin/spark-config.sh | 2 +- .../org/apache/spark/deploy/yarn/Client.scala | 6 +++--- .../spark/deploy/yarn/YarnClusterSuite.scala | 2 +- 16 files changed, 16 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/LICENSE -- diff --git a/LICENSE b/LICENSE index 94fd46f..d68609c 100644 --- a/LICENSE +++ b/LICENSE @@ -263,7 +263,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt. (New BSD license) Protocol Buffer Java API (org.spark-project.protobuf:protobuf-java:2.4.1-shaded - http://code.google.com/p/protobuf) (The BSD License) Fortran to Java ARPACK (net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net) (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - http://xmlenc.sourceforge.net) - (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.1 - http://py4j.sourceforge.net/) + (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.3 - http://py4j.sourceforge.net/) (Two-clause BSD-style license) JUnit-Interface (com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/) (BSD licence) sbt and sbt-launch-lib.bash (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE) http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/bin/pyspark -- diff --git a/bin/pyspark b/bin/pyspark index a0d7e22..7590309 100755 --- a/bin/pyspark +++ b/bin/pyspark @@ -57,7 +57,7 @@ export PYSPARK_PYTHON # Add the PySpark classes to the Python path: export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH" -export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH" +export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH" # Load the PySpark shell.py script when ./pyspark is used interactively: export OLD_PYTHONSTARTUP="$PYTHONSTARTUP" http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/bin/pyspark2.cmd -- diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd index 3e2ff10..1217a4f 100644 --- a/bin/pyspark2.cmd +++ b/bin/pyspark2.cmd @@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" ( ) set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH% -set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.1-src.zip;%PYTHONPATH% +set 
PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.3-src.zip;%PYTHONPATH% set OLD_PYTHONSTARTUP=%PYTHONSTARTUP% set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 04b94a2..ab6c3ce 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -326,7 +326,7 @@ net.sf.py4j py4j - 0.10.1 + 0.10.3 org.apache.spark
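The user-visible effect of the py4j 0.10.3 bump is that the gateway JVM launched for PySpark can honor `JAVA_HOME`, so it matches the JVM used by the rest of the Spark environment instead of whatever `java` happens to be first on the `PATH`. As a rough illustration only — this is a hedged sketch of the resolution order, not Spark's or py4j's actual launcher code — the behavior looks like:

```scala
import java.io.File

// Sketch of JAVA_HOME-first resolution: prefer the JVM configured for the
// environment, and fall back to a bare PATH lookup (the pre-0.10.3 behavior).
object JavaResolver {
  def resolveJavaExecutable(): String =
    sys.env.get("JAVA_HOME") match {
      case Some(home) =>
        val candidate = new File(new File(home, "bin"), "java")
        if (candidate.canExecute) candidate.getAbsolutePath else "java"
      case None => "java" // PATH lookup
    }
}
```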
spark git commit: [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR
Repository: spark Updated Branches: refs/heads/master d2932a0e9 -> 2fbdb6063 [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR https://issues.apache.org/jira/browse/SPARK-16445 ## What changes were proposed in this pull request? Create Multilayer Perceptron Classifier wrapper in SparkR ## How was this patch tested? Tested manually on local machine Author: Xin Ren. Closes #14447 from keypointt/SPARK-16445. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2fbdb606 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2fbdb606 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2fbdb606 Branch: refs/heads/master Commit: 2fbdb606392631b1dff88ec86f388cc2559c28f5 Parents: d2932a0 Author: Xin Ren Authored: Wed Aug 24 11:18:10 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 11:18:10 2016 -0700 -- R/pkg/NAMESPACE | 1 + R/pkg/R/generics.R | 4 + R/pkg/R/mllib.R | 125 - R/pkg/inst/tests/testthat/test_mllib.R | 32 + .../MultilayerPerceptronClassifierWrapper.scala | 134 +++ .../scala/org/apache/spark/ml/r/RWrappers.scala | 2 + 6 files changed, 293 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2fbdb606/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 7090576..ad587a6 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -27,6 +27,7 @@ exportMethods("glm", "summary", "spark.kmeans", "fitted", + "spark.mlp", "spark.naiveBayes", "spark.survreg", "spark.lda", http://git-wip-us.apache.org/repos/asf/spark/blob/2fbdb606/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 4e6..7e626be 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1330,6 +1330,10 @@ setGeneric("spark.kmeans", function(data, formula, ...) { standardGeneric("spark #' @export setGeneric("fitted") +#' @rdname spark.mlp +#' @export +setGeneric("spark.mlp", function(data, ...) { standardGeneric("spark.mlp") }) + #' @rdname spark.naiveBayes #' @export setGeneric("spark.naiveBayes", function(data, formula, ...)
{ standardGeneric("spark.naiveBayes") }) http://git-wip-us.apache.org/repos/asf/spark/blob/2fbdb606/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index a40310d..a670600 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -60,6 +60,13 @@ setClass("AFTSurvivalRegressionModel", representation(jobj = "jobj")) #' @note KMeansModel since 2.0.0 setClass("KMeansModel", representation(jobj = "jobj")) +#' S4 class that represents a MultilayerPerceptronClassificationModel +#' +#' @param jobj a Java object reference to the backing Scala MultilayerPerceptronClassifierWrapper +#' @export +#' @note MultilayerPerceptronClassificationModel since 2.1.0 +setClass("MultilayerPerceptronClassificationModel", representation(jobj = "jobj")) + #' S4 class that represents an IsotonicRegressionModel #' #' @param jobj a Java object reference to the backing Scala IsotonicRegressionModel @@ -90,7 +97,7 @@ setClass("ALSModel", representation(jobj = "jobj")) #' @export #' @seealso \link{spark.glm}, \link{glm}, #' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans}, -#' @seealso \link{spark.lda}, \link{spark.naiveBayes}, \link{spark.survreg}, +#' @seealso \link{spark.lda}, \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg} #' @seealso \link{read.ml} NULL @@ -103,7 +110,7 @@ NULL #' @export #' @seealso \link{spark.glm}, \link{glm}, #' @seealso \link{spark.als}, \link{spark.gaussianMixture}, \link{spark.isoreg}, \link{spark.kmeans}, -#' @seealso \link{spark.naiveBayes}, \link{spark.survreg}, +#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg} NULL write_internal <- function(object, path, overwrite = FALSE) { @@ -631,6 +638,95 @@ setMethod("predict", signature(object = "KMeansModel"), predict_internal(object, newData) }) +#' Multilayer Perceptron Classification Model +#' +#' \code{spark.mlp} fits a multi-layer perceptron neural network model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted
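The new `spark.mlp` generic is a thin R wrapper over the Scala-side `MultilayerPerceptronClassifierWrapper`, which in turn drives `org.apache.spark.ml.classification.MultilayerPerceptronClassifier`. For readers following along on the JVM side, here is a minimal Scala sketch of the wrapped estimator; the dataset path assumes you are running from a Spark source checkout (it is the multiclass sample shipped with Spark), and the layer sizes are illustrative:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mlp-sketch").getOrCreate()

// Assumed path: the example dataset bundled in the Spark source tree.
val train = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

// Layers: 4 input features, two hidden layers (5 and 4 units), 3 classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)
model.transform(train).select("prediction").show(5)
```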
spark git commit: [SPARKR][MINOR] Fix doc for show method
Repository: spark Updated Branches: refs/heads/branch-2.0 33d79b587 -> 29091d7cd [SPARKR][MINOR] Fix doc for show method ## What changes were proposed in this pull request? The original doc of `show` put the methods for multiple classes together, but the text only talks about `SparkDataFrame`. This PR fixes this problem. ## How was this patch tested? Manual test. Author: Junyang Qian. Closes #14776 from junyangq/SPARK-FixShowDoc. (cherry picked from commit d2932a0e987132c694ed59515b7c77adaad052e6) Signed-off-by: Felix Cheung Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/29091d7c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/29091d7c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/29091d7c Branch: refs/heads/branch-2.0 Commit: 29091d7cd60c20bf019dc9c1625a22e80ea50928 Parents: 33d79b5 Author: Junyang Qian Authored: Wed Aug 24 10:40:09 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 10:40:26 2016 -0700 -- R/pkg/R/DataFrame.R | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/29091d7c/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f8a05c6..ab45d2c 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -205,9 +205,9 @@ setMethod("showDF", #' show #' -#' Print the SparkDataFrame column names and types +#' Print class and type information of a Spark object. #' -#' @param object a SparkDataFrame. +#' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, WindowSpec. #' #' @family SparkDataFrame functions #' @rdname show
spark git commit: [SPARKR][MINOR] Fix doc for show method
Repository: spark Updated Branches: refs/heads/master 45b786aca -> d2932a0e9 [SPARKR][MINOR] Fix doc for show method ## What changes were proposed in this pull request? The original doc of `show` put the methods for multiple classes together, but the text only talks about `SparkDataFrame`. This PR fixes this problem. ## How was this patch tested? Manual test. Author: Junyang Qian. Closes #14776 from junyangq/SPARK-FixShowDoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2932a0e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2932a0e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2932a0e Branch: refs/heads/master Commit: d2932a0e987132c694ed59515b7c77adaad052e6 Parents: 45b786a Author: Junyang Qian Authored: Wed Aug 24 10:40:09 2016 -0700 Committer: Felix Cheung Committed: Wed Aug 24 10:40:09 2016 -0700 -- R/pkg/R/DataFrame.R | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2932a0e/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 52a6628..e12b58e 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -212,9 +212,9 @@ setMethod("showDF", #' show #' -#' Print the SparkDataFrame column names and types +#' Print class and type information of a Spark object. #' -#' @param object a SparkDataFrame. +#' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, WindowSpec. #' #' @family SparkDataFrame functions #' @rdname show
spark git commit: [MINOR][DOC] Fix wrong ml.feature.Normalizer document.
Repository: spark Updated Branches: refs/heads/master 92c0eaf34 -> 45b786aca [MINOR][DOC] Fix wrong ml.feature.Normalizer document. ## What changes were proposed in this pull request? The ```ml.feature.Normalizer``` examples illustrate the L1 norm rather than L2, so the corresponding document should be corrected. ![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png) ## How was this patch tested? Doc change, no test. Author: Yanbo Liang. Closes #14787 from yanboliang/normalizer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45b786ac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45b786ac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45b786ac Branch: refs/heads/master Commit: 45b786aca2b5818dc233643e6b3a53b869560563 Parents: 92c0eaf Author: Yanbo Liang Authored: Wed Aug 24 08:24:16 2016 -0700 Committer: Yanbo Liang Committed: Wed Aug 24 08:24:16 2016 -0700 -- docs/ml-features.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/45b786ac/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6020114..e41bf78 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -734,7 +734,7 @@ for more details on the API. `Normalizer` is a `Transformer` which transforms a dataset of `Vector` rows, normalizing each `Vector` to have unit norm. It takes parameter `p`, which specifies the [p-norm](http://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm) used for normalization. ($p = 2$ by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms. -The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^2$ norm and unit $L^\infty$ norm. +The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
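The corrected sentence now matches what the documented example actually does: it sets `p = 1` for the first normalization and then uses infinity for the second. A condensed Scala sketch of both variants, assuming an input DataFrame `df` with a `Vector` column named "features":

```scala
import org.apache.spark.ml.feature.Normalizer

// Normalize each row to unit L^1 norm -- this is what the example shows,
// hence the doc fix from "L^2" to "L^1".
val l1NormData = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)
  .transform(df)

// Normalize each row to unit L^infinity norm by setting p to +inf.
val lInfNormData = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeaturesInf")
  .setP(Double.PositiveInfinity)
  .transform(df)
```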
spark git commit: [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated
Repository: spark Updated Branches: refs/heads/branch-2.0 ce7dce175 -> 33d79b587 [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated ## What changes were proposed in this pull request? In cases when QuantileDiscretizer is called on a numeric array with duplicated elements, we take the unique elements generated from approxQuantiles as input for Bucketizer. ## How was this patch tested? A unit test is added in QuantileDiscretizerSuite. Previously, QuantileDiscretizer.fit would throw an IllegalArgumentException when calling setSplits on a list of splits with duplicated elements, since Bucketizer.setSplits only accepts a numeric vector of two or more unique cut points; deduplicating the splits may produce fewer buckets than requested. Signed-off-by: VinceShieh Author: VinceShieh. Closes #14747 from VinceShieh/SPARK-17086. (cherry picked from commit 92c0eaf348b42b3479610da0be761013f9d81c54) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33d79b58 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33d79b58 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33d79b58 Branch: refs/heads/branch-2.0 Commit: 33d79b58735770ac613540c21095a1e404f065b0 Parents: ce7dce1 Author: VinceShieh Authored: Wed Aug 24 10:16:58 2016 +0100 Committer: Sean Owen Committed: Wed Aug 24 13:46:40 2016 +0100 -- .../spark/ml/feature/QuantileDiscretizer.scala | 7 ++- .../ml/feature/QuantileDiscretizerSuite.scala| 19 +++ 2 files changed, 25 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/33d79b58/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 558a7bb..e098008 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -114,7 +114,12 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui splits(0) = Double.NegativeInfinity splits(splits.length - 1) = Double.PositiveInfinity -val bucketizer = new Bucketizer(uid).setSplits(splits) +val distinctSplits = splits.distinct +if (splits.length != distinctSplits.length) { + log.warn(s"Some quantiles were identical. 
Bucketing to ${distinctSplits.length - 1}" + +s" buckets as a result.") +} +val bucketizer = new Bucketizer(uid).setSplits(distinctSplits.sorted) copyValues(bucketizer.setParent(this)) } http://git-wip-us.apache.org/repos/asf/spark/blob/33d79b58/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala index b73dbd6..18f1e89 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala @@ -52,6 +52,25 @@ class QuantileDiscretizerSuite "Bucket sizes are not within expected relative error tolerance.") } + test("Test Bucketizer on duplicated splits") { +val spark = this.spark +import spark.implicits._ + +val datasetSize = 12 +val numBuckets = 5 +val df = sc.parallelize(Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)) + .map(Tuple1.apply).toDF("input") +val discretizer = new QuantileDiscretizer() + .setInputCol("input") + .setOutputCol("result") + .setNumBuckets(numBuckets) +val result = discretizer.fit(df).transform(df) + +val observedNumBuckets = result.select("result").distinct.count +assert(2 <= observedNumBuckets && observedNumBuckets <= numBuckets, + "Observed number of buckets are not within expected range.") + } + test("Test transform method on unseen data") { val spark = this.spark import spark.implicits._
spark git commit: [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated
Repository: spark Updated Branches: refs/heads/master 673a80d22 -> 92c0eaf34 [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated ## What changes were proposed in this pull request? In cases when QuantileDiscretizer is called on a numeric array with duplicated elements, we take the unique elements generated from approxQuantiles as input for Bucketizer. ## How was this patch tested? A unit test is added in QuantileDiscretizerSuite. Previously, QuantileDiscretizer.fit would throw an IllegalArgumentException when calling setSplits on a list of splits with duplicated elements, since Bucketizer.setSplits only accepts a numeric vector of two or more unique cut points; deduplicating the splits may produce fewer buckets than requested. Signed-off-by: VinceShieh Author: VinceShieh. Closes #14747 from VinceShieh/SPARK-17086. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92c0eaf3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92c0eaf3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92c0eaf3 Branch: refs/heads/master Commit: 92c0eaf348b42b3479610da0be761013f9d81c54 Parents: 673a80d Author: VinceShieh Authored: Wed Aug 24 10:16:58 2016 +0100 Committer: Sean Owen Committed: Wed Aug 24 10:16:58 2016 +0100 -- .../spark/ml/feature/QuantileDiscretizer.scala | 7 ++- .../ml/feature/QuantileDiscretizerSuite.scala| 19 +++ 2 files changed, 25 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/92c0eaf3/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 558a7bb..e098008 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -114,7 +114,12 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui splits(0) = Double.NegativeInfinity splits(splits.length - 1) = Double.PositiveInfinity -val bucketizer = new Bucketizer(uid).setSplits(splits) +val distinctSplits = splits.distinct +if (splits.length != distinctSplits.length) { + log.warn(s"Some quantiles were identical. 
Bucketing to ${distinctSplits.length - 1}" + +s" buckets as a result.") +} +val bucketizer = new Bucketizer(uid).setSplits(distinctSplits.sorted) copyValues(bucketizer.setParent(this)) } http://git-wip-us.apache.org/repos/asf/spark/blob/92c0eaf3/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala index b73dbd6..18f1e89 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala @@ -52,6 +52,25 @@ class QuantileDiscretizerSuite "Bucket sizes are not within expected relative error tolerance.") } + test("Test Bucketizer on duplicated splits") { +val spark = this.spark +import spark.implicits._ + +val datasetSize = 12 +val numBuckets = 5 +val df = sc.parallelize(Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)) + .map(Tuple1.apply).toDF("input") +val discretizer = new QuantileDiscretizer() + .setInputCol("input") + .setOutputCol("result") + .setNumBuckets(numBuckets) +val result = discretizer.fit(df).transform(df) + +val observedNumBuckets = result.select("result").distinct.count +assert(2 <= observedNumBuckets && observedNumBuckets <= numBuckets, + "Observed number of buckets are not within expected range.") + } + test("Test transform method on unseen data") { val spark = this.spark import spark.implicits._
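A usage sketch mirroring the new test case: on skewed data whose quantiles coincide, the fitted `Bucketizer` now simply ends up with fewer buckets (plus a logged warning) instead of failing with an `IllegalArgumentException`:

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

// Same skewed input as the new test: only three distinct values, so several
// of the five requested quantiles coincide.
val df = Seq(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
  .map(Tuple1.apply).toDF("input") // assumes spark.implicits._ is in scope

val discretizer = new QuantileDiscretizer()
  .setInputCol("input")
  .setOutputCol("result")
  .setNumBuckets(5)

val bucketed = discretizer.fit(df).transform(df)
// Anywhere from 2 up to the 5 requested buckets may be observed.
println(bucketed.select("result").distinct.count)
```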
spark git commit: [MINOR][BUILD] Fix Java CheckStyle Error
Repository: spark Updated Branches: refs/heads/branch-2.0 df87f161c -> ce7dce175 [MINOR][BUILD] Fix Java CheckStyle Error As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release. Before: ``` ./dev/lint-java Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119). [ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103). ``` After: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Manual. Author: Weiqing Yang. Closes #14768 from Sherry302/fixjavastyle. (cherry picked from commit 673a80d2230602c9e6573a23e35fb0f6b832bfca) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce7dce17 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce7dce17 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce7dce17 Branch: refs/heads/branch-2.0 Commit: ce7dce1755a8d36ec7346adc3de26d8fdc4f05e9 Parents: df87f16 Author: Weiqing Yang Authored: Wed Aug 24 10:12:44 2016 +0100 Committer: Sean Owen Committed: Wed Aug 24 10:15:53 2016 +0100 -- .../spark/util/collection/unsafe/sort/UnsafeExternalSorter.java | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ce7dce17/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java index 0d67167..999ded4 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java @@ -521,7 +521,8 @@ public final class UnsafeExternalSorter extends MemoryConsumer { // is accessing the current record. We free this page in that caller's next loadNext() // call. for (MemoryBlock page : allocatedPages) { -if (!loaded || page.pageNumber != ((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) { +if (!loaded || page.pageNumber != + ((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) { released += page.size(); freePage(page); } else {
spark git commit: [MINOR][BUILD] Fix Java CheckStyle Error
Repository: spark Updated Branches: refs/heads/master 52fa45d62 -> 673a80d22 [MINOR][BUILD] Fix Java CheckStyle Error ## What changes were proposed in this pull request? As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release. Before: ``` ./dev/lint-java Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119). [ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103). ``` After: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` ## How was this patch tested? Manual. Author: Weiqing Yang. Closes #14768 from Sherry302/fixjavastyle. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/673a80d2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/673a80d2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/673a80d2 Branch: refs/heads/master Commit: 673a80d2230602c9e6573a23e35fb0f6b832bfca Parents: 52fa45d Author: Weiqing Yang Authored: Wed Aug 24 10:12:44 2016 +0100 Committer: Sean Owen Committed: Wed Aug 24 10:12:44 2016 +0100 -- .../collection/unsafe/sort/UnsafeExternalSorter.java | 3 ++- .../sql/streaming/JavaStructuredNetworkWordCount.java| 11 ++- 2 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/673a80d2/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java index ccf7664..196e67d 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java @@ -522,7 +522,8 @@ public final class UnsafeExternalSorter extends MemoryConsumer { // is accessing the current record. We free this page in that caller's next loadNext() // call. 
for (MemoryBlock page : allocatedPages) { -if (!loaded || page.pageNumber != ((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) { +if (!loaded || page.pageNumber != + ((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) { released += page.size(); freePage(page); } else { http://git-wip-us.apache.org/repos/asf/spark/blob/673a80d2/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java -- diff --git a/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java index c913ee0..5f342e1 100644 --- a/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java +++ b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java @@ -61,11 +61,12 @@ public final class JavaStructuredNetworkWordCount { .load(); // Split the lines into words -Dataset words = lines.as(Encoders.STRING()).flatMap(new FlatMapFunction () { - @Override - public Iterator call(String x) { -return Arrays.asList(x.split(" ")).iterator(); - } +Dataset words = lines.as(Encoders.STRING()) + .flatMap(new FlatMapFunction () { +@Override +public Iterator call(String x) { + return Arrays.asList(x.split(" ")).iterator(); +} }, Encoders.STRING()); // Generate running word count
spark git commit: [SPARK-17186][SQL] remove catalog table type INDEX
Repository: spark Updated Branches: refs/heads/branch-2.0 a6e6a047b -> df87f161c [SPARK-17186][SQL] remove catalog table type INDEX ## What changes were proposed in this pull request? Spark SQL doesn't actually support indexes; the catalog table type `INDEX` comes from Hive. However, most operations in Spark SQL can't handle index tables, e.g. create table, alter table, etc. Logically, index tables should be invisible to end users, and Hive generates special table names for index tables to keep users from accessing them directly. Hive has dedicated SQL syntax to create/show/drop index tables. On the Spark SQL side, although we can describe an index table directly, the result is unreadable, so the dedicated SQL syntax (e.g. `SHOW INDEX ON tbl`) should be used instead. Spark SQL can also read an index table directly, but the result is always empty. (Can Hive read index tables directly?) This PR removes the table type `INDEX` to make it clear that Spark SQL doesn't currently support indexes. ## How was this patch tested? Existing tests. Author: Wenchen Fan. Closes #14752 from cloud-fan/minor2. (cherry picked from commit 52fa45d62a5a0bc832442f38f9e634c5d8e29e08) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/df87f161 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/df87f161 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/df87f161 Branch: refs/heads/branch-2.0 Commit: df87f161c9e40a49235ea722f6a662a488b41c4c Parents: a6e6a04 Author: Wenchen Fan Authored: Tue Aug 23 23:46:09 2016 -0700 Committer: Reynold Xin Committed: Tue Aug 23 23:46:17 2016 -0700 -- .../org/apache/spark/sql/catalyst/catalog/interface.scala| 1 - .../org/apache/spark/sql/execution/command/tables.scala | 8 +++- .../scala/org/apache/spark/sql/hive/MetastoreRelation.scala | 1 - .../org/apache/spark/sql/hive/client/HiveClientImpl.scala| 4 ++-- .../apache/spark/sql/hive/execution/HiveCommandSuite.scala | 2 +- 5 files changed, 6 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/df87f161/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index 6197aca..c083cf6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -203,7 +203,6 @@ case class CatalogTableType private(name: String) object CatalogTableType { val EXTERNAL = new CatalogTableType("EXTERNAL") val MANAGED = new CatalogTableType("MANAGED") - val INDEX = new CatalogTableType("INDEX") val VIEW = new CatalogTableType("VIEW") } http://git-wip-us.apache.org/repos/asf/spark/blob/df87f161/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index b2300b4..a5ccbcf 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -678,12 +678,11 @@ case class ShowPartitionsCommand( * Validate and throws an [[AnalysisException]] exception under the following conditions: * 1. If the table is not partitioned. * 2. 
If it is a datasource table. - * 3. If it is a view or index table. + * 3. If it is a view. */ -if (tab.tableType == VIEW || - tab.tableType == INDEX) { +if (tab.tableType == VIEW) { throw new AnalysisException( -s"SHOW PARTITIONS is not allowed on a view or index table: ${tab.qualifiedName}") +s"SHOW PARTITIONS is not allowed on a view: ${tab.qualifiedName}") } if (!DDLUtils.isTablePartitioned(tab)) { @@ -765,7 +764,6 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman case EXTERNAL => " EXTERNAL TABLE" case VIEW => " VIEW" case MANAGED => " TABLE" - case INDEX => reportUnsupportedError(Seq("index table")) } builder ++= s"CREATE$tableTypeString ${table.quotedString}" http://git-wip-us.apache.org/repos/asf/spark/blob/df87f161/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala
spark git commit: [SPARK-17186][SQL] remove catalog table type INDEX
Repository: spark Updated Branches: refs/heads/master b9994ad05 -> 52fa45d62 [SPARK-17186][SQL] remove catalog table type INDEX ## What changes were proposed in this pull request? Spark SQL doesn't actually support indexes; the catalog table type `INDEX` comes from Hive. However, most operations in Spark SQL can't handle index tables, e.g. create table, alter table, etc. Logically, index tables should be invisible to end users, and Hive generates special table names for index tables to keep users from accessing them directly. Hive has dedicated SQL syntax to create/show/drop index tables. On the Spark SQL side, although we can describe an index table directly, the result is unreadable, so the dedicated SQL syntax (e.g. `SHOW INDEX ON tbl`) should be used instead. Spark SQL can also read an index table directly, but the result is always empty. (Can Hive read index tables directly?) This PR removes the table type `INDEX` to make it clear that Spark SQL doesn't currently support indexes. ## How was this patch tested? Existing tests. Author: Wenchen Fan. Closes #14752 from cloud-fan/minor2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/52fa45d6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/52fa45d6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/52fa45d6 Branch: refs/heads/master Commit: 52fa45d62a5a0bc832442f38f9e634c5d8e29e08 Parents: b9994ad Author: Wenchen Fan Authored: Tue Aug 23 23:46:09 2016 -0700 Committer: Reynold Xin Committed: Tue Aug 23 23:46:09 2016 -0700 -- .../org/apache/spark/sql/catalyst/catalog/interface.scala| 1 - .../org/apache/spark/sql/execution/command/tables.scala | 8 +++- .../scala/org/apache/spark/sql/hive/MetastoreRelation.scala | 1 - .../org/apache/spark/sql/hive/client/HiveClientImpl.scala| 4 ++-- .../apache/spark/sql/hive/execution/HiveCommandSuite.scala | 2 +- 5 files changed, 6 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/52fa45d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index f7762e0..83e01f9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -200,7 +200,6 @@ case class CatalogTableType private(name: String) object CatalogTableType { val EXTERNAL = new CatalogTableType("EXTERNAL") val MANAGED = new CatalogTableType("MANAGED") - val INDEX = new CatalogTableType("INDEX") val VIEW = new CatalogTableType("VIEW") } http://git-wip-us.apache.org/repos/asf/spark/blob/52fa45d6/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index 21544a3..b4a15b8 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -620,12 +620,11 @@ case class ShowPartitionsCommand( * Validate and throws an [[AnalysisException]] exception under the following conditions: * 1. If the table is not partitioned. * 2. If it is a datasource table. - * 3. If it is a view or index table. + * 3. If it is a view. 
*/ -if (tab.tableType == VIEW || - tab.tableType == INDEX) { +if (tab.tableType == VIEW) { throw new AnalysisException( -s"SHOW PARTITIONS is not allowed on a view or index table: ${tab.qualifiedName}") +s"SHOW PARTITIONS is not allowed on a view: ${tab.qualifiedName}") } if (tab.partitionColumnNames.isEmpty) { @@ -708,7 +707,6 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman case EXTERNAL => " EXTERNAL TABLE" case VIEW => " VIEW" case MANAGED => " TABLE" - case INDEX => reportUnsupportedError(Seq("index table")) } builder ++= s"CREATE$tableTypeString ${table.quotedString}" http://git-wip-us.apache.org/repos/asf/spark/blob/52fa45d6/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala
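The user-visible effect, as a hedged sketch (the view name `v` and the surrounding `SparkSession` named `spark` are hypothetical): `SHOW PARTITIONS` on a view now fails with the simplified message from this patch, with no mention of index tables.

```scala
// Assumes an existing view named "v" and a SparkSession in scope as `spark`.
try {
  spark.sql("SHOW PARTITIONS v")
} catch {
  // Per the diff, the message now reads roughly:
  //   "SHOW PARTITIONS is not allowed on a view: default.v"
  case e: org.apache.spark.sql.AnalysisException => println(e.getMessage)
}
```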
spark git commit: [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala'
Repository: spark Updated Branches: refs/heads/branch-2.0 a772b4b5d -> a6e6a047b [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala' ## What changes were proposed in this pull request? This PR removes implemented functions from the comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`. ## How was this patch tested? Manual. Author: Weiqing Yang. Closes #14769 from Sherry302/cleanComment. (cherry picked from commit b9994ad05628077016331e6b411fbc09017b1e63) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6e6a047 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6e6a047 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6e6a047 Branch: refs/heads/branch-2.0 Commit: a6e6a047bb9215df55b009957d4c560624d886fc Parents: a772b4b Author: Weiqing Yang Authored: Tue Aug 23 23:44:45 2016 -0700 Committer: Reynold Xin Committed: Tue Aug 23 23:45:00 2016 -0700 -- .../scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a6e6a047/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala index c59ac3d..1684e8d 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala @@ -230,10 +230,8 @@ private[sql] class HiveSessionCatalog( // List of functions we are explicitly not supporting are: // compute_stats, context_ngrams, create_union, // current_user, ewah_bitmap, ewah_bitmap_and, ewah_bitmap_empty, ewah_bitmap_or, field, - // in_file, index, java_method, - // matchpath, ngrams, noop, noopstreaming, noopwithmap, noopwithmapstreaming, - // parse_url_tuple, posexplode, reflect2, - // str_to_map, windowingtablefunction. + // in_file, index, matchpath, ngrams, noop, noopstreaming, noopwithmap, + // noopwithmapstreaming, parse_url_tuple, reflect2, windowingtablefunction. private val hiveFunctions = Seq( "hash", "histogram_numeric",
spark git commit: [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala'
Repository: spark Updated Branches: refs/heads/master c1937dd19 -> b9994ad05 [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala' ## What changes were proposed in this pull request? This PR removes implemented functions from the comments of `HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`. ## How was this patch tested? Manual. Author: Weiqing Yang. Closes #14769 from Sherry302/cleanComment. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b9994ad0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b9994ad0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b9994ad0 Branch: refs/heads/master Commit: b9994ad05628077016331e6b411fbc09017b1e63 Parents: c1937dd Author: Weiqing Yang Authored: Tue Aug 23 23:44:45 2016 -0700 Committer: Reynold Xin Committed: Tue Aug 23 23:44:45 2016 -0700 -- .../scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b9994ad0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala index ebed9eb..ca8c734 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala @@ -230,10 +230,8 @@ private[sql] class HiveSessionCatalog( // List of functions we are explicitly not supporting are: // compute_stats, context_ngrams, create_union, // current_user, ewah_bitmap, ewah_bitmap_and, ewah_bitmap_empty, ewah_bitmap_or, field, - // in_file, index, java_method, - // matchpath, ngrams, noop, noopstreaming, noopwithmap, noopwithmapstreaming, - // parse_url_tuple, posexplode, reflect2, - // str_to_map, windowingtablefunction. + // in_file, index, matchpath, ngrams, noop, noopstreaming, noopwithmap, + // noopwithmapstreaming, parse_url_tuple, reflect2, windowingtablefunction. private val hiveFunctions = Seq( "hash", "histogram_numeric",
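The three names dropped from the comment were removed precisely because Spark SQL now implements them natively, so they no longer need the list of unsupported Hive functions. A quick sanity-check sketch, assuming a `SparkSession` in scope as `spark` (output formatting may differ slightly by version):

```scala
// All three now resolve to native Spark SQL expressions, not Hive UDFs.
spark.sql("SELECT java_method('java.lang.String', 'valueOf', 1)").show()
spark.sql("SELECT posexplode(array(10, 20, 30))").show()   // (pos, col) rows
spark.sql("SELECT str_to_map('a:1,b:2', ',', ':')").show() // {a -> 1, b -> 2}
```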