spark git commit: [SPARK-17190][SQL] Removal of HiveSharedState

2016-08-24 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master ac27557eb -> 4d0706d61


[SPARK-17190][SQL] Removal of HiveSharedState

### What changes were proposed in this pull request?
Since `HiveClient` is used to interact with the Hive metastore, it should be 
hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
`HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
`HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
straightforward. After removing `HiveSharedState`, the reflection logic is 
applied directly to the choice of `ExternalCatalog` implementation, based on the 
`CATALOG_IMPLEMENTATION` configuration.
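
As a rough illustration (not the exact `SharedState` code), the reflection-based 
choice can be sketched as below, assuming both catalog implementations expose a 
`(SparkConf, Configuration)` constructor, as the `InMemoryCatalog` change in the 
diff does:

```scala
// Hypothetical sketch of selecting the ExternalCatalog implementation by reflection.
// The class and config names are the real ones, but this is not the actual SharedState code.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

def externalCatalogClassName(conf: SparkConf): String =
  conf.get("spark.sql.catalogImplementation", "in-memory") match {
    case "hive" => "org.apache.spark.sql.hive.HiveExternalCatalog"
    case _      => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
  }

def instantiateCatalog(conf: SparkConf, hadoopConf: Configuration): AnyRef = {
  // Look up the constructor taking (SparkConf, Configuration) and invoke it reflectively.
  val ctor = Class.forName(externalCatalogClassName(conf))
    .getDeclaredConstructor(classOf[SparkConf], classOf[Configuration])
  ctor.newInstance(conf, hadoopConf).asInstanceOf[AnyRef]
}
```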

~~`HiveClient` is also used/invoked by other entities besides 
`HiveExternalCatalog`; we define the following two APIs: getClient and 
getNewClient~~

### How was this patch tested?
The existing test cases

Author: gatorsmile 

Closes #14757 from gatorsmile/removeHiveClient.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4d0706d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4d0706d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4d0706d6

Branch: refs/heads/master
Commit: 4d0706d616176dc29ff3562e40cb00dd4eb9c302
Parents: ac27557
Author: gatorsmile 
Authored: Thu Aug 25 12:50:03 2016 +0800
Committer: Wenchen Fan 
Committed: Thu Aug 25 12:50:03 2016 +0800

--
 .../sql/catalyst/catalog/InMemoryCatalog.scala  |  8 +++-
 .../org/apache/spark/sql/SparkSession.scala | 14 +-
 .../apache/spark/sql/internal/SharedState.scala | 47 +++-
 .../hive/thriftserver/HiveThriftServer2.scala   |  2 +-
 .../org/apache/spark/sql/hive/HiveContext.scala |  4 --
 .../spark/sql/hive/HiveExternalCatalog.scala| 10 -
 .../spark/sql/hive/HiveMetastoreCatalog.scala   |  3 +-
 .../spark/sql/hive/HiveSessionState.scala   |  9 ++--
 .../apache/spark/sql/hive/HiveSharedState.scala | 47 
 .../apache/spark/sql/hive/test/TestHive.scala   | 15 +++
 .../spark/sql/hive/HiveDataFrameSuite.scala |  2 +-
 .../sql/hive/HiveExternalCatalogSuite.scala | 16 +++
 .../spark/sql/hive/HiveSparkSubmitSuite.scala   |  5 +--
 .../sql/hive/MetastoreDataSourcesSuite.scala|  3 +-
 .../spark/sql/hive/ShowCreateTableSuite.scala   |  2 +-
 15 files changed, 88 insertions(+), 99 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4d0706d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
index 9ebf7de..b55ddcb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
@@ -24,7 +24,7 @@ import scala.collection.mutable
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 
-import org.apache.spark.SparkException
+import org.apache.spark.{SparkConf, SparkException}
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
 import org.apache.spark.sql.catalyst.analysis._
@@ -39,7 +39,11 @@ import org.apache.spark.sql.catalyst.util.StringUtils
  *
  * All public methods should be synchronized for thread-safety.
  */
-class InMemoryCatalog(hadoopConfig: Configuration = new Configuration) extends 
ExternalCatalog {
+class InMemoryCatalog(
+conf: SparkConf = new SparkConf,
+hadoopConfig: Configuration = new Configuration)
+  extends ExternalCatalog {
+
   import CatalogTypes.TablePartitionSpec
 
   private class TableDesc(var table: CatalogTable) {

http://git-wip-us.apache.org/repos/asf/spark/blob/4d0706d6/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
index 362bf45..0f6292d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -96,10 +96,7 @@ class SparkSession private(
*/
   @transient
   private[sql] lazy val sharedState: SharedState = {
-existingSharedState.getOrElse(
-  SparkSession.reflect[SharedState, SparkContext](
-SparkSession.sharedStateClassName(sparkContext.conf),
-sparkContext))
+existingSharedState.getOrElse(new SharedState(sparkContext))
   }
 
   /**
@@ 

spark git commit: [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 3258f27a8 -> aa57083af


[SPARK-17228][SQL] Not infer/propagate non-deterministic constraints

## What changes were proposed in this pull request?

Given that filters based on non-deterministic constraints shouldn't be pushed 
down in the query plan, unnecessarily inferring them is confusing and a source 
of potential bugs. This patch simplifies the inferring logic by simply ignoring 
them.

## How was this patch tested?

Added a new test in `ConstraintPropagationSuite`.

Author: Sameer Agarwal 

Closes #14795 from sameeragarwal/deterministic-constraints.

(cherry picked from commit ac27557eb622a257abeb3e8551f06ebc72f87133)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa57083a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa57083a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa57083a

Branch: refs/heads/branch-2.0
Commit: aa57083af4cecb595bac09e437607d7142b54913
Parents: 3258f27
Author: Sameer Agarwal 
Authored: Wed Aug 24 21:24:24 2016 -0700
Committer: Reynold Xin 
Committed: Wed Aug 24 21:24:31 2016 -0700

--
 .../spark/sql/catalyst/plans/QueryPlan.scala   |  3 ++-
 .../plans/ConstraintPropagationSuite.scala | 17 +
 2 files changed, 19 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aa57083a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
index cf34f4b..9c60590 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
@@ -35,7 +35,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
   .union(inferAdditionalConstraints(constraints))
   .union(constructIsNotNullConstraints(constraints))
   .filter(constraint =>
-constraint.references.nonEmpty && 
constraint.references.subsetOf(outputSet))
+constraint.references.nonEmpty && 
constraint.references.subsetOf(outputSet) &&
+  constraint.deterministic)
   }
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/aa57083a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
index 5a76969..8d6a49a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
@@ -352,4 +352,21 @@ class ConstraintPropagationSuite extends SparkFunSuite {
 verifyConstraints(tr.analyze.constraints,
   ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "b")), 
IsNotNull(resolveColumn(tr, "c")))))
   }
+
+  test("not infer non-deterministic constraints") {
+val tr = LocalRelation('a.int, 'b.string, 'c.int)
+
+verifyConstraints(tr
+  .where('a.attr === Rand(0))
+  .analyze.constraints,
+  ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "a")))))
+
+verifyConstraints(tr
+  .where('a.attr === InputFileName())
+  .where('a.attr =!= 'c.attr)
+  .analyze.constraints,
+  ExpressionSet(Seq(resolveColumn(tr, "a") =!= resolveColumn(tr, "c"),
+IsNotNull(resolveColumn(tr, "a")),
+IsNotNull(resolveColumn(tr, "c")))))
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-17228][SQL] Not infer/propagate non-deterministic constraints

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 3a60be4b1 -> ac27557eb


[SPARK-17228][SQL] Not infer/propagate non-deterministic constraints

## What changes were proposed in this pull request?

Given that filters based on non-deterministic constraints shouldn't be pushed 
down in the query plan, unnecessarily inferring them is confusing and a source 
of potential bugs. This patch simplifies the inferring logic by simply ignoring 
them.
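
For context, a small self-contained illustration (not part of the patch; the 
DataFrame and values are made up) of what this means in practice:

```scala
// A predicate that uses rand() is non-deterministic, so it is no longer
// inferred/propagated as a constraint; deterministic facts such as IsNotNull(a)
// still are, mirroring the new test added to ConstraintPropagationSuite.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().master("local").appName("constraints").getOrCreate()
import spark.implicits._

val df = Seq((1, "x"), (2, "y")).toDF("a", "b")
val filtered = df.filter($"a" === rand(0))  // non-deterministic filter predicate

// The analyzed plan's constraint set contains isnotnull(a) but not the
// rand()-based equality, so that equality can never be pushed down or used to
// infer additional filters.
println(filtered.queryExecution.analyzed.constraints)

spark.stop()
```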

## How was this patch tested?

Added a new test in `ConstraintPropagationSuite`.

Author: Sameer Agarwal 

Closes #14795 from sameeragarwal/deterministic-constraints.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac27557e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac27557e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac27557e

Branch: refs/heads/master
Commit: ac27557eb622a257abeb3e8551f06ebc72f87133
Parents: 3a60be4
Author: Sameer Agarwal 
Authored: Wed Aug 24 21:24:24 2016 -0700
Committer: Reynold Xin 
Committed: Wed Aug 24 21:24:24 2016 -0700

--
 .../spark/sql/catalyst/plans/QueryPlan.scala   |  3 ++-
 .../plans/ConstraintPropagationSuite.scala | 17 +
 2 files changed, 19 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ac27557e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
index 8ee31f4..0fb6e7d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
@@ -35,7 +35,8 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] 
extends TreeNode[PlanT
   .union(inferAdditionalConstraints(constraints))
   .union(constructIsNotNullConstraints(constraints))
   .filter(constraint =>
-constraint.references.nonEmpty && 
constraint.references.subsetOf(outputSet))
+constraint.references.nonEmpty && 
constraint.references.subsetOf(outputSet) &&
+  constraint.deterministic)
   }
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/ac27557e/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
index 5a76969..8d6a49a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/ConstraintPropagationSuite.scala
@@ -352,4 +352,21 @@ class ConstraintPropagationSuite extends SparkFunSuite {
 verifyConstraints(tr.analyze.constraints,
   ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "b")), 
IsNotNull(resolveColumn(tr, "c")))))
   }
+
+  test("not infer non-deterministic constraints") {
+val tr = LocalRelation('a.int, 'b.string, 'c.int)
+
+verifyConstraints(tr
+  .where('a.attr === Rand(0))
+  .analyze.constraints,
+  ExpressionSet(Seq(IsNotNull(resolveColumn(tr, "a")))))
+
+verifyConstraints(tr
+  .where('a.attr === InputFileName())
+  .where('a.attr =!= 'c.attr)
+  .analyze.constraints,
+  ExpressionSet(Seq(resolveColumn(tr, "a") =!= resolveColumn(tr, "c"),
+IsNotNull(resolveColumn(tr, "a")),
+IsNotNull(resolveColumn(tr, "c")))))
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16216][SQL][BRANCH-2.0] Backport Read/write dateFormat/timestampFormat options for CSV and JSON

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9f363a690 -> 3258f27a8


[SPARK-16216][SQL][BRANCH-2.0] Backport Read/write dateFormat/timestampFormat 
options for CSV and JSON

## What changes were proposed in this pull request?

This PR backports https://github.com/apache/spark/pull/14279 to 2.0.

## How was this patch tested?

Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests 
cover the default cases.

Author: hyukjinkwon 

Closes #14799 from HyukjinKwon/SPARK-16216-json-csv-backport.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3258f27a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3258f27a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3258f27a

Branch: refs/heads/branch-2.0
Commit: 3258f27a881dfeb5ab8bae90c338603fa4b6f9d8
Parents: 9f363a6
Author: hyukjinkwon 
Authored: Wed Aug 24 21:19:35 2016 -0700
Committer: Reynold Xin 
Committed: Wed Aug 24 21:19:35 2016 -0700

--
 python/pyspark/sql/readwriter.py|  56 +--
 python/pyspark/sql/streaming.py |  30 +++-
 .../org/apache/spark/sql/DataFrameReader.scala  |  17 +-
 .../org/apache/spark/sql/DataFrameWriter.scala  |  12 ++
 .../datasources/csv/CSVInferSchema.scala|  42 ++---
 .../execution/datasources/csv/CSVOptions.scala  |  15 +-
 .../execution/datasources/csv/CSVRelation.scala |  43 -
 .../datasources/json/JSONOptions.scala  |   9 ++
 .../datasources/json/JacksonGenerator.scala |  14 +-
 .../datasources/json/JacksonParser.scala|  68 
 .../datasources/json/JsonFileFormat.scala   |   5 +-
 .../spark/sql/streaming/DataStreamReader.scala  |  16 +-
 .../datasources/csv/CSVInferSchemaSuite.scala   |   4 +-
 .../execution/datasources/csv/CSVSuite.scala| 156 ++-
 .../datasources/csv/CSVTypeCastSuite.scala  |  17 +-
 .../execution/datasources/json/JsonSuite.scala  |  74 -
 .../datasources/json/TestJsonData.scala |   6 +
 .../sql/sources/JsonHadoopFsRelationSuite.scala |   4 +
 18 files changed, 478 insertions(+), 110 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3258f27a/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 64de33e..3da6f49 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -156,7 +156,7 @@ class DataFrameReader(OptionUtils):
 def json(self, path, schema=None, primitivesAsString=None, 
prefersDecimal=None,
  allowComments=None, allowUnquotedFieldNames=None, 
allowSingleQuotes=None,
  allowNumericLeadingZero=None, 
allowBackslashEscapingAnyCharacter=None,
- mode=None, columnNameOfCorruptRecord=None):
+ mode=None, columnNameOfCorruptRecord=None, dateFormat=None, 
timestampFormat=None):
 """
 Loads a JSON file (one object per line) or an RDD of Strings storing 
JSON objects
 (one object per record) and returns the result as a :class`DataFrame`.
@@ -198,6 +198,14 @@ class DataFrameReader(OptionUtils):
   
``spark.sql.columnNameOfCorruptRecord``. If None is set,
   it uses the value specified in
   
``spark.sql.columnNameOfCorruptRecord``.
+:param dateFormat: sets the string that indicates a date format. 
Custom date formats
+   follow the formats at 
``java.text.SimpleDateFormat``. This
+   applies to date type. If None is set, it uses the
+   default value value, ``yyyy-MM-dd``.
+:param timestampFormat: sets the string that indicates a timestamp 
format. Custom date
+formats follow the formats at 
``java.text.SimpleDateFormat``.
+This applies to timestamp type. If None is 
set, it uses the
+default value value, 
``yyyy-MM-dd'T'HH:mm:ss.SSSZZ``.
 
 >>> df1 = spark.read.json('python/test_support/sql/people.json')
 >>> df1.dtypes
@@ -213,7 +221,8 @@ class DataFrameReader(OptionUtils):
 allowComments=allowComments, 
allowUnquotedFieldNames=allowUnquotedFieldNames,
 allowSingleQuotes=allowSingleQuotes, 
allowNumericLeadingZero=allowNumericLeadingZero,
 
allowBackslashEscapingAnyCharacter=allowBackslashEscapingAnyCharacter,
-mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord)
+mode=mode, columnNameOfCorruptRecord=columnNameOfCorruptRecord, 

spark git commit: [SPARKR][MINOR] Add more examples to window function docs

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9f924a01b -> 43273377a


[SPARKR][MINOR] Add more examples to window function docs

## What changes were proposed in this pull request?

This PR adds more examples to window function docs to make them more accessible 
to the users.

It also fixes default value issues for `lag` and `lead`.
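
For readers more familiar with the Scala API, the usage the new R examples 
illustrate looks roughly like this (a sketch with made-up data; it is not part 
of this R docs change):

```scala
// Partition by am (transmission) and order by hp (horsepower), then lag mpg by
// one row within each partition -- the Scala analogue of the mtcars-based R examples.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val spark = SparkSession.builder().master("local").appName("windows").getOrCreate()
import spark.implicits._

val df = Seq((0, 110, 21.0), (0, 175, 19.2), (1, 66, 32.4), (1, 109, 21.4))
  .toDF("am", "hp", "mpg")

val ws = Window.partitionBy("am").orderBy("hp")
df.select($"am", $"hp", $"mpg", lag($"mpg", 1).over(ws).as("prev_mpg")).show()

spark.stop()
```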

## How was this patch tested?

Manual test, R unit test.

Author: Junyang Qian 

Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.

(cherry picked from commit 18708f76c366c6e01b5865981666e40d8642ac20)
Signed-off-by: Felix Cheung 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/43273377
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/43273377
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/43273377

Branch: refs/heads/branch-2.0
Commit: 43273377a38a9136ff5e56929630930f076af5af
Parents: 9f924a0
Author: Junyang Qian 
Authored: Wed Aug 24 16:00:04 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 16:00:18 2016 -0700

--
 R/pkg/R/WindowSpec.R | 12 
 R/pkg/R/functions.R  | 78 ---
 2 files changed, 72 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/43273377/R/pkg/R/WindowSpec.R
--
diff --git a/R/pkg/R/WindowSpec.R b/R/pkg/R/WindowSpec.R
index ddd2ef2..4ac83c2 100644
--- a/R/pkg/R/WindowSpec.R
+++ b/R/pkg/R/WindowSpec.R
@@ -203,6 +203,18 @@ setMethod("rangeBetween",
 #' @aliases over,Column,WindowSpec-method
 #' @family colum_func
 #' @export
+#' @examples \dontrun{
+#'   df <- createDataFrame(mtcars)
+#'
+#'   # Partition by am (transmission) and order by hp (horsepower)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'
+#'   # Rank on hp within each partition
+#'   out <- select(df, over(rank(), ws), df$hp, df$am)
+#'
+#'   # Lag mpg values by 1 row on the partition-and-ordered table
+#'   out <- select(df, over(lead(df$mpg), ws), df$mpg, df$hp, df$am)
+#' }
 #' @note over since 2.0.0
 setMethod("over",
   signature(x = "Column", window = "WindowSpec"),

http://git-wip-us.apache.org/repos/asf/spark/blob/43273377/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index f042add..dbf8dd8 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -3121,9 +3121,9 @@ setMethod("ifelse",
 #' @aliases cume_dist,missing-method
 #' @export
 #' @examples \dontrun{
-#'   df <- createDataFrame(iris)
-#'   ws <- orderBy(windowPartitionBy("Species"), "Sepal_Length")
-#'   out <- select(df, over(cume_dist(), ws), df$Sepal_Length, df$Species)
+#'   df <- createDataFrame(mtcars)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'   out <- select(df, over(cume_dist(), ws), df$hp, df$am)
 #' }
 #' @note cume_dist since 1.6.0
 setMethod("cume_dist",
@@ -3148,7 +3148,11 @@ setMethod("cume_dist",
 #' @family window_funcs
 #' @aliases dense_rank,missing-method
 #' @export
-#' @examples \dontrun{dense_rank()}
+#' @examples \dontrun{
+#'   df <- createDataFrame(mtcars)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'   out <- select(df, over(dense_rank(), ws), df$hp, df$am)
+#' }
 #' @note dense_rank since 1.6.0
 setMethod("dense_rank",
   signature("missing"),
@@ -3168,18 +3172,26 @@ setMethod("dense_rank",
 #' @param x the column as a character string or a Column to compute on.
 #' @param offset the number of rows back from the current row from which to 
obtain a value.
 #'   If not specified, the default is 1.
-#' @param defaultValue default to use when the offset row does not exist.
+#' @param defaultValue (optional) default to use when the offset row does not 
exist.
 #' @param ... further arguments to be passed to or from other methods.
 #' @rdname lag
 #' @name lag
 #' @aliases lag,characterOrColumn-method
 #' @family window_funcs
 #' @export
-#' @examples \dontrun{lag(df$c)}
+#' @examples \dontrun{
+#'   df <- createDataFrame(mtcars)
+#'
+#'   # Partition by am (transmission) and order by hp (horsepower)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'
+#'   # Lag mpg values by 1 row on the partition-and-ordered table
+#'   out <- select(df, over(lag(df$mpg), ws), df$mpg, df$hp, df$am)
+#' }
 #' @note lag since 1.6.0
 setMethod("lag",
   signature(x = "characterOrColumn"),
-  function(x, offset, defaultValue = NULL) {
+  function(x, offset = 1, defaultValue = NULL) {
 col <- if (class(x) == "Column") {
   x@jc
 } else {
@@ -3194,25 +3206,35 @@ setMethod("lag",
 #' lead
 #'
 #' Window function: 

spark git commit: [SPARKR][MINOR] Add installation message for remote master mode and improve other messages

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 18708f76c -> 3a60be4b1


[SPARKR][MINOR] Add installation message for remote master mode and improve 
other messages

## What changes were proposed in this pull request?

This PR gives an informative message to users when they try to connect to a remote 
master but don't have the Spark package on their local machine.

As a clarification, for now, automatic installation only happens if they start 
SparkR in the R console (rather than from sparkr-shell) and connect to a local 
master. In remote master mode, a local Spark package is still needed, but we will 
not trigger the install.spark function because the versions have to match those on 
the cluster, which involves more user input. Instead, we try to provide a detailed 
message that may help the users.

Some of the other messages have also been slightly changed.

## How was this patch tested?

Manual test.

Author: Junyang Qian 

Closes #14761 from junyangq/SPARK-16579-V1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3a60be4b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3a60be4b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3a60be4b

Branch: refs/heads/master
Commit: 3a60be4b15a5ab9b6e0c4839df99dac7738aa7fe
Parents: 18708f7
Author: Junyang Qian 
Authored: Wed Aug 24 16:04:14 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 16:04:14 2016 -0700

--
 R/pkg/R/install.R | 64 ++
 R/pkg/R/sparkR.R  | 51 ++--
 R/pkg/R/utils.R   |  4 ++--
 3 files changed, 80 insertions(+), 39 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3a60be4b/R/pkg/R/install.R
--
diff --git a/R/pkg/R/install.R b/R/pkg/R/install.R
index c6ed88e..69b0a52 100644
--- a/R/pkg/R/install.R
+++ b/R/pkg/R/install.R
@@ -70,9 +70,9 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = 
NULL,
   localDir = NULL, overwrite = FALSE) {
   version <- paste0("spark-", packageVersion("SparkR"))
   hadoopVersion <- tolower(hadoopVersion)
-  hadoopVersionName <- hadoop_version_name(hadoopVersion)
+  hadoopVersionName <- hadoopVersionName(hadoopVersion)
   packageName <- paste(version, "bin", hadoopVersionName, sep = "-")
-  localDir <- ifelse(is.null(localDir), spark_cache_path(),
+  localDir <- ifelse(is.null(localDir), sparkCachePath(),
  normalizePath(localDir, mustWork = FALSE))
 
   if (is.na(file.info(localDir)$isdir)) {
@@ -88,12 +88,14 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl 
= NULL,
 
   # can use dir.exists(packageLocalDir) under R 3.2.0 or later
   if (!is.na(file.info(packageLocalDir)$isdir) && !overwrite) {
-fmt <- "Spark %s for Hadoop %s is found, and SPARK_HOME set to %s"
+fmt <- "%s for Hadoop %s found, with SPARK_HOME set to %s"
 msg <- sprintf(fmt, version, ifelse(hadoopVersion == "without", "Free 
build", hadoopVersion),
packageLocalDir)
 message(msg)
 Sys.setenv(SPARK_HOME = packageLocalDir)
 return(invisible(packageLocalDir))
+  } else {
+message("Spark not found in the cache directory. Installation will start.")
   }
 
   packageLocalPath <- paste0(packageLocalDir, ".tgz")
@@ -102,7 +104,7 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl 
= NULL,
   if (tarExists && !overwrite) {
 message("tar file found.")
   } else {
-robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, 
packageLocalPath)
+robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName, 
packageLocalPath)
   }
 
   message(sprintf("Installing to %s", localDir))
@@ -116,33 +118,37 @@ install.spark <- function(hadoopVersion = "2.7", 
mirrorUrl = NULL,
   invisible(packageLocalDir)
 }
 
-robust_download_tar <- function(mirrorUrl, version, hadoopVersion, 
packageName, packageLocalPath) {
+robustDownloadTar <- function(mirrorUrl, version, hadoopVersion, packageName, 
packageLocalPath) {
   # step 1: use user-provided url
   if (!is.null(mirrorUrl)) {
 msg <- sprintf("Use user-provided mirror site: %s.", mirrorUrl)
 message(msg)
-success <- direct_download_tar(mirrorUrl, version, hadoopVersion,
+success <- directDownloadTar(mirrorUrl, version, hadoopVersion,
packageName, packageLocalPath)
-if (success) return()
+if (success) {
+  return()
+} else {
+  message(paste0("Unable to download from mirrorUrl: ", mirrorUrl))
+}
   } else {
-message("Mirror site not provided.")
+message("MirrorUrl not provided.")
   }
 
   # step 2: use url suggested from 

spark git commit: [SPARKR][MINOR] Add installation message for remote master mode and improve other messages

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 43273377a -> 9f363a690


[SPARKR][MINOR] Add installation message for remote master mode and improve 
other messages

## What changes were proposed in this pull request?

This PR gives an informative message to users when they try to connect to a remote 
master but don't have the Spark package on their local machine.

As a clarification, for now, automatic installation only happens if they start 
SparkR in the R console (rather than from sparkr-shell) and connect to a local 
master. In remote master mode, a local Spark package is still needed, but we will 
not trigger the install.spark function because the versions have to match those on 
the cluster, which involves more user input. Instead, we try to provide a detailed 
message that may help the users.

Some of the other messages have also been slightly changed.

## How was this patch tested?

Manual test.

Author: Junyang Qian 

Closes #14761 from junyangq/SPARK-16579-V1.

(cherry picked from commit 3a60be4b15a5ab9b6e0c4839df99dac7738aa7fe)
Signed-off-by: Felix Cheung 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9f363a69
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9f363a69
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9f363a69

Branch: refs/heads/branch-2.0
Commit: 9f363a690102f04a2a486853c1b89134455518bc
Parents: 4327337
Author: Junyang Qian 
Authored: Wed Aug 24 16:04:14 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 16:04:26 2016 -0700

--
 R/pkg/R/install.R | 64 ++
 R/pkg/R/sparkR.R  | 51 ++--
 R/pkg/R/utils.R   |  4 ++--
 3 files changed, 80 insertions(+), 39 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9f363a69/R/pkg/R/install.R
--
diff --git a/R/pkg/R/install.R b/R/pkg/R/install.R
index c6ed88e..69b0a52 100644
--- a/R/pkg/R/install.R
+++ b/R/pkg/R/install.R
@@ -70,9 +70,9 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl = 
NULL,
   localDir = NULL, overwrite = FALSE) {
   version <- paste0("spark-", packageVersion("SparkR"))
   hadoopVersion <- tolower(hadoopVersion)
-  hadoopVersionName <- hadoop_version_name(hadoopVersion)
+  hadoopVersionName <- hadoopVersionName(hadoopVersion)
   packageName <- paste(version, "bin", hadoopVersionName, sep = "-")
-  localDir <- ifelse(is.null(localDir), spark_cache_path(),
+  localDir <- ifelse(is.null(localDir), sparkCachePath(),
  normalizePath(localDir, mustWork = FALSE))
 
   if (is.na(file.info(localDir)$isdir)) {
@@ -88,12 +88,14 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl 
= NULL,
 
   # can use dir.exists(packageLocalDir) under R 3.2.0 or later
   if (!is.na(file.info(packageLocalDir)$isdir) && !overwrite) {
-fmt <- "Spark %s for Hadoop %s is found, and SPARK_HOME set to %s"
+fmt <- "%s for Hadoop %s found, with SPARK_HOME set to %s"
 msg <- sprintf(fmt, version, ifelse(hadoopVersion == "without", "Free 
build", hadoopVersion),
packageLocalDir)
 message(msg)
 Sys.setenv(SPARK_HOME = packageLocalDir)
 return(invisible(packageLocalDir))
+  } else {
+message("Spark not found in the cache directory. Installation will start.")
   }
 
   packageLocalPath <- paste0(packageLocalDir, ".tgz")
@@ -102,7 +104,7 @@ install.spark <- function(hadoopVersion = "2.7", mirrorUrl 
= NULL,
   if (tarExists && !overwrite) {
 message("tar file found.")
   } else {
-robust_download_tar(mirrorUrl, version, hadoopVersion, packageName, 
packageLocalPath)
+robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName, 
packageLocalPath)
   }
 
   message(sprintf("Installing to %s", localDir))
@@ -116,33 +118,37 @@ install.spark <- function(hadoopVersion = "2.7", 
mirrorUrl = NULL,
   invisible(packageLocalDir)
 }
 
-robust_download_tar <- function(mirrorUrl, version, hadoopVersion, 
packageName, packageLocalPath) {
+robustDownloadTar <- function(mirrorUrl, version, hadoopVersion, packageName, 
packageLocalPath) {
   # step 1: use user-provided url
   if (!is.null(mirrorUrl)) {
 msg <- sprintf("Use user-provided mirror site: %s.", mirrorUrl)
 message(msg)
-success <- direct_download_tar(mirrorUrl, version, hadoopVersion,
+success <- directDownloadTar(mirrorUrl, version, hadoopVersion,
packageName, packageLocalPath)
-if (success) return()
+if (success) {
+  return()
+} else {
+  message(paste0("Unable to download from mirrorUrl: ", mirrorUrl))
+}
   } else {

spark git commit: [SPARKR][MINOR] Add more examples to window function docs

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 945c04bcd -> 18708f76c


[SPARKR][MINOR] Add more examples to window function docs

## What changes were proposed in this pull request?

This PR adds more examples to window function docs to make them more accessible 
to the users.

It also fixes default value issues for `lag` and `lead`.

## How was this patch tested?

Manual test, R unit test.

Author: Junyang Qian 

Closes #14779 from junyangq/SPARKR-FixWindowFunctionDocs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18708f76
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18708f76
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18708f76

Branch: refs/heads/master
Commit: 18708f76c366c6e01b5865981666e40d8642ac20
Parents: 945c04b
Author: Junyang Qian 
Authored: Wed Aug 24 16:00:04 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 16:00:04 2016 -0700

--
 R/pkg/R/WindowSpec.R | 12 
 R/pkg/R/functions.R  | 78 ---
 2 files changed, 72 insertions(+), 18 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/18708f76/R/pkg/R/WindowSpec.R
--
diff --git a/R/pkg/R/WindowSpec.R b/R/pkg/R/WindowSpec.R
index ddd2ef2..4ac83c2 100644
--- a/R/pkg/R/WindowSpec.R
+++ b/R/pkg/R/WindowSpec.R
@@ -203,6 +203,18 @@ setMethod("rangeBetween",
 #' @aliases over,Column,WindowSpec-method
 #' @family colum_func
 #' @export
+#' @examples \dontrun{
+#'   df <- createDataFrame(mtcars)
+#'
+#'   # Partition by am (transmission) and order by hp (horsepower)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'
+#'   # Rank on hp within each partition
+#'   out <- select(df, over(rank(), ws), df$hp, df$am)
+#'
+#'   # Lag mpg values by 1 row on the partition-and-ordered table
+#'   out <- select(df, over(lead(df$mpg), ws), df$mpg, df$hp, df$am)
+#' }
 #' @note over since 2.0.0
 setMethod("over",
   signature(x = "Column", window = "WindowSpec"),

http://git-wip-us.apache.org/repos/asf/spark/blob/18708f76/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index f042add..dbf8dd8 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -3121,9 +3121,9 @@ setMethod("ifelse",
 #' @aliases cume_dist,missing-method
 #' @export
 #' @examples \dontrun{
-#'   df <- createDataFrame(iris)
-#'   ws <- orderBy(windowPartitionBy("Species"), "Sepal_Length")
-#'   out <- select(df, over(cume_dist(), ws), df$Sepal_Length, df$Species)
+#'   df <- createDataFrame(mtcars)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'   out <- select(df, over(cume_dist(), ws), df$hp, df$am)
 #' }
 #' @note cume_dist since 1.6.0
 setMethod("cume_dist",
@@ -3148,7 +3148,11 @@ setMethod("cume_dist",
 #' @family window_funcs
 #' @aliases dense_rank,missing-method
 #' @export
-#' @examples \dontrun{dense_rank()}
+#' @examples \dontrun{
+#'   df <- createDataFrame(mtcars)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'   out <- select(df, over(dense_rank(), ws), df$hp, df$am)
+#' }
 #' @note dense_rank since 1.6.0
 setMethod("dense_rank",
   signature("missing"),
@@ -3168,18 +3172,26 @@ setMethod("dense_rank",
 #' @param x the column as a character string or a Column to compute on.
 #' @param offset the number of rows back from the current row from which to 
obtain a value.
 #'   If not specified, the default is 1.
-#' @param defaultValue default to use when the offset row does not exist.
+#' @param defaultValue (optional) default to use when the offset row does not 
exist.
 #' @param ... further arguments to be passed to or from other methods.
 #' @rdname lag
 #' @name lag
 #' @aliases lag,characterOrColumn-method
 #' @family window_funcs
 #' @export
-#' @examples \dontrun{lag(df$c)}
+#' @examples \dontrun{
+#'   df <- createDataFrame(mtcars)
+#'
+#'   # Partition by am (transmission) and order by hp (horsepower)
+#'   ws <- orderBy(windowPartitionBy("am"), "hp")
+#'
+#'   # Lag mpg values by 1 row on the partition-and-ordered table
+#'   out <- select(df, over(lag(df$mpg), ws), df$mpg, df$hp, df$am)
+#' }
 #' @note lag since 1.6.0
 setMethod("lag",
   signature(x = "characterOrColumn"),
-  function(x, offset, defaultValue = NULL) {
+  function(x, offset = 1, defaultValue = NULL) {
 col <- if (class(x) == "Column") {
   x@jc
 } else {
@@ -3194,25 +3206,35 @@ setMethod("lag",
 #' lead
 #'
 #' Window function: returns the value that is \code{offset} rows after the 
current row, and
-#' NULL if there is less than \code{offset} rows after the 

spark git commit: [MINOR][SPARKR] fix R MLlib parameter documentation

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 29952ed09 -> 945c04bcd


[MINOR][SPARKR] fix R MLlib parameter documentation

## What changes were proposed in this pull request?

Fixed several misplaced param tags - they should be on the spark.* method 
generics

## How was this patch tested?

run knitr
junyangq

Author: Felix Cheung 

Closes #14792 from felixcheung/rdocmllib.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/945c04bc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/945c04bc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/945c04bc

Branch: refs/heads/master
Commit: 945c04bcd439e0624232c040df529f12bcc05e13
Parents: 29952ed
Author: Felix Cheung 
Authored: Wed Aug 24 15:59:09 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 15:59:09 2016 -0700

--
 R/pkg/R/mllib.R | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/945c04bc/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index a670600..dfc5a1c 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -444,6 +444,7 @@ setMethod("write.ml", signature(object = "LDAModel", path = 
"character"),
 #' @param featureIndex The index of the feature if \code{featuresCol} is a 
vector column
 #' (default: 0), no effect otherwise
 #' @param weightCol The weight column name.
+#' @param ... additional arguments passed to the method.
 #' @return \code{spark.isoreg} returns a fitted Isotonic Regression model
 #' @rdname spark.isoreg
 #' @aliases spark.isoreg,SparkDataFrame,formula-method
@@ -504,7 +505,6 @@ setMethod("predict", signature(object = 
"IsotonicRegressionModel"),
 
 #  Get the summary of an IsotonicRegressionModel model
 
-#' @param ... Other optional arguments to summary of an IsotonicRegressionModel
 #' @return \code{summary} returns the model's boundaries and prediction as 
lists
 #' @rdname spark.isoreg
 #' @aliases summary,IsotonicRegressionModel-method
@@ -1074,6 +1074,7 @@ setMethod("predict", signature(object = 
"AFTSurvivalRegressionModel"),
 #' @param k number of independent Gaussians in the mixture model.
 #' @param maxIter maximum iteration number.
 #' @param tol the convergence tolerance.
+#' @param ... additional arguments passed to the method.
 #' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
 #' @return \code{spark.gaussianMixture} returns a fitted multivariate gaussian 
mixture model.
 #' @rdname spark.gaussianMixture
@@ -1117,7 +1118,6 @@ setMethod("spark.gaussianMixture", signature(data = 
"SparkDataFrame", formula =
 #  Get the summary of a multivariate gaussian mixture model
 
 #' @param object a fitted gaussian mixture model.
-#' @param ... currently not used argument(s) passed to the method.
 #' @return \code{summary} returns the model's lambda, mu, sigma and posterior.
 #' @aliases spark.gaussianMixture,SparkDataFrame,formula-method
 #' @rdname spark.gaussianMixture


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and dateFormat/timestampFormat option for CSV and JSON

2016-08-24 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/master 891ac2b91 -> 29952ed09


[SPARK-16216][SQL] Read/write timestamps and dates in ISO 8601 and 
dateFormat/timestampFormat option for CSV and JSON

## What changes were proposed in this pull request?

### Default - ISO 8601

Currently, the CSV datasource writes `Timestamp` and `Date` in numeric form, and 
the JSON datasource writes both as below:

- CSV
  ```
  // TimestampType
  14144598
  // DateType
  16673
  ```

- Json

  ```
  // TimestampType
  1970-01-01 11:46:40.0
  // DateType
  1970-01-01
  ```

So, for CSV we can't read back what we write, and for JSON it becomes ambiguous 
because the timezone is missing.

So, this PR makes both `Timestamp` and `Date` **write** as ISO 8601 formatted 
strings (please refer to the [ISO 8601 
specification](https://www.w3.org/TR/NOTE-datetime)).

- For `Timestamp` it becomes as below: (`yyyy-MM-dd'T'HH:mm:ss.SSSZZ`)

  ```
  1970-01-01T02:00:01.000-01:00
  ```

- For `Date` it becomes as below (`yyyy-MM-dd`)

  ```
  1970-01-01
  ```

### Custom date format option - `dateFormat`

This PR also adds support for writing and reading dates and timestamps in a 
formatted string, as below:

- **DateType**

  - With `dateFormat` option (e.g. `yyyy/MM/dd`)

```
+----------+
|      date|
+----------+
|2015/08/26|
|2014/10/27|
|2016/01/28|
+----------+
```

### Custom timestamp format option - `timestampFormat`

- **TimestampType**

  - With `timestampFormat` option (e.g. `dd/MM/yyyy HH:mm`)

```
+----------------+
|            date|
+----------------+
|2015/08/26 18:00|
|2014/10/27 18:30|
|2016/01/28 20:00|
+----------------+
```
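
Putting the options together, a usage sketch (the path, schema, and data are 
hypothetical; the option names and the ISO 8601 defaults are the ones described 
above):

```scala
// Write dates/timestamps with custom patterns, then read them back with the same
// patterns; omitting the options falls back to the ISO 8601 defaults
// (yyyy-MM-dd and yyyy-MM-dd'T'HH:mm:ss.SSSZZ).
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("date-formats").getOrCreate()
import spark.implicits._

val df = Seq((Date.valueOf("2015-08-26"), Timestamp.valueOf("2015-08-26 18:00:00")))
  .toDF("date", "ts")

df.write
  .option("dateFormat", "yyyy/MM/dd")
  .option("timestampFormat", "dd/MM/yyyy HH:mm")
  .csv("/tmp/date-format-example")

val schema = StructType(Seq(StructField("date", DateType), StructField("ts", TimestampType)))
val parsed = spark.read
  .schema(schema)
  .option("dateFormat", "yyyy/MM/dd")
  .option("timestampFormat", "dd/MM/yyyy HH:mm")
  .csv("/tmp/date-format-example")
parsed.show()

spark.stop()
```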

## How was this patch tested?

Unit tests were added in `CSVSuite` and `JsonSuite`. For JSON, existing tests 
cover the default cases.

Author: hyukjinkwon 

Closes #14279 from HyukjinKwon/SPARK-16216-json-csv.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/29952ed0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/29952ed0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/29952ed0

Branch: refs/heads/master
Commit: 29952ed096fd2a0a19079933ff691671d6f00835
Parents: 891ac2b
Author: hyukjinkwon 
Authored: Wed Aug 24 22:16:20 2016 +0200
Committer: Herman van Hovell 
Committed: Wed Aug 24 22:16:20 2016 +0200

--
 python/pyspark/sql/readwriter.py|  56 +--
 python/pyspark/sql/streaming.py |  30 +++-
 .../org/apache/spark/sql/DataFrameReader.scala  |  18 ++-
 .../org/apache/spark/sql/DataFrameWriter.scala  |  12 ++
 .../datasources/csv/CSVInferSchema.scala|  42 ++---
 .../execution/datasources/csv/CSVOptions.scala  |  15 +-
 .../execution/datasources/csv/CSVRelation.scala |  43 -
 .../datasources/json/JSONOptions.scala  |   9 ++
 .../datasources/json/JacksonGenerator.scala |  13 +-
 .../datasources/json/JacksonParser.scala|  27 +++-
 .../datasources/json/JsonFileFormat.scala   |   5 +-
 .../spark/sql/streaming/DataStreamReader.scala  |  19 ++-
 .../datasources/csv/CSVInferSchemaSuite.scala   |   4 +-
 .../execution/datasources/csv/CSVSuite.scala| 157 ++-
 .../datasources/csv/CSVTypeCastSuite.scala  |  17 +-
 .../execution/datasources/json/JsonSuite.scala  |  67 +++-
 .../datasources/json/TestJsonData.scala |   6 +
 .../sql/sources/JsonHadoopFsRelationSuite.scala |   4 +
 18 files changed, 454 insertions(+), 90 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/29952ed0/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 64de33e..3da6f49 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -156,7 +156,7 @@ class DataFrameReader(OptionUtils):
 def json(self, path, schema=None, primitivesAsString=None, 
prefersDecimal=None,
  allowComments=None, allowUnquotedFieldNames=None, 
allowSingleQuotes=None,
  allowNumericLeadingZero=None, 
allowBackslashEscapingAnyCharacter=None,
- mode=None, columnNameOfCorruptRecord=None):
+ mode=None, columnNameOfCorruptRecord=None, dateFormat=None, 
timestampFormat=None):
 """
 Loads a JSON file (one object per line) or an RDD of Strings storing 
JSON objects
 (one object per record) and returns the result as a :class`DataFrame`.
@@ -198,6 +198,14 @@ class DataFrameReader(OptionUtils):
   
``spark.sql.columnNameOfCorruptRecord``. If None is set,
   it uses the value specified in

spark git commit: [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData

2016-08-24 Thread tgraves
Repository: spark
Updated Branches:
  refs/heads/master 40b30fcf4 -> 891ac2b91


[SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData

## What changes were proposed in this pull request?

Based on #12990 by tankkyo

Since the History Server currently loads all applications' data, it can OOM if 
too many applications have a significant task count. `spark.ui.trimTasks` 
(default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` 
(default: 100000)

(This is a "quick fix" to help those running into the problem until a update of 
how the history server loads app data can be done)
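
For example, the cap can be set through `SparkConf` (or equivalently in 
`spark-defaults.conf` for the History Server process); the value below is 
illustrative, not a recommendation:

```scala
// Bound how many tasks per stage the UI data structures retain, trading some
// task-level detail for a bounded memory footprint.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.ui.retainedTasks", "50000")  // lower than the default to cap memory
```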

## How was this patch tested?

Manual testing and dev/run-tests

![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png)

Author: Alex Bozarth 

Closes #14673 from ajbozarth/spark15083.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/891ac2b9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/891ac2b9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/891ac2b9

Branch: refs/heads/master
Commit: 891ac2b914fb6f90a62c6fbc0a3960a89d1c1d92
Parents: 40b30fc
Author: Alex Bozarth 
Authored: Wed Aug 24 14:39:41 2016 -0500
Committer: Tom Graves 
Committed: Wed Aug 24 14:39:41 2016 -0500

--
 .../apache/spark/internal/config/package.scala  |   5 +
 .../spark/ui/jobs/JobProgressListener.scala |   9 +-
 .../org/apache/spark/ui/jobs/StagePage.scala|  12 +-
 .../scala/org/apache/spark/ui/jobs/UIData.scala |   4 +-
 .../stage_task_list_w__sortBy_expectation.json  | 130 ++---
 ...ortBy_short_names___runtime_expectation.json | 130 ++---
 ...sortBy_short_names__runtime_expectation.json | 182 +--
 .../status/api/v1/AllStagesResourceSuite.scala  |   4 +-
 docs/configuration.md   |   8 +
 9 files changed, 256 insertions(+), 228 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/891ac2b9/core/src/main/scala/org/apache/spark/internal/config/package.scala
--
diff --git a/core/src/main/scala/org/apache/spark/internal/config/package.scala 
b/core/src/main/scala/org/apache/spark/internal/config/package.scala
index be3dac4..47174e4 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/package.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -114,4 +114,9 @@ package object config {
   private[spark] val PYSPARK_PYTHON = ConfigBuilder("spark.pyspark.python")
 .stringConf
 .createOptional
+
+  // To limit memory usage, we only track information for a fixed number of 
tasks
+  private[spark] val UI_RETAINED_TASKS = 
ConfigBuilder("spark.ui.retainedTasks")
+.intConf
+.createWithDefault(100000)
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/891ac2b9/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
index 491f716..d3a4f9d 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
@@ -19,12 +19,13 @@ package org.apache.spark.ui.jobs
 
 import java.util.concurrent.TimeoutException
 
-import scala.collection.mutable.{HashMap, HashSet, ListBuffer}
+import scala.collection.mutable.{HashMap, HashSet, LinkedHashMap, ListBuffer}
 
 import org.apache.spark._
 import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.executor.TaskMetrics
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
 import org.apache.spark.scheduler._
 import org.apache.spark.scheduler.SchedulingMode.SchedulingMode
 import org.apache.spark.storage.BlockManagerId
@@ -93,6 +94,7 @@ class JobProgressListener(conf: SparkConf) extends 
SparkListener with Logging {
 
   val retainedStages = conf.getInt("spark.ui.retainedStages", 
SparkUI.DEFAULT_RETAINED_STAGES)
   val retainedJobs = conf.getInt("spark.ui.retainedJobs", 
SparkUI.DEFAULT_RETAINED_JOBS)
+  val retainedTasks = conf.get(UI_RETAINED_TASKS)
 
   // We can test for memory leaks by ensuring that collections that track 
non-active jobs and
   // stages do not grow without bound and that collections for active 
jobs/stages eventually become
@@ -405,6 +407,11 @@ class JobProgressListener(conf: SparkConf) extends 
SparkListener with Logging {
   taskData.updateTaskMetrics(taskMetrics)
   taskData.errorMessage = errorMessage
 
+  // If Tasks is too large, remove 

spark git commit: [SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, percent_rank, cume_dist

2016-08-24 Thread hvanhovell
Repository: spark
Updated Branches:
  refs/heads/master 0b3a4be92 -> 40b30fcf4


[SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, percent_rank, 
cume_dist

## What changes were proposed in this pull request?

Currently, two-word window functions like `row_number`, `dense_rank`, 
`percent_rank`, and `cume_dist` are expressed without `_` in error messages. We 
had better show the correct names.

**Before**
```scala
scala> sql("select row_number()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: rownumber()
```

**After**
```scala
scala> sql("select row_number()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: 
row_number()
```

## How was this patch tested?

Pass the Jenkins and manual.

Author: Dongjoon Hyun 

Closes #14571 from dongjoon-hyun/SPARK-16983.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40b30fcf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40b30fcf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40b30fcf

Branch: refs/heads/master
Commit: 40b30fcf453169534cb53d01cd22236210b13005
Parents: 0b3a4be
Author: Dongjoon Hyun 
Authored: Wed Aug 24 21:14:40 2016 +0200
Committer: Herman van Hovell 
Committed: Wed Aug 24 21:14:40 2016 +0200

--
 .../sql/catalyst/expressions/windowExpressions.scala | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/40b30fcf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index 6806591..b47486f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -477,7 +477,7 @@ object SizeBasedWindowFunction {
  the window partition.""")
 case class RowNumber() extends RowNumberLike {
   override val evaluateExpression = rowNumber
-  override def sql: String = "ROW_NUMBER()"
+  override def prettyName: String = "row_number"
 }
 
 /**
@@ -497,7 +497,7 @@ case class CumeDist() extends RowNumberLike with 
SizeBasedWindowFunction {
   // return the same value for equal values in the partition.
   override val frame = SpecifiedWindowFrame(RangeFrame, UnboundedPreceding, 
CurrentRow)
   override val evaluateExpression = Divide(Cast(rowNumber, DoubleType), 
Cast(n, DoubleType))
-  override def sql: String = "CUME_DIST()"
+  override def prettyName: String = "cume_dist"
 }
 
 /**
@@ -628,6 +628,8 @@ abstract class RankLike extends AggregateWindowFunction {
   override val updateExpressions = increaseRank +: increaseRowNumber +: 
children
   override val evaluateExpression: Expression = rank
 
+  override def sql: String = s"${prettyName.toUpperCase}()"
+
   def withOrder(order: Seq[Expression]): RankLike
 }
 
@@ -649,7 +651,6 @@ abstract class RankLike extends AggregateWindowFunction {
 case class Rank(children: Seq[Expression]) extends RankLike {
   def this() = this(Nil)
   override def withOrder(order: Seq[Expression]): Rank = Rank(order)
-  override def sql: String = "RANK()"
 }
 
 /**
@@ -674,7 +675,7 @@ case class DenseRank(children: Seq[Expression]) extends 
RankLike {
   override val updateExpressions = increaseRank +: children
   override val aggBufferAttributes = rank +: orderAttrs
   override val initialValues = zero +: orderInit
-  override def sql: String = "DENSE_RANK()"
+  override def prettyName: String = "dense_rank"
 }
 
 /**
@@ -701,5 +702,5 @@ case class PercentRank(children: Seq[Expression]) extends 
RankLike with SizeBase
   override val evaluateExpression = If(GreaterThan(n, one),
   Divide(Cast(Subtract(rank, one), DoubleType), Cast(Subtract(n, one), 
DoubleType)),
   Literal(0.0d))
-  override def sql: String = "PERCENT_RANK()"
+  override def prettyName: String = "percent_rank"
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment

2016-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 29091d7cd -> 9f924a01b


[SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same 
java used in the spark environment

## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen 

Closes #14748 from srowen/SPARK-16781.

(cherry picked from commit 0b3a4be92ca6b38eef32ea5ca240d9f91f68aa65)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9f924a01
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9f924a01
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9f924a01

Branch: refs/heads/branch-2.0
Commit: 9f924a01b27ebba56080c9ad01b84fff026d5dcd
Parents: 29091d7
Author: Sean Owen 
Authored: Wed Aug 24 20:04:09 2016 +0100
Committer: Sean Owen 
Committed: Wed Aug 24 20:04:20 2016 +0100

--
 LICENSE|   2 +-
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2.2 |   2 +-
 dev/deps/spark-deps-hadoop-2.3 |   2 +-
 dev/deps/spark-deps-hadoop-2.4 |   2 +-
 dev/deps/spark-deps-hadoop-2.6 |   2 +-
 dev/deps/spark-deps-hadoop-2.7 |   2 +-
 python/docs/Makefile   |   2 +-
 python/lib/py4j-0.10.1-src.zip | Bin 61356 -> 0 bytes
 python/lib/py4j-0.10.3-src.zip | Bin 0 -> 91275 bytes
 sbin/spark-config.sh   |   2 +-
 .../org/apache/spark/deploy/yarn/Client.scala  |   6 +++---
 .../spark/deploy/yarn/YarnClusterSuite.scala   |   2 +-
 16 files changed, 16 insertions(+), 16 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/LICENSE
--
diff --git a/LICENSE b/LICENSE
index 94fd46f..d68609c 100644
--- a/LICENSE
+++ b/LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at 
licenses/LICENSE-[project].txt.
  (New BSD license) Protocol Buffer Java API 
(org.spark-project.protobuf:protobuf-java:2.4.1-shaded - 
http://code.google.com/p/protobuf)
  (The BSD License) Fortran to Java ARPACK 
(net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
  (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - 
http://xmlenc.sourceforge.net)
- (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.1 - 
http://py4j.sourceforge.net/)
+ (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.3 - 
http://py4j.sourceforge.net/)
  (Two-clause BSD-style license) JUnit-Interface 
(com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
  (BSD licence) sbt and sbt-launch-lib.bash
  (BSD 3 Clause) d3.min.js 
(https://github.com/mbostock/d3/blob/master/LICENSE)

http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index ac8aa04..037645d 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -65,7 +65,7 @@ export PYSPARK_PYTHON
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"

http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/bin/pyspark2.cmd
--
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index 3e2ff10..1217a4f 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.1-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.3-src.zip;%PYTHONPATH%
 
 set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
 set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py

http://git-wip-us.apache.org/repos/asf/spark/blob/9f924a01/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index bb27ec9..208659b 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -327,7 +327,7 @@
 
   net.sf.py4j
  

spark git commit: [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment

2016-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 2fbdb6063 -> 0b3a4be92


[SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same 
java used in the spark environment

## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen 

Closes #14748 from srowen/SPARK-16781.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b3a4be9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b3a4be9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b3a4be9

Branch: refs/heads/master
Commit: 0b3a4be92ca6b38eef32ea5ca240d9f91f68aa65
Parents: 2fbdb60
Author: Sean Owen 
Authored: Wed Aug 24 20:04:09 2016 +0100
Committer: Sean Owen 
Committed: Wed Aug 24 20:04:09 2016 +0100

--
 LICENSE|   2 +-
 bin/pyspark|   2 +-
 bin/pyspark2.cmd   |   2 +-
 core/pom.xml   |   2 +-
 .../org/apache/spark/api/python/PythonUtils.scala  |   2 +-
 dev/deps/spark-deps-hadoop-2.2 |   2 +-
 dev/deps/spark-deps-hadoop-2.3 |   2 +-
 dev/deps/spark-deps-hadoop-2.4 |   2 +-
 dev/deps/spark-deps-hadoop-2.6 |   2 +-
 dev/deps/spark-deps-hadoop-2.7 |   2 +-
 python/docs/Makefile   |   2 +-
 python/lib/py4j-0.10.1-src.zip | Bin 61356 -> 0 bytes
 python/lib/py4j-0.10.3-src.zip | Bin 0 -> 91275 bytes
 sbin/spark-config.sh   |   2 +-
 .../org/apache/spark/deploy/yarn/Client.scala  |   6 +++---
 .../spark/deploy/yarn/YarnClusterSuite.scala   |   2 +-
 16 files changed, 16 insertions(+), 16 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/LICENSE
--
diff --git a/LICENSE b/LICENSE
index 94fd46f..d68609c 100644
--- a/LICENSE
+++ b/LICENSE
@@ -263,7 +263,7 @@ The text of each license is also included at 
licenses/LICENSE-[project].txt.
  (New BSD license) Protocol Buffer Java API 
(org.spark-project.protobuf:protobuf-java:2.4.1-shaded - 
http://code.google.com/p/protobuf)
  (The BSD License) Fortran to Java ARPACK 
(net.sourceforge.f2j:arpack_combined_all:0.1 - http://f2j.sourceforge.net)
  (The BSD License) xmlenc Library (xmlenc:xmlenc:0.52 - 
http://xmlenc.sourceforge.net)
- (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.1 - 
http://py4j.sourceforge.net/)
+ (The New BSD License) Py4J (net.sf.py4j:py4j:0.10.3 - 
http://py4j.sourceforge.net/)
  (Two-clause BSD-style license) JUnit-Interface 
(com.novocode:junit-interface:0.10 - http://github.com/szeiger/junit-interface/)
  (BSD licence) sbt and sbt-launch-lib.bash
  (BSD 3 Clause) d3.min.js 
(https://github.com/mbostock/d3/blob/master/LICENSE)

http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/bin/pyspark
--
diff --git a/bin/pyspark b/bin/pyspark
index a0d7e22..7590309 100755
--- a/bin/pyspark
+++ b/bin/pyspark
@@ -57,7 +57,7 @@ export PYSPARK_PYTHON
 
 # Add the PySpark classes to the Python path:
 export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
-export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH"
+export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH"
 
 # Load the PySpark shell.py script when ./pyspark is used interactively:
 export OLD_PYTHONSTARTUP="$PYTHONSTARTUP"

http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/bin/pyspark2.cmd
--
diff --git a/bin/pyspark2.cmd b/bin/pyspark2.cmd
index 3e2ff10..1217a4f 100644
--- a/bin/pyspark2.cmd
+++ b/bin/pyspark2.cmd
@@ -30,7 +30,7 @@ if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
 )
 
 set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.1-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.3-src.zip;%PYTHONPATH%
 
 set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
 set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py

http://git-wip-us.apache.org/repos/asf/spark/blob/0b3a4be9/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index 04b94a2..ab6c3ce 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -326,7 +326,7 @@
 
      <groupId>net.sf.py4j</groupId>
      <artifactId>py4j</artifactId>
-      <version>0.10.1</version>
+      <version>0.10.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>


spark git commit: [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master d2932a0e9 -> 2fbdb6063


[SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR

https://issues.apache.org/jira/browse/SPARK-16445

## What changes were proposed in this pull request?

Create Multilayer Perceptron Classifier wrapper in SparkR
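As a pointer for readers, the SparkR API delegates to the Scala `MultilayerPerceptronClassifier` via the new wrapper; below is a minimal sketch of the underlying Scala estimator (the libsvm path and the layer sizes are illustrative assumptions, not part of this patch):

```
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mlp-sketch").getOrCreate()
// Assumed sample data: 4 features and 3 classes, matching the layer sizes below.
val train = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))   // input layer, two hidden layers, output layer
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)
model.transform(train).select("prediction").show(5)
```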

## How was this patch tested?

Tested manually on local machine

Author: Xin Ren 

Closes #14447 from keypointt/SPARK-16445.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2fbdb606
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2fbdb606
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2fbdb606

Branch: refs/heads/master
Commit: 2fbdb606392631b1dff88ec86f388cc2559c28f5
Parents: d2932a0
Author: Xin Ren 
Authored: Wed Aug 24 11:18:10 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 11:18:10 2016 -0700

--
 R/pkg/NAMESPACE |   1 +
 R/pkg/R/generics.R  |   4 +
 R/pkg/R/mllib.R | 125 -
 R/pkg/inst/tests/testthat/test_mllib.R  |  32 +
 .../MultilayerPerceptronClassifierWrapper.scala | 134 +++
 .../scala/org/apache/spark/ml/r/RWrappers.scala |   2 +
 6 files changed, 293 insertions(+), 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2fbdb606/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 7090576..ad587a6 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -27,6 +27,7 @@ exportMethods("glm",
   "summary",
   "spark.kmeans",
   "fitted",
+  "spark.mlp",
   "spark.naiveBayes",
   "spark.survreg",
   "spark.lda",

http://git-wip-us.apache.org/repos/asf/spark/blob/2fbdb606/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 4e6..7e626be 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -1330,6 +1330,10 @@ setGeneric("spark.kmeans", function(data, formula, ...) 
{ standardGeneric("spark
 #' @export
 setGeneric("fitted")
 
+#' @rdname spark.mlp
+#' @export
+setGeneric("spark.mlp", function(data, ...) { standardGeneric("spark.mlp") })
+
 #' @rdname spark.naiveBayes
 #' @export
 setGeneric("spark.naiveBayes", function(data, formula, ...) { 
standardGeneric("spark.naiveBayes") })

http://git-wip-us.apache.org/repos/asf/spark/blob/2fbdb606/R/pkg/R/mllib.R
--
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index a40310d..a670600 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -60,6 +60,13 @@ setClass("AFTSurvivalRegressionModel", representation(jobj = 
"jobj"))
 #' @note KMeansModel since 2.0.0
 setClass("KMeansModel", representation(jobj = "jobj"))
 
+#' S4 class that represents a MultilayerPerceptronClassificationModel
+#'
+#' @param jobj a Java object reference to the backing Scala 
MultilayerPerceptronClassifierWrapper
+#' @export
+#' @note MultilayerPerceptronClassificationModel since 2.1.0
+setClass("MultilayerPerceptronClassificationModel", representation(jobj = 
"jobj"))
+
 #' S4 class that represents an IsotonicRegressionModel
 #'
 #' @param jobj a Java object reference to the backing Scala 
IsotonicRegressionModel
@@ -90,7 +97,7 @@ setClass("ALSModel", representation(jobj = "jobj"))
 #' @export
 #' @seealso \link{spark.glm}, \link{glm},
 #' @seealso \link{spark.als}, \link{spark.gaussianMixture}, 
\link{spark.isoreg}, \link{spark.kmeans},
-#' @seealso \link{spark.lda}, \link{spark.naiveBayes}, \link{spark.survreg},
+#' @seealso \link{spark.lda}, \link{spark.mlp}, \link{spark.naiveBayes}, 
\link{spark.survreg}
 #' @seealso \link{read.ml}
 NULL
 
@@ -103,7 +110,7 @@ NULL
 #' @export
 #' @seealso \link{spark.glm}, \link{glm},
 #' @seealso \link{spark.als}, \link{spark.gaussianMixture}, 
\link{spark.isoreg}, \link{spark.kmeans},
-#' @seealso \link{spark.naiveBayes}, \link{spark.survreg},
+#' @seealso \link{spark.mlp}, \link{spark.naiveBayes}, \link{spark.survreg}
 NULL
 
 write_internal <- function(object, path, overwrite = FALSE) {
@@ -631,6 +638,95 @@ setMethod("predict", signature(object = "KMeansModel"),
 predict_internal(object, newData)
   })
 
+#' Multilayer Perceptron Classification Model
+#'
+#' \code{spark.mlp} fits a multi-layer perceptron neural network model against 
a SparkDataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, 
\code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load 
fitted 

spark git commit: [SPARKR][MINOR] Fix doc for show method

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 33d79b587 -> 29091d7cd


[SPARKR][MINOR] Fix doc for show method

## What changes were proposed in this pull request?

The original doc of `show` put the methods for multiple classes together, but the 
text only talks about `SparkDataFrame`. This PR fixes this problem.

## How was this patch tested?

Manual test.

Author: Junyang Qian 

Closes #14776 from junyangq/SPARK-FixShowDoc.

(cherry picked from commit d2932a0e987132c694ed59515b7c77adaad052e6)
Signed-off-by: Felix Cheung 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/29091d7c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/29091d7c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/29091d7c

Branch: refs/heads/branch-2.0
Commit: 29091d7cd60c20bf019dc9c1625a22e80ea50928
Parents: 33d79b5
Author: Junyang Qian 
Authored: Wed Aug 24 10:40:09 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 10:40:26 2016 -0700

--
 R/pkg/R/DataFrame.R | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/29091d7c/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index f8a05c6..ab45d2c 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -205,9 +205,9 @@ setMethod("showDF",
 
 #' show
 #'
-#' Print the SparkDataFrame column names and types
+#' Print class and type information of a Spark object.
 #'
-#' @param object a SparkDataFrame.
+#' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, 
WindowSpec.
 #'
 #' @family SparkDataFrame functions
 #' @rdname show





spark git commit: [SPARKR][MINOR] Fix doc for show method

2016-08-24 Thread felixcheung
Repository: spark
Updated Branches:
  refs/heads/master 45b786aca -> d2932a0e9


[SPARKR][MINOR] Fix doc for show method

## What changes were proposed in this pull request?

The original doc of `show` put the methods for multiple classes together, but the 
text only talks about `SparkDataFrame`. This PR fixes this problem.

## How was this patch tested?

Manual test.

Author: Junyang Qian 

Closes #14776 from junyangq/SPARK-FixShowDoc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2932a0e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2932a0e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2932a0e

Branch: refs/heads/master
Commit: d2932a0e987132c694ed59515b7c77adaad052e6
Parents: 45b786a
Author: Junyang Qian 
Authored: Wed Aug 24 10:40:09 2016 -0700
Committer: Felix Cheung 
Committed: Wed Aug 24 10:40:09 2016 -0700

--
 R/pkg/R/DataFrame.R | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d2932a0e/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 52a6628..e12b58e 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -212,9 +212,9 @@ setMethod("showDF",
 
 #' show
 #'
-#' Print the SparkDataFrame column names and types
+#' Print class and type information of a Spark object.
 #'
-#' @param object a SparkDataFrame.
+#' @param object a Spark object. Can be a SparkDataFrame, Column, GroupedData, 
WindowSpec.
 #'
 #' @family SparkDataFrame functions
 #' @rdname show





spark git commit: [MINOR][DOC] Fix wrong ml.feature.Normalizer document.

2016-08-24 Thread yliang
Repository: spark
Updated Branches:
  refs/heads/master 92c0eaf34 -> 45b786aca


[MINOR][DOC] Fix wrong ml.feature.Normalizer document.

## What changes were proposed in this pull request?
The ```ml.feature.Normalizer``` examples illustrate the L1 norm rather than L2, so we 
should correct the corresponding document.
![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png)
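For context, the `p` parameter of `Normalizer` controls which norm is used; a minimal Scala sketch of the corrected example follows (a `dataFrame` with a `features` vector column is assumed to exist):

```
import org.apache.spark.ml.feature.Normalizer

// Normalize each Vector row to unit L^1 norm, matching the corrected text (p = 1).
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)
val l1NormData = normalizer.transform(dataFrame)

// Normalize each Vector row to unit L^infinity norm by overriding p at transform time.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
```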

## How was this patch tested?
Doc change, no test.

Author: Yanbo Liang 

Closes #14787 from yanboliang/normalizer.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45b786ac
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45b786ac
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45b786ac

Branch: refs/heads/master
Commit: 45b786aca2b5818dc233643e6b3a53b869560563
Parents: 92c0eaf
Author: Yanbo Liang 
Authored: Wed Aug 24 08:24:16 2016 -0700
Committer: Yanbo Liang 
Committed: Wed Aug 24 08:24:16 2016 -0700

--
 docs/ml-features.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/45b786ac/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6020114..e41bf78 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -734,7 +734,7 @@ for more details on the API.
 
 `Normalizer` is a `Transformer` which transforms a dataset of `Vector` rows, 
normalizing each `Vector` to have unit norm.  It takes parameter `p`, which 
specifies the 
[p-norm](http://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm) used for 
normalization.  ($p = 2$ by default.)  This normalization can help standardize 
your input data and improve the behavior of learning algorithms.
 
-The following example demonstrates how to load a dataset in libsvm format and 
then normalize each row to have unit $L^2$ norm and unit $L^\infty$ norm.
+The following example demonstrates how to load a dataset in libsvm format and 
then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
 
 
 





spark git commit: [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated

2016-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 ce7dce175 -> 33d79b587


[SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer 
when some quantiles are duplicated

## What changes were proposed in this pull request?

In cases where QuantileDiscretizer is applied to a numeric column with duplicated 
elements, we now take the unique elements generated from approxQuantiles as the 
input for Bucketizer.

## How was this patch tested?

A unit test is added in QuantileDiscretizerSuite.

Previously, QuantileDiscretizer.fit would throw an exception when calling setSplits 
on a list of splits with duplicated elements. Bucketizer.setSplits only accepts a 
numeric vector of two or more unique cut points, so deduplicating the splits may 
produce fewer buckets than requested.
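A minimal sketch of the scenario the added test exercises (the values and bucket count mirror the test; an active `spark` session is assumed):

```
import org.apache.spark.ml.feature.QuantileDiscretizer
import spark.implicits._  // an active SparkSession named `spark` is assumed

// Heavily duplicated input: several of the approximate quantiles come out identical.
val df = Seq(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0)
  .map(Tuple1.apply).toDF("input")

val discretizer = new QuantileDiscretizer()
  .setInputCol("input")
  .setOutputCol("result")
  .setNumBuckets(5)

// With this patch, duplicated split points are dropped and fit() no longer throws;
// the fitted Bucketizer may simply produce fewer than 5 buckets.
val result = discretizer.fit(df).transform(df)
result.select("result").distinct().count()
```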

Signed-off-by: VinceShieh 

Author: VinceShieh 

Closes #14747 from VinceShieh/SPARK-17086.

(cherry picked from commit 92c0eaf348b42b3479610da0be761013f9d81c54)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/33d79b58
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/33d79b58
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/33d79b58

Branch: refs/heads/branch-2.0
Commit: 33d79b58735770ac613540c21095a1e404f065b0
Parents: ce7dce1
Author: VinceShieh 
Authored: Wed Aug 24 10:16:58 2016 +0100
Committer: Sean Owen 
Committed: Wed Aug 24 13:46:40 2016 +0100

--
 .../spark/ml/feature/QuantileDiscretizer.scala   |  7 ++-
 .../ml/feature/QuantileDiscretizerSuite.scala| 19 +++
 2 files changed, 25 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/33d79b58/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 558a7bb..e098008 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -114,7 +114,12 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val bucketizer = new Bucketizer(uid).setSplits(splits)
+val distinctSplits = splits.distinct
+if (splits.length != distinctSplits.length) {
+  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
+s" buckets as a result.")
+}
+val bucketizer = new Bucketizer(uid).setSplits(distinctSplits.sorted)
 copyValues(bucketizer.setParent(this))
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/33d79b58/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
index b73dbd6..18f1e89 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
@@ -52,6 +52,25 @@ class QuantileDiscretizerSuite
   "Bucket sizes are not within expected relative error tolerance.")
   }
 
+  test("Test Bucketizer on duplicated splits") {
+val spark = this.spark
+import spark.implicits._
+
+val datasetSize = 12
+val numBuckets = 5
+val df = sc.parallelize(Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 
2.0, 1.0, 3.0))
+  .map(Tuple1.apply).toDF("input")
+val discretizer = new QuantileDiscretizer()
+  .setInputCol("input")
+  .setOutputCol("result")
+  .setNumBuckets(numBuckets)
+val result = discretizer.fit(df).transform(df)
+
+val observedNumBuckets = result.select("result").distinct.count
+assert(2 <= observedNumBuckets && observedNumBuckets <= numBuckets,
+  "Observed number of buckets are not within expected range.")
+  }
+
   test("Test transform method on unseen data") {
 val spark = this.spark
 import spark.implicits._





spark git commit: [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated

2016-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 673a80d22 -> 92c0eaf34


[SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer 
when some quantiles are duplicated

## What changes were proposed in this pull request?

In cases where QuantileDiscretizer is applied to a numeric column with duplicated 
elements, we now take the unique elements generated from approxQuantiles as the 
input for Bucketizer.

## How was this patch tested?

A unit test is added in QuantileDiscretizerSuite.

Previously, QuantileDiscretizer.fit would throw an exception when calling setSplits 
on a list of splits with duplicated elements. Bucketizer.setSplits only accepts a 
numeric vector of two or more unique cut points, so deduplicating the splits may 
produce fewer buckets than requested.

Signed-off-by: VinceShieh 

Author: VinceShieh 

Closes #14747 from VinceShieh/SPARK-17086.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92c0eaf3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92c0eaf3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92c0eaf3

Branch: refs/heads/master
Commit: 92c0eaf348b42b3479610da0be761013f9d81c54
Parents: 673a80d
Author: VinceShieh 
Authored: Wed Aug 24 10:16:58 2016 +0100
Committer: Sean Owen 
Committed: Wed Aug 24 10:16:58 2016 +0100

--
 .../spark/ml/feature/QuantileDiscretizer.scala   |  7 ++-
 .../ml/feature/QuantileDiscretizerSuite.scala| 19 +++
 2 files changed, 25 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/92c0eaf3/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 558a7bb..e098008 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -114,7 +114,12 @@ final class QuantileDiscretizer @Since("1.6.0") 
(@Since("1.6.0") override val ui
 splits(0) = Double.NegativeInfinity
 splits(splits.length - 1) = Double.PositiveInfinity
 
-val bucketizer = new Bucketizer(uid).setSplits(splits)
+val distinctSplits = splits.distinct
+if (splits.length != distinctSplits.length) {
+  log.warn(s"Some quantiles were identical. Bucketing to 
${distinctSplits.length - 1}" +
+s" buckets as a result.")
+}
+val bucketizer = new Bucketizer(uid).setSplits(distinctSplits.sorted)
 copyValues(bucketizer.setParent(this))
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/92c0eaf3/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
--
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
 
b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
index b73dbd6..18f1e89 100644
--- 
a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
@@ -52,6 +52,25 @@ class QuantileDiscretizerSuite
   "Bucket sizes are not within expected relative error tolerance.")
   }
 
+  test("Test Bucketizer on duplicated splits") {
+val spark = this.spark
+import spark.implicits._
+
+val datasetSize = 12
+val numBuckets = 5
+val df = sc.parallelize(Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 
2.0, 1.0, 3.0))
+  .map(Tuple1.apply).toDF("input")
+val discretizer = new QuantileDiscretizer()
+  .setInputCol("input")
+  .setOutputCol("result")
+  .setNumBuckets(numBuckets)
+val result = discretizer.fit(df).transform(df)
+
+val observedNumBuckets = result.select("result").distinct.count
+assert(2 <= observedNumBuckets && observedNumBuckets <= numBuckets,
+  "Observed number of buckets are not within expected range.")
+  }
+
   test("Test transform method on unseen data") {
 val spark = this.spark
 import spark.implicits._





spark git commit: [MINOR][BUILD] Fix Java CheckStyle Error

2016-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 df87f161c -> ce7dce175


[MINOR][BUILD] Fix Java CheckStyle Error

As Spark 2.0.1 will be released soon (mentioned on the Spark dev mailing list), it is 
better to fix the code style errors, in addition to the critical bugs, before the 
release.

Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525]
 (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] 
src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64]
 (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
Manual.

Author: Weiqing Yang 

Closes #14768 from Sherry302/fixjavastyle.

(cherry picked from commit 673a80d2230602c9e6573a23e35fb0f6b832bfca)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ce7dce17
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ce7dce17
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ce7dce17

Branch: refs/heads/branch-2.0
Commit: ce7dce1755a8d36ec7346adc3de26d8fdc4f05e9
Parents: df87f16
Author: Weiqing Yang 
Authored: Wed Aug 24 10:12:44 2016 +0100
Committer: Sean Owen 
Committed: Wed Aug 24 10:15:53 2016 +0100

--
 .../spark/util/collection/unsafe/sort/UnsafeExternalSorter.java   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ce7dce17/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
--
diff --git 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
index 0d67167..999ded4 100644
--- 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
+++ 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
@@ -521,7 +521,8 @@ public final class UnsafeExternalSorter extends 
MemoryConsumer {
   // is accessing the current record. We free this page in that 
caller's next loadNext()
   // call.
   for (MemoryBlock page : allocatedPages) {
-if (!loaded || page.pageNumber != 
((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) {
+if (!loaded || page.pageNumber !=
+
((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) {
   released += page.size();
   freePage(page);
 } else {





spark git commit: [MINOR][BUILD] Fix Java CheckStyle Error

2016-08-24 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 52fa45d62 -> 673a80d22


[MINOR][BUILD] Fix Java CheckStyle Error

## What changes were proposed in this pull request?
As Spark 2.0.1 will be released soon (mentioned on the Spark dev mailing list), it is 
better to fix the code style errors, in addition to the critical bugs, before the 
release.

Before:
```
./dev/lint-java
Checkstyle checks failed at following occurrences:
[ERROR] 
src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525]
 (sizes) LineLength: Line is longer than 100 characters (found 119).
[ERROR] 
src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64]
 (sizes) LineLength: Line is longer than 100 characters (found 103).
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
## How was this patch tested?
Manual.

Author: Weiqing Yang 

Closes #14768 from Sherry302/fixjavastyle.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/673a80d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/673a80d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/673a80d2

Branch: refs/heads/master
Commit: 673a80d2230602c9e6573a23e35fb0f6b832bfca
Parents: 52fa45d
Author: Weiqing Yang 
Authored: Wed Aug 24 10:12:44 2016 +0100
Committer: Sean Owen 
Committed: Wed Aug 24 10:12:44 2016 +0100

--
 .../collection/unsafe/sort/UnsafeExternalSorter.java |  3 ++-
 .../sql/streaming/JavaStructuredNetworkWordCount.java| 11 ++-
 2 files changed, 8 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/673a80d2/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
--
diff --git 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
index ccf7664..196e67d 100644
--- 
a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
+++ 
b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
@@ -522,7 +522,8 @@ public final class UnsafeExternalSorter extends 
MemoryConsumer {
   // is accessing the current record. We free this page in that 
caller's next loadNext()
   // call.
   for (MemoryBlock page : allocatedPages) {
-if (!loaded || page.pageNumber != 
((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) {
+if (!loaded || page.pageNumber !=
+
((UnsafeInMemorySorter.SortedIterator)upstream).getCurrentPageNumber()) {
   released += page.size();
   freePage(page);
 } else {

http://git-wip-us.apache.org/repos/asf/spark/blob/673a80d2/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
--
diff --git 
a/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
 
b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
index c913ee0..5f342e1 100644
--- 
a/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
+++ 
b/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java
@@ -61,11 +61,12 @@ public final class JavaStructuredNetworkWordCount {
   .load();
 
 // Split the lines into words
-    Dataset<String> words = lines.as(Encoders.STRING()).flatMap(new FlatMapFunction<String, String>() {
-      @Override
-      public Iterator<String> call(String x) {
-        return Arrays.asList(x.split(" ")).iterator();
-      }
+    Dataset<String> words = lines.as(Encoders.STRING())
+      .flatMap(new FlatMapFunction<String, String>() {
+        @Override
+        public Iterator<String> call(String x) {
+          return Arrays.asList(x.split(" ")).iterator();
+        }
     }, Encoders.STRING());
 
 // Generate running word count





spark git commit: [SPARK-17186][SQL] remove catalog table type INDEX

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a6e6a047b -> df87f161c


[SPARK-17186][SQL] remove catalog table type INDEX

## What changes were proposed in this pull request?

Spark SQL doesn't actually support indexes; the catalog table type `INDEX` comes 
from Hive. However, most operations in Spark SQL can't handle index tables, e.g. 
create table, alter table, etc.

Logically, index tables should be invisible to end users, and Hive also generates 
special table names for index tables to prevent users from accessing them directly. 
Hive has special SQL syntax to create/show/drop index tables.

On the Spark SQL side, although we can describe an index table directly, the result 
is unreadable; we should use the dedicated SQL syntax to do it (e.g. 
`SHOW INDEX ON tbl`). Spark SQL can also read an index table directly, but the 
result is always empty. (Can Hive read index tables directly?)

This PR removes the table type `INDEX` to make it clear that Spark SQL doesn't 
currently support indexes.

## How was this patch tested?

existing tests.

Author: Wenchen Fan 

Closes #14752 from cloud-fan/minor2.

(cherry picked from commit 52fa45d62a5a0bc832442f38f9e634c5d8e29e08)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/df87f161
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/df87f161
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/df87f161

Branch: refs/heads/branch-2.0
Commit: df87f161c9e40a49235ea722f6a662a488b41c4c
Parents: a6e6a04
Author: Wenchen Fan 
Authored: Tue Aug 23 23:46:09 2016 -0700
Committer: Reynold Xin 
Committed: Tue Aug 23 23:46:17 2016 -0700

--
 .../org/apache/spark/sql/catalyst/catalog/interface.scala| 1 -
 .../org/apache/spark/sql/execution/command/tables.scala  | 8 +++-
 .../scala/org/apache/spark/sql/hive/MetastoreRelation.scala  | 1 -
 .../org/apache/spark/sql/hive/client/HiveClientImpl.scala| 4 ++--
 .../apache/spark/sql/hive/execution/HiveCommandSuite.scala   | 2 +-
 5 files changed, 6 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/df87f161/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
index 6197aca..c083cf6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
@@ -203,7 +203,6 @@ case class CatalogTableType private(name: String)
 object CatalogTableType {
   val EXTERNAL = new CatalogTableType("EXTERNAL")
   val MANAGED = new CatalogTableType("MANAGED")
-  val INDEX = new CatalogTableType("INDEX")
   val VIEW = new CatalogTableType("VIEW")
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/df87f161/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
index b2300b4..a5ccbcf 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
@@ -678,12 +678,11 @@ case class ShowPartitionsCommand(
  * Validate and throws an [[AnalysisException]] exception under the 
following conditions:
  * 1. If the table is not partitioned.
  * 2. If it is a datasource table.
- * 3. If it is a view or index table.
+ * 3. If it is a view.
  */
-if (tab.tableType == VIEW ||
-  tab.tableType == INDEX) {
+if (tab.tableType == VIEW) {
   throw new AnalysisException(
-s"SHOW PARTITIONS is not allowed on a view or index table: 
${tab.qualifiedName}")
+s"SHOW PARTITIONS is not allowed on a view: ${tab.qualifiedName}")
 }
 
 if (!DDLUtils.isTablePartitioned(tab)) {
@@ -765,7 +764,6 @@ case class ShowCreateTableCommand(table: TableIdentifier) 
extends RunnableComman
   case EXTERNAL => " EXTERNAL TABLE"
   case VIEW => " VIEW"
   case MANAGED => " TABLE"
-  case INDEX => reportUnsupportedError(Seq("index table"))
 }
 
 builder ++= s"CREATE$tableTypeString ${table.quotedString}"

http://git-wip-us.apache.org/repos/asf/spark/blob/df87f161/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala

spark git commit: [SPARK-17186][SQL] remove catalog table type INDEX

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master b9994ad05 -> 52fa45d62


[SPARK-17186][SQL] remove catalog table type INDEX

## What changes were proposed in this pull request?

Spark SQL doesn't actually support indexes; the catalog table type `INDEX` comes 
from Hive. However, most operations in Spark SQL can't handle index tables, e.g. 
create table, alter table, etc.

Logically, index tables should be invisible to end users, and Hive also generates 
special table names for index tables to prevent users from accessing them directly. 
Hive has special SQL syntax to create/show/drop index tables.

On the Spark SQL side, although we can describe an index table directly, the result 
is unreadable; we should use the dedicated SQL syntax to do it (e.g. 
`SHOW INDEX ON tbl`). Spark SQL can also read an index table directly, but the 
result is always empty. (Can Hive read index tables directly?)

This PR removes the table type `INDEX` to make it clear that Spark SQL doesn't 
currently support indexes.

## How was this patch tested?

existing tests.

Author: Wenchen Fan 

Closes #14752 from cloud-fan/minor2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/52fa45d6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/52fa45d6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/52fa45d6

Branch: refs/heads/master
Commit: 52fa45d62a5a0bc832442f38f9e634c5d8e29e08
Parents: b9994ad
Author: Wenchen Fan 
Authored: Tue Aug 23 23:46:09 2016 -0700
Committer: Reynold Xin 
Committed: Tue Aug 23 23:46:09 2016 -0700

--
 .../org/apache/spark/sql/catalyst/catalog/interface.scala| 1 -
 .../org/apache/spark/sql/execution/command/tables.scala  | 8 +++-
 .../scala/org/apache/spark/sql/hive/MetastoreRelation.scala  | 1 -
 .../org/apache/spark/sql/hive/client/HiveClientImpl.scala| 4 ++--
 .../apache/spark/sql/hive/execution/HiveCommandSuite.scala   | 2 +-
 5 files changed, 6 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/52fa45d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
index f7762e0..83e01f9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
@@ -200,7 +200,6 @@ case class CatalogTableType private(name: String)
 object CatalogTableType {
   val EXTERNAL = new CatalogTableType("EXTERNAL")
   val MANAGED = new CatalogTableType("MANAGED")
-  val INDEX = new CatalogTableType("INDEX")
   val VIEW = new CatalogTableType("VIEW")
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/52fa45d6/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
index 21544a3..b4a15b8 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
@@ -620,12 +620,11 @@ case class ShowPartitionsCommand(
  * Validate and throws an [[AnalysisException]] exception under the 
following conditions:
  * 1. If the table is not partitioned.
  * 2. If it is a datasource table.
- * 3. If it is a view or index table.
+ * 3. If it is a view.
  */
-if (tab.tableType == VIEW ||
-  tab.tableType == INDEX) {
+if (tab.tableType == VIEW) {
   throw new AnalysisException(
-s"SHOW PARTITIONS is not allowed on a view or index table: 
${tab.qualifiedName}")
+s"SHOW PARTITIONS is not allowed on a view: ${tab.qualifiedName}")
 }
 
 if (tab.partitionColumnNames.isEmpty) {
@@ -708,7 +707,6 @@ case class ShowCreateTableCommand(table: TableIdentifier) 
extends RunnableComman
   case EXTERNAL => " EXTERNAL TABLE"
   case VIEW => " VIEW"
   case MANAGED => " TABLE"
-  case INDEX => reportUnsupportedError(Seq("index table"))
 }
 
 builder ++= s"CREATE$tableTypeString ${table.quotedString}"

http://git-wip-us.apache.org/repos/asf/spark/blob/52fa45d6/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala 

spark git commit: [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala'

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 a772b4b5d -> a6e6a047b


[MINOR][SQL] Remove implemented functions from comments of 
'HiveSessionCatalog.scala'

## What changes were proposed in this pull request?
This PR removes implemented functions from comments of 
`HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.

## How was this patch tested?
Manual.

Author: Weiqing Yang 

Closes #14769 from Sherry302/cleanComment.

(cherry picked from commit b9994ad05628077016331e6b411fbc09017b1e63)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6e6a047
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6e6a047
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6e6a047

Branch: refs/heads/branch-2.0
Commit: a6e6a047bb9215df55b009957d4c560624d886fc
Parents: a772b4b
Author: Weiqing Yang 
Authored: Tue Aug 23 23:44:45 2016 -0700
Committer: Reynold Xin 
Committed: Tue Aug 23 23:45:00 2016 -0700

--
 .../scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala   | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a6e6a047/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
index c59ac3d..1684e8d 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
@@ -230,10 +230,8 @@ private[sql] class HiveSessionCatalog(
   // List of functions we are explicitly not supporting are:
   // compute_stats, context_ngrams, create_union,
   // current_user, ewah_bitmap, ewah_bitmap_and, ewah_bitmap_empty, 
ewah_bitmap_or, field,
-  // in_file, index, java_method,
-  // matchpath, ngrams, noop, noopstreaming, noopwithmap, noopwithmapstreaming,
-  // parse_url_tuple, posexplode, reflect2,
-  // str_to_map, windowingtablefunction.
+  // in_file, index, matchpath, ngrams, noop, noopstreaming, noopwithmap,
+  // noopwithmapstreaming, parse_url_tuple, reflect2, windowingtablefunction.
   private val hiveFunctions = Seq(
 "hash",
 "histogram_numeric",





spark git commit: [MINOR][SQL] Remove implemented functions from comments of 'HiveSessionCatalog.scala'

2016-08-24 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master c1937dd19 -> b9994ad05


[MINOR][SQL] Remove implemented functions from comments of 
'HiveSessionCatalog.scala'

## What changes were proposed in this pull request?
This PR removes implemented functions from comments of 
`HiveSessionCatalog.scala`: `java_method`, `posexplode`, `str_to_map`.

## How was this patch tested?
Manual.

Author: Weiqing Yang 

Closes #14769 from Sherry302/cleanComment.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b9994ad0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b9994ad0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b9994ad0

Branch: refs/heads/master
Commit: b9994ad05628077016331e6b411fbc09017b1e63
Parents: c1937dd
Author: Weiqing Yang 
Authored: Tue Aug 23 23:44:45 2016 -0700
Committer: Reynold Xin 
Committed: Tue Aug 23 23:44:45 2016 -0700

--
 .../scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala   | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b9994ad0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
index ebed9eb..ca8c734 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
@@ -230,10 +230,8 @@ private[sql] class HiveSessionCatalog(
   // List of functions we are explicitly not supporting are:
   // compute_stats, context_ngrams, create_union,
   // current_user, ewah_bitmap, ewah_bitmap_and, ewah_bitmap_empty, 
ewah_bitmap_or, field,
-  // in_file, index, java_method,
-  // matchpath, ngrams, noop, noopstreaming, noopwithmap, noopwithmapstreaming,
-  // parse_url_tuple, posexplode, reflect2,
-  // str_to_map, windowingtablefunction.
+  // in_file, index, matchpath, ngrams, noop, noopstreaming, noopwithmap,
+  // noopwithmapstreaming, parse_url_tuple, reflect2, windowingtablefunction.
   private val hiveFunctions = Seq(
 "hash",
 "histogram_numeric",

