spark git commit: [SPARK-12273][STREAMING] Make Spark Streaming web UI list Receivers in order
Repository: spark Updated Branches: refs/heads/master aa305dcaf -> 713e6959d [SPARK-12273][STREAMING] Make Spark Streaming web UI list Receivers in order Currently the Streaming web UI does NOT list Receivers in order; however, it seems more convenient for the users if Receivers are listed in order. ![spark-12273](https://cloud.githubusercontent.com/assets/15843379/11736602/0bb7f7a8-a00b-11e5-8e86-96ba9297fb12.png) Author: proflinCloses #10264 from proflin/Spark-12273. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/713e6959 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/713e6959 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/713e6959 Branch: refs/heads/master Commit: 713e6959d21d24382ef99bbd7e9da751a7ed388c Parents: aa305dc Author: proflin Authored: Fri Dec 11 13:50:36 2015 -0800 Committer: Shixiong Zhu Committed: Fri Dec 11 13:50:36 2015 -0800 -- .../scala/org/apache/spark/streaming/ui/StreamingPage.scala | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/713e6959/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala index 4588b21..88a4483 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala @@ -392,8 +392,9 @@ private[ui] class StreamingPage(parent: StreamingTab) maxX: Long, minY: Double, maxY: Double): Seq[Node] = { -val content = listener.receivedEventRateWithBatchTime.map { case (streamId, eventRates) => - generateInputDStreamRow(jsCollector, streamId, eventRates, minX, maxX, minY, maxY) +val content = listener.receivedEventRateWithBatchTime.toList.sortBy(_._1).map { + case (streamId, eventRates) => +generateInputDStreamRow(jsCollector, streamId, eventRates, minX, maxX, minY, maxY) }.foldLeft[Seq[Node]](Nil)(_ ++ _) // scalastyle:off - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
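To see the ordering change in isolation, here is a minimal plain-Scala sketch; the map and names are illustrative stand-ins for `listener.receivedEventRateWithBatchTime`, not the actual StreamingPage code.

```scala
// Standalone sketch of the ordering change: iterate a map keyed by stream id
// in sorted-key order instead of relying on the map's arbitrary iteration order.
object SortedReceiverRowsSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for listener.receivedEventRateWithBatchTime: stream id -> (batch time, rate)
    val eventRates: Map[Int, Seq[(Long, Double)]] = Map(
      2 -> Seq((1000L, 5.0)),
      0 -> Seq((1000L, 3.0)),
      1 -> Seq((1000L, 4.0)))

    // Sorting by the key (the stream id) makes the receiver listing deterministic.
    val rows = eventRates.toList.sortBy(_._1).map { case (streamId, rates) =>
      s"receiver $streamId: avg rate ${rates.map(_._2).sum / rates.size}"
    }
    rows.foreach(println) // receiver 0, then 1, then 2
  }
}
```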
spark git commit: [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
Repository: spark Updated Branches: refs/heads/branch-1.6 2ddd10486 -> bfcc8cfee [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatri x` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike DusenberryCloses #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfcc8cfe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfcc8cfe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfcc8cfe Branch: refs/heads/branch-1.6 Commit: bfcc8cfee7219e63d2f53fc36627f95dc60428eb Parents: 2ddd104 Author: Mike Dusenberry Authored: Fri Dec 11 14:21:33 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 14:21:48 2015 -0800 -- .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bfcc8cfe/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 2aa6aec..8d546e3 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1143,7 +1143,7 @@ private[python] class PythonMLLibAPI extends Serializable { * Wrapper around RowMatrix constructor. */ def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = { -new RowMatrix(rows.rdd, numRows, numCols) +new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
Repository: spark Updated Branches: refs/heads/branch-1.5 5e603a51c -> e4cf12118 [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatri x` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike DusenberryCloses #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4cf1211 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4cf1211 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4cf1211 Branch: refs/heads/branch-1.5 Commit: e4cf1211803097eb3cdd93deccb7eb996e12da3d Parents: 5e603a5 Author: Mike Dusenberry Authored: Fri Dec 11 14:21:33 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 14:22:37 2015 -0800 -- .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e4cf1211/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 06e13b7..2f8b5e5 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1101,7 +1101,7 @@ private[python] class PythonMLLibAPI extends Serializable { * Wrapper around RowMatrix constructor. */ def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = { -new RowMatrix(rows.rdd, numRows, numCols) +new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
Repository: spark Updated Branches: refs/heads/master 713e6959d -> 1b8220387 [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatri x` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike DusenberryCloses #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b822038 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b822038 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b822038 Branch: refs/heads/master Commit: 1b8220387e6903564f765fabb54be0420c3e99d7 Parents: 713e695 Author: Mike Dusenberry Authored: Fri Dec 11 14:21:33 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 14:21:33 2015 -0800 -- .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1b822038/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 2aa6aec..8d546e3 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1143,7 +1143,7 @@ private[python] class PythonMLLibAPI extends Serializable { * Wrapper around RowMatrix constructor. */ def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = { -new RowMatrix(rows.rdd, numRows, numCols) +new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
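For readers who want the erasure issue and workaround in isolation, here is a rough sketch using plain String elements instead of MLlib Vectors. `retag` itself is package-private to Spark, so the sketch rebuilds the ClassTag the same way `retag` does internally; all names here are illustrative, not the PythonMLLibAPI code.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

// Minimal sketch of the erasure issue the patch works around. An RDD that
// crosses the Java/Py4J boundary comes back carrying an Object ClassTag, so
// Scala code that later needs the element tag (for example to build a
// Vector array) can fail with ClassCastException. Spark's internal fix is
// rows.rdd.retag(classOf[Vector]); this sketch re-attaches the tag with an
// identity map evaluated against an explicit ClassTag, which is what retag
// does under the hood.
object RetagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("retag-sketch").setMaster("local[1]"))

    // Simulate an RDD handed back from the Java API.
    val fromJava: RDD[String] = JavaRDD.fromRDD(sc.parallelize(Seq("a", "b", "c"))).rdd

    // Re-attach the intended element type before passing it to tag-sensitive code.
    val retagged: RDD[String] = fromJava.map(x => x)(ClassTag(classOf[String]))

    println(retagged.collect().mkString(",")) // a,b,c
    sc.stop()
  }
}
```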
spark git commit: [SPARK-12217][ML] Document invalid handling for StringIndexer
Repository: spark Updated Branches: refs/heads/branch-1.6 bfcc8cfee -> 75531c77e [SPARK-12217][ML] Document invalid handling for StringIndexer Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome. Author: BenFradetCloses #10257 from BenFradet/SPARK-12217. (cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75531c77 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75531c77 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75531c77 Branch: refs/heads/branch-1.6 Commit: 75531c77e85073c7be18985a54c623710894d861 Parents: bfcc8cf Author: BenFradet Authored: Fri Dec 11 15:43:00 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 15:43:09 2015 -0800 -- docs/ml-features.md | 36 1 file changed, 36 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/75531c77/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6494fed..8b00cc6 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -459,6 +459,42 @@ column, we should get the following: "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with index `2`. +Additionaly, there are two strategies regarding how `StringIndexer` will handle +unseen labels when you have fit a `StringIndexer` on one dataset and then use it +to transform another: + +- throw an exception (which is the default) +- skip the row containing the unseen label entirely + +**Examples** + +Let's go back to our previous example but this time reuse our previously defined +`StringIndexer` on the following dataset: + + + id | category +|-- + 0 | a + 1 | b + 2 | c + 3 | d + + +If you've not set how `StringIndexer` handles unseen labels or set it to +"error", an exception will be thrown. +However, if you had called `setHandleInvalid("skip")`, the following dataset +will be generated: + + + id | category | categoryIndex +|--|--- + 0 | a| 0.0 + 1 | b| 2.0 + 2 | c| 1.0 + + +Notice that the row containing "d" does not appear. + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12217][ML] Document invalid handling for StringIndexer
Repository: spark Updated Branches: refs/heads/master 1b8220387 -> aea676ca2 [SPARK-12217][ML] Document invalid handling for StringIndexer Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome. Author: BenFradetCloses #10257 from BenFradet/SPARK-12217. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aea676ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aea676ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aea676ca Branch: refs/heads/master Commit: aea676ca2d07c72b1a752e9308c961118e5bfc3c Parents: 1b82203 Author: BenFradet Authored: Fri Dec 11 15:43:00 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 15:43:00 2015 -0800 -- docs/ml-features.md | 36 1 file changed, 36 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aea676ca/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6494fed..8b00cc6 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -459,6 +459,42 @@ column, we should get the following: "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with index `2`. +Additionaly, there are two strategies regarding how `StringIndexer` will handle +unseen labels when you have fit a `StringIndexer` on one dataset and then use it +to transform another: + +- throw an exception (which is the default) +- skip the row containing the unseen label entirely + +**Examples** + +Let's go back to our previous example but this time reuse our previously defined +`StringIndexer` on the following dataset: + + + id | category +|-- + 0 | a + 1 | b + 2 | c + 3 | d + + +If you've not set how `StringIndexer` handles unseen labels or set it to +"error", an exception will be thrown. +However, if you had called `setHandleInvalid("skip")`, the following dataset +will be generated: + + + id | category | categoryIndex +|--|--- + 0 | a| 0.0 + 1 | b| 2.0 + 2 | c| 1.0 + + +Notice that the row containing "d" does not appear. + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
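A minimal Scala sketch of the documented behavior, assuming Spark 1.6's `ml` API; column names, app name, and data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SQLContext

// Fit a StringIndexer on one dataset, then transform another that contains
// an unseen label ("d"). With the default handleInvalid ("error") the
// transform would throw; with "skip" the offending row is dropped.
object HandleInvalidSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("string-indexer").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val train = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")))
      .toDF("id", "category")
    val test = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"))).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // default is "error": throw on unseen labels

    val model = indexer.fit(train)
    model.transform(test).show() // the row containing "d" does not appear

    sc.stop()
  }
}
```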
spark git commit: [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
Repository: spark Updated Branches: refs/heads/branch-1.6 03d801587 -> 47461fea7 [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases The existing sample functions miss the parameter `seed`, however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value. This could cause SparkR unit tests failed. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull Author: gatorsmileCloses #10160 from gatorsmile/sampleR. (cherry picked from commit 1e3526c2d3de723225024fedd45753b556e18fc6) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/47461fea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/47461fea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/47461fea Branch: refs/heads/branch-1.6 Commit: 47461fea7c079819de6add308f823c7a8294f891 Parents: 03d8015 Author: gatorsmile Authored: Fri Dec 11 20:55:16 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 20:55:24 2015 -0800 -- R/pkg/R/DataFrame.R | 17 +++-- R/pkg/inst/tests/testthat/test_sparkSQL.R | 4 2 files changed, 15 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/47461fea/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 975b058..764597d 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -662,6 +662,7 @@ setMethod("unique", #' @param x A SparkSQL DataFrame #' @param withReplacement Sampling with replacement or not #' @param fraction The (rough) sample target fraction +#' @param seed Randomness seed value #' #' @family DataFrame functions #' @rdname sample @@ -677,13 +678,17 @@ setMethod("unique", #' collect(sample(df, TRUE, 0.5)) #'} setMethod("sample", - # TODO : Figure out how to send integer as java.lang.Long to JVM so - # we can send seed as an argument through callJMethod signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { + function(x, withReplacement, fraction, seed) { if (fraction < 0.0) stop(cat("Negative fraction value:", fraction)) -sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +if (!missing(seed)) { + # TODO : Figure out how to send integer as java.lang.Long to JVM so + # we can send seed as an argument through callJMethod + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, as.integer(seed)) +} else { + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +} dataFrame(sdf) }) @@ -692,8 +697,8 @@ setMethod("sample", setMethod("sample_frac", signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { -sample(x, withReplacement, fraction) + function(x, withReplacement, fraction, seed) { +sample(x, withReplacement, fraction, seed) }) #' nrow http://git-wip-us.apache.org/repos/asf/spark/blob/47461fea/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index ed9b2c9..071fd31 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -724,6 +724,10 @@ test_that("sample on a DataFrame", { sampled2 <- sample(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled2) < 3) + count1 <- count(sample(df, 
FALSE, 0.1, 0)) + count2 <- count(sample(df, FALSE, 0.1, 0)) + expect_equal(count1, count2) + # Also test sample_frac sampled3 <- sample_frac(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled3) < 3) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
Repository: spark Updated Branches: refs/heads/master 1e799d617 -> 1e3526c2d [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases The existing sample functions miss the parameter `seed`, however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value. This could cause SparkR unit tests failed. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull Author: gatorsmileCloses #10160 from gatorsmile/sampleR. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e3526c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e3526c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e3526c2 Branch: refs/heads/master Commit: 1e3526c2d3de723225024fedd45753b556e18fc6 Parents: 1e799d6 Author: gatorsmile Authored: Fri Dec 11 20:55:16 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 20:55:16 2015 -0800 -- R/pkg/R/DataFrame.R | 17 +++-- R/pkg/inst/tests/testthat/test_sparkSQL.R | 4 2 files changed, 15 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e3526c2/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 975b058..764597d 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -662,6 +662,7 @@ setMethod("unique", #' @param x A SparkSQL DataFrame #' @param withReplacement Sampling with replacement or not #' @param fraction The (rough) sample target fraction +#' @param seed Randomness seed value #' #' @family DataFrame functions #' @rdname sample @@ -677,13 +678,17 @@ setMethod("unique", #' collect(sample(df, TRUE, 0.5)) #'} setMethod("sample", - # TODO : Figure out how to send integer as java.lang.Long to JVM so - # we can send seed as an argument through callJMethod signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { + function(x, withReplacement, fraction, seed) { if (fraction < 0.0) stop(cat("Negative fraction value:", fraction)) -sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +if (!missing(seed)) { + # TODO : Figure out how to send integer as java.lang.Long to JVM so + # we can send seed as an argument through callJMethod + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, as.integer(seed)) +} else { + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +} dataFrame(sdf) }) @@ -692,8 +697,8 @@ setMethod("sample", setMethod("sample_frac", signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { -sample(x, withReplacement, fraction) + function(x, withReplacement, fraction, seed) { +sample(x, withReplacement, fraction, seed) }) #' nrow http://git-wip-us.apache.org/repos/asf/spark/blob/1e3526c2/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index ed9b2c9..071fd31 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -724,6 +724,10 @@ test_that("sample on a DataFrame", { sampled2 <- sample(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled2) < 3) + count1 <- count(sample(df, FALSE, 0.1, 0)) + count2 <- count(sample(df, FALSE, 0.1, 0)) + expect_equal(count1, count2) + # Also test 
sample_frac sampled3 <- sample_frac(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled3) < 3) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
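For context, a small Scala sketch of the JVM-side `DataFrame.sample` overloads that the SparkR wrapper now reaches through `callJMethod`: one overload without a seed and one with, where a fixed seed makes the sampled rows reproducible. App name and data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch: the seeded overload yields the same sample for the same seed.
object SeededSampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seeded-sample").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000).toDF("id")

    val unseeded = df.sample(false, 0.1)     // may differ from run to run
    val a = df.sample(false, 0.1, seed = 0L) // deterministic for a fixed seed
    val b = df.sample(false, 0.1, seed = 0L)

    println(s"unseeded=${unseeded.count()} a=${a.count()} b=${b.count()} equal=${a.count() == b.count()}")
    sc.stop()
  }
}
```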
spark git commit: [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py
Repository: spark Updated Branches: refs/heads/branch-1.6 75531c77e -> c2f20469d [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion. #9873 finished the work of Scala example, here we focus on the Python one. Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```. BTW, fix minor missing issues of #9873. cc mengxr Author: Yanbo LiangCloses #9957 from yanboliang/SPARK-11978. (cherry picked from commit a0ff6d16ef4bcc1b6ff7282e82a9b345d8449454) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c2f20469 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c2f20469 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c2f20469 Branch: refs/heads/branch-1.6 Commit: c2f20469d5b53a027b022e3c4a9bea57452c5ba6 Parents: 75531c7 Author: Yanbo Liang Authored: Fri Dec 11 18:02:24 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 18:02:37 2015 -0800 -- .../src/main/python/ml/dataframe_example.py | 75 .../src/main/python/mllib/dataset_example.py| 63 .../spark/examples/ml/DataFrameExample.scala| 8 +-- 3 files changed, 79 insertions(+), 67 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c2f20469/examples/src/main/python/ml/dataframe_example.py -- diff --git a/examples/src/main/python/ml/dataframe_example.py b/examples/src/main/python/ml/dataframe_example.py new file mode 100644 index 000..d2644ca --- /dev/null +++ b/examples/src/main/python/ml/dataframe_example.py @@ -0,0 +1,75 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +An example of how to use DataFrame for ML. Run with:: +bin/spark-submit examples/src/main/python/ml/dataframe_example.py +""" +from __future__ import print_function + +import os +import sys +import tempfile +import shutil + +from pyspark import SparkContext +from pyspark.sql import SQLContext +from pyspark.mllib.stat import Statistics + +if __name__ == "__main__": +if len(sys.argv) > 2: +print("Usage: dataframe_example.py ", file=sys.stderr) +exit(-1) +sc = SparkContext(appName="DataFrameExample") +sqlContext = SQLContext(sc) +if len(sys.argv) == 2: +input = sys.argv[1] +else: +input = "data/mllib/sample_libsvm_data.txt" + +# Load input data +print("Loading LIBSVM file with UDT from " + input + ".") +df = sqlContext.read.format("libsvm").load(input).cache() +print("Schema from LIBSVM:") +df.printSchema() +print("Loaded training data as a DataFrame with " + + str(df.count()) + " records.") + +# Show statistical summary of labels. +labelSummary = df.describe("label") +labelSummary.show() + +# Convert features column to an RDD of vectors. 
+features = df.select("features").map(lambda r: r.features) +summary = Statistics.colStats(features) +print("Selected features column with average values:\n" + + str(summary.mean())) + +# Save the records in a parquet file. +tempdir = tempfile.NamedTemporaryFile(delete=False).name +os.unlink(tempdir) +print("Saving to " + tempdir + " as Parquet file.") +df.write.parquet(tempdir) + +# Load the records back. +print("Loading Parquet file with UDT from " + tempdir) +newDF = sqlContext.read.parquet(tempdir) +print("Schema from Parquet:") +newDF.printSchema() +shutil.rmtree(tempdir) + +sc.stop() http://git-wip-us.apache.org/repos/asf/spark/blob/c2f20469/examples/src/main/python/mllib/dataset_example.py -- diff --git a/examples/src/main/python/mllib/dataset_example.py b/examples/src/main/python/mllib/dataset_example.py deleted file
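A rough Scala sketch of the same flow as the relocated Python example: load LIBSVM data through the `libsvm` data source, summarize the label column, and round-trip the records through Parquet. The input path and the temp-dir handling are assumptions for illustration, not the commit's code.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameExampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dataframe-example").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)

    // Load LIBSVM data as a DataFrame with a vector-typed "features" column.
    val input = "data/mllib/sample_libsvm_data.txt" // assumed path, as in the Python example
    val df = sqlContext.read.format("libsvm").load(input).cache()
    println(s"Loaded ${df.count()} records; schema:")
    df.printSchema()

    // Statistical summary of the label column.
    df.describe("label").show()

    // Round-trip the records through Parquet.
    val parquetDir = Files.createTempDirectory("dataframe-example").resolve("parquet").toString
    df.write.parquet(parquetDir)
    sqlContext.read.parquet(parquetDir).printSchema()

    sc.stop()
  }
}
```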
spark git commit: [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
Repository: spark Updated Branches: refs/heads/master c119a34d1 -> 0fb982555 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c(âpath1â, âpath2â)) # character vector as arguments jsonFile(sqlContext, âpath1,path2â) ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side. * Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo LiangCloses #10145 from yanboliang/spark-12146. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fb98255 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fb98255 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fb98255 Branch: refs/heads/master Commit: 0fb9825556dbbcc98d7eafe9ddea8676301e09bb Parents: c119a34 Author: Yanbo Liang Authored: Fri Dec 11 11:47:35 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 11:47:35 2015 -0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/DataFrame.R | 102 ++--- R/pkg/R/SQLContext.R | 29 +++--- R/pkg/inst/tests/testthat/test_sparkSQL.R | 120 ++--- examples/src/main/r/dataframe.R | 2 +- 5 files changed, 138 insertions(+), 116 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0fb98255/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index ba64bc5..cab39d6 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -267,6 +267,7 @@ export("as.DataFrame", "createExternalTable", "dropTempTable", "jsonFile", + "read.json", "loadDF", "parquetFile", "read.df", http://git-wip-us.apache.org/repos/asf/spark/blob/0fb98255/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f4c4a25..975b058 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -24,14 +24,14 @@ setOldClass("jobj") #' @title S4 class that represents a DataFrame #' @description DataFrames can be created using functions like \link{createDataFrame}, -#' \link{jsonFile}, \link{table} etc. +#' \link{read.json}, \link{table} etc. 
#' @family DataFrame functions #' @rdname DataFrame #' @docType class #' #' @slot env An R environment that stores bookkeeping states of the DataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame -#' @seealso \link{createDataFrame}, \link{jsonFile}, \link{table} +#' @seealso \link{createDataFrame}, \link{read.json}, \link{table} #' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} #' @export #' @examples @@ -77,7 +77,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' printSchema(df) #'} setMethod("printSchema", @@ -102,7 +102,7 @@ setMethod("printSchema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' dfSchema <- schema(df) #'} setMethod("schema", @@ -126,7 +126,7 @@ setMethod("schema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' explain(df, TRUE) #'} setMethod("explain", @@ -157,7 +157,7 @@ setMethod("explain", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' isLocal(df) #'} setMethod("isLocal", @@ -182,7 +182,7 @@ setMethod("isLocal", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' showDF(df) #'} setMethod("showDF", @@ -207,7 +207,7 @@ setMethod("showDF", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' df #'} setMethod("show", "DataFrame", @@ -234,7 +234,7 @@ setMethod("show",
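Purely for illustration, a conceptual Scala-side sketch of combining several JSON inputs into one DataFrame. The SparkR wrapper itself dispatches through `callJMethod`, and the exact JVM entry point it hits is not shown in this diff, so this union-based helper is only a stand-in.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object MultiPathJsonSketch {
  // Read each JSON path separately and union the results
  // (assumes the inputs share a schema).
  def readJsonPaths(sqlContext: SQLContext, paths: Seq[String]): DataFrame =
    paths.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-json").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)

    // Placeholder paths for illustration only.
    val df = readJsonPaths(sqlContext, Seq("path/to/part1.json", "path/to/part2.json"))
    df.printSchema()
    sc.stop()
  }
}
```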
spark git commit: [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
Repository: spark Updated Branches: refs/heads/branch-1.6 2e4523161 -> f05bae4a3 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c(âpath1â, âpath2â)) # character vector as arguments jsonFile(sqlContext, âpath1,path2â) ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side. * Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo LiangCloses #10145 from yanboliang/spark-12146. (cherry picked from commit 0fb9825556dbbcc98d7eafe9ddea8676301e09bb) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f05bae4a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f05bae4a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f05bae4a Branch: refs/heads/branch-1.6 Commit: f05bae4a30c422f0d0b2ab1e41d32e9d483fa675 Parents: 2e45231 Author: Yanbo Liang Authored: Fri Dec 11 11:47:35 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 11:47:43 2015 -0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/DataFrame.R | 102 ++--- R/pkg/R/SQLContext.R | 29 +++--- R/pkg/inst/tests/testthat/test_sparkSQL.R | 120 ++--- examples/src/main/r/dataframe.R | 2 +- 5 files changed, 138 insertions(+), 116 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f05bae4a/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index ba64bc5..cab39d6 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -267,6 +267,7 @@ export("as.DataFrame", "createExternalTable", "dropTempTable", "jsonFile", + "read.json", "loadDF", "parquetFile", "read.df", http://git-wip-us.apache.org/repos/asf/spark/blob/f05bae4a/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f4c4a25..975b058 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -24,14 +24,14 @@ setOldClass("jobj") #' @title S4 class that represents a DataFrame #' @description DataFrames can be created using functions like \link{createDataFrame}, -#' \link{jsonFile}, \link{table} etc. +#' \link{read.json}, \link{table} etc. 
#' @family DataFrame functions #' @rdname DataFrame #' @docType class #' #' @slot env An R environment that stores bookkeeping states of the DataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame -#' @seealso \link{createDataFrame}, \link{jsonFile}, \link{table} +#' @seealso \link{createDataFrame}, \link{read.json}, \link{table} #' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} #' @export #' @examples @@ -77,7 +77,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' printSchema(df) #'} setMethod("printSchema", @@ -102,7 +102,7 @@ setMethod("printSchema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' dfSchema <- schema(df) #'} setMethod("schema", @@ -126,7 +126,7 @@ setMethod("schema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' explain(df, TRUE) #'} setMethod("explain", @@ -157,7 +157,7 @@ setMethod("explain", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' isLocal(df) #'} setMethod("isLocal", @@ -182,7 +182,7 @@ setMethod("isLocal", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' showDF(df) #'} setMethod("showDF", @@ -207,7 +207,7 @@ setMethod("showDF", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <-
spark git commit: [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up)
Repository: spark Updated Branches: refs/heads/master 518ab5101 -> c119a34d1 [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up) This is a follow-up PR for #10259 Author: Davies LiuCloses #10266 from davies/null_udf2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c119a34d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c119a34d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c119a34d Branch: refs/heads/master Commit: c119a34d1e9e599e302acfda92e5de681086a19f Parents: 518ab51 Author: Davies Liu Authored: Fri Dec 11 11:15:53 2015 -0800 Committer: Davies Liu Committed: Fri Dec 11 11:15:53 2015 -0800 -- .../sql/catalyst/expressions/ScalaUDF.scala | 31 +++- .../org/apache/spark/sql/DataFrameSuite.scala | 8 +++-- 2 files changed, 23 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c119a34d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala index 5deb2f8..85faa19 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala @@ -1029,24 +1029,27 @@ case class ScalaUDF( // such as IntegerType, its javaType is `int` and the returned type of user-defined // function is Object. Trying to convert an Object to `int` will cause casting exception. val evalCode = evals.map(_.code).mkString -val funcArguments = converterTerms.zipWithIndex.map { - case (converter, i) => -val eval = evals(i) -val dt = children(i).dataType -s"$converter.apply(${eval.isNull} ? null : (${ctx.boxedType(dt)}) ${eval.value})" -}.mkString(",") -val callFunc = s"${ctx.boxedType(ctx.javaType(dataType))} $resultTerm = " + - s"(${ctx.boxedType(ctx.javaType(dataType))})${catalystConverterTerm}" + -s".apply($funcTerm.apply($funcArguments));" +val (converters, funcArguments) = converterTerms.zipWithIndex.map { case (converter, i) => + val eval = evals(i) + val argTerm = ctx.freshName("arg") + val convert = s"Object $argTerm = ${eval.isNull} ? 
null : $converter.apply(${eval.value});" + (convert, argTerm) +}.unzip -evalCode + s""" - ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; - Boolean ${ev.isNull}; +val callFunc = s"${ctx.boxedType(dataType)} $resultTerm = " + + s"(${ctx.boxedType(dataType)})${catalystConverterTerm}" + +s".apply($funcTerm.apply(${funcArguments.mkString(", ")}));" +s""" + $evalCode + ${converters.mkString("\n")} $callFunc - ${ev.value} = $resultTerm; - ${ev.isNull} = $resultTerm == null; + boolean ${ev.isNull} = $resultTerm == null; + ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; + if (!${ev.isNull}) { +${ev.value} = $resultTerm; + } """ } http://git-wip-us.apache.org/repos/asf/spark/blob/c119a34d/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 8887dc6..5353fef 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -1144,9 +1144,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { // passing null into the UDF that could handle it val boxedUDF = udf[java.lang.Integer, java.lang.Integer] { - (i: java.lang.Integer) => if (i == null) -10 else i * 2 + (i: java.lang.Integer) => if (i == null) -10 else null } -checkAnswer(df.select(boxedUDF($"age")), Row(44) :: Row(-10) :: Nil) +checkAnswer(df.select(boxedUDF($"age")), Row(null) :: Row(-10) :: Nil) + +sqlContext.udf.register("boxedUDF", + (i: java.lang.Integer) => (if (i == null) -10 else null): java.lang.Integer) +checkAnswer(sql("select boxedUDF(null), boxedUDF(-1)"), Row(-10, null) :: Nil) val primitiveUDF = udf((i: Int) => i * 2) checkAnswer(df.select(primitiveUDF($"age")), Row(44) :: Row(null) :: Nil) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For
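A user-level Scala sketch of the behavior the generated code has to support: a UDF over boxed `java.lang.Integer` that can both receive and return null, so the generated wrapper cannot assume a primitive value on either side. Data and names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object BoxedUdfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("boxed-udf").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      ("alice", 22: java.lang.Integer),
      ("bob", null: java.lang.Integer)
    )).toDF("name", "age")

    // Returns -10 for a null input and null for a non-null input,
    // exercising null handling in both directions.
    val boxed = udf[java.lang.Integer, java.lang.Integer] {
      (i: java.lang.Integer) => if (i == null) -10 else null
    }

    df.select(boxed($"age")).show() // null for alice, -10 for bob's null age
    sc.stop()
  }
}
```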
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc2 [deleted] 3e39925f9
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc2 [created] 23f8dfd45
[2/2] spark git commit: Preparing development version 1.6.0-SNAPSHOT
Preparing development version 1.6.0-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2e452316 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2e452316 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2e452316 Branch: refs/heads/branch-1.6 Commit: 2e4523161ddf2417f2570bb75cc2d6694813adf5 Parents: 23f8dfd Author: Patrick WendellAuthored: Fri Dec 11 11:25:09 2015 -0800 Committer: Patrick Wendell Committed: Fri Dec 11 11:25:09 2015 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index fbabaa5..4b60ee0 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 1b3e417..672e946 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 15b8d75..61744bb 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index d579879..39d3f34 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 37b15bb..f5ab2a7 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 295455a..dceedcf 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/external/flume-sink/pom.xml -- diff --git 
a/external/flume-sink/pom.xml
[1/2] spark git commit: Preparing Spark release v1.6.0-rc2
Repository: spark Updated Branches: refs/heads/branch-1.6 eec36607f -> 2e4523161 Preparing Spark release v1.6.0-rc2 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/23f8dfd4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/23f8dfd4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/23f8dfd4 Branch: refs/heads/branch-1.6 Commit: 23f8dfd45187cb8f2216328ab907ddb5fbdffd0b Parents: eec3660 Author: Patrick WendellAuthored: Fri Dec 11 11:25:03 2015 -0800 Committer: Patrick Wendell Committed: Fri Dec 11 11:25:03 2015 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 4b60ee0..fbabaa5 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 672e946..1b3e417 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 61744bb..15b8d75 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 39d3f34..d579879 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index f5ab2a7..37b15bb 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index dceedcf..295455a 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/external/flume-sink/pom.xml
spark git commit: [SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisions
Repository: spark Updated Branches: refs/heads/branch-1.5 cb0246c93 -> 5e603a51c [SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisions In the current implementation of named expressions' `ExprIds`, we rely on a per-JVM AtomicLong to ensure that expression ids are unique within a JVM. However, these expression ids will not be _globally_ unique. This opens the potential for id collisions if new expression ids happen to be created inside of tasks rather than on the driver. There are currently a few cases where tasks allocate expression ids, which happen to be safe because those expressions are never compared to expressions created on the driver. In order to guard against the introduction of invalid comparisons between driver-created and executor-created expression ids, this patch extends `ExprId` to incorporate a UUID to identify the JVM that created the id, which prevents collisions. Author: Josh RosenCloses #9093 from JoshRosen/SPARK-11080. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5e603a51 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5e603a51 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5e603a51 Branch: refs/heads/branch-1.5 Commit: 5e603a51c09a94280c346bee12def0c49479d069 Parents: cb0246c Author: Josh Rosen Authored: Tue Oct 13 15:09:31 2015 -0700 Committer: Davies Liu Committed: Fri Dec 11 12:37:54 2015 -0800 -- .../sql/catalyst/expressions/namedExpressions.scala | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5e603a51/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala index 5768c60..8957df0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.catalyst.expressions +import java.util.UUID + import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions.codegen._ @@ -24,16 +26,23 @@ import org.apache.spark.sql.types._ object NamedExpression { private val curId = new java.util.concurrent.atomic.AtomicLong() - def newExprId: ExprId = ExprId(curId.getAndIncrement()) + private[expressions] val jvmId = UUID.randomUUID() + def newExprId: ExprId = ExprId(curId.getAndIncrement(), jvmId) def unapply(expr: NamedExpression): Option[(String, DataType)] = Some(expr.name, expr.dataType) } /** - * A globally unique (within this JVM) id for a given named expression. + * A globally unique id for a given named expression. * Used to identify which attribute output by a relation is being * referenced in a subsequent computation. + * + * The `id` field is unique within a given JVM, while the `uuid` is used to uniquely identify JVMs. */ -case class ExprId(id: Long) +case class ExprId(id: Long, jvmId: UUID) + +object ExprId { + def apply(id: Long): ExprId = ExprId(id, NamedExpression.jvmId) +} /** * An [[Expression]] that is named. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
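A standalone sketch of the id scheme, separate from the Spark classes themselves; names are illustrative.

```scala
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

// A counter that is only unique within one JVM, paired with a UUID minted
// once per JVM. Two JVMs (say, driver and executor) can both hand out
// counter value 0, but their UUIDs differ, so the combined ids never collide.
object JvmScopedIds {
  final case class Id(counter: Long, jvmId: UUID)

  private val counter = new AtomicLong()
  private val jvmId = UUID.randomUUID()

  def newId(): Id = Id(counter.getAndIncrement(), jvmId)

  def main(args: Array[String]): Unit = {
    val a = newId()
    val b = newId()
    println(s"$a and $b share a jvmId but differ in counter: ${a != b}")

    // An id minted in another JVM could carry the same counter value but a
    // different jvmId, so equality comparisons remain safe across JVMs.
    val fromOtherJvm = Id(a.counter, UUID.randomUUID())
    println(s"cross-JVM ids with equal counters are still distinct: ${a != fromOtherJvm}")
  }
}
```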
spark git commit: [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions
Repository: spark Updated Branches: refs/heads/master a0ff6d16e -> 1e799d617 [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur DaveCloses #10271 from ankurdave/SPARK-12298. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e799d61 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e799d61 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e799d61 Branch: refs/heads/master Commit: 1e799d617a28cd0eaa8f22d103ea8248c4655ae5 Parents: a0ff6d1 Author: Ankur Dave Authored: Fri Dec 11 19:07:48 2015 -0800 Committer: Yin Huai Committed: Fri Dec 11 19:07:48 2015 -0800 -- sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 2 +- .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e799d61/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index da180a2..497bd48 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -609,7 +609,7 @@ class DataFrame private[sql]( */ @scala.annotation.varargs def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame = { -sortWithinPartitions(sortCol, sortCols : _*) +sortWithinPartitions((sortCol +: sortCols).map(Column(_)) : _*) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/1e799d61/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 5353fef..c0bbf73 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -1090,8 +1090,8 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { } // Distribute into one partition and order by. This partition should contain all the values. -val df6 = data.repartition(1, $"a").sortWithinPartitions($"b".asc) -// Walk each partition and verify that it is sorted descending and not globally sorted. +val df6 = data.repartition(1, $"a").sortWithinPartitions("b") +// Walk each partition and verify that it is sorted ascending and not globally sorted. df6.rdd.foreachPartition { p => var previousValue: Int = -1 var allSequential: Boolean = true - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
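The bug pattern is easy to reproduce outside Spark; here is a standalone sketch with illustrative `Sorter` and `Col` types, not the DataFrame API itself.

```scala
// A String-varargs convenience overload that re-invokes itself instead of
// delegating to the typed overload recurses until StackOverflowError.
object OverloadRecursionSketch {
  final case class Col(name: String)

  class Sorter {
    // Buggy shape: `sort(sortCol, sortCols: _*)` resolves back to this very
    // String overload, never reaching the Col overload below.
    // def sort(sortCol: String, sortCols: String*): Sorter =
    //   sort(sortCol, sortCols: _*)

    // Fixed shape, mirroring the patch: convert to Col first, so the call
    // can only resolve to the Col-varargs overload.
    def sort(sortCol: String, sortCols: String*): Sorter =
      sort((sortCol +: sortCols).map(Col): _*)

    def sort(sortCols: Col*): Sorter = {
      println(s"sorting by ${sortCols.map(_.name).mkString(", ")}")
      this
    }
  }

  def main(args: Array[String]): Unit = {
    new Sorter().sort("b")      // prints: sorting by b
    new Sorter().sort("a", "b") // prints: sorting by a, b
  }
}
```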
spark git commit: [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions
Repository: spark Updated Branches: refs/heads/branch-1.6 c2f20469d -> 03d801587 [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur DaveCloses #10271 from ankurdave/SPARK-12298. (cherry picked from commit 1e799d617a28cd0eaa8f22d103ea8248c4655ae5) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03d80158 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03d80158 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03d80158 Branch: refs/heads/branch-1.6 Commit: 03d801587936fe92d4e7541711f1f41965e64956 Parents: c2f2046 Author: Ankur Dave Authored: Fri Dec 11 19:07:48 2015 -0800 Committer: Yin Huai Committed: Fri Dec 11 19:08:03 2015 -0800 -- sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 2 +- .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/03d80158/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index 1acfe84..cc8b70b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -609,7 +609,7 @@ class DataFrame private[sql]( */ @scala.annotation.varargs def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame = { -sortWithinPartitions(sortCol, sortCols : _*) +sortWithinPartitions((sortCol +: sortCols).map(Column(_)) : _*) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/03d80158/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 1763eb5..854dec0 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -1083,8 +1083,8 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { } // Distribute into one partition and order by. This partition should contain all the values. -val df6 = data.repartition(1, $"a").sortWithinPartitions($"b".asc) -// Walk each partition and verify that it is sorted descending and not globally sorted. +val df6 = data.repartition(1, $"a").sortWithinPartitions("b") +// Walk each partition and verify that it is sorted ascending and not globally sorted. df6.rdd.foreachPartition { p => var previousValue: Int = -1 var allSequential: Boolean = true - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation
Repository: spark Updated Branches: refs/heads/master 0fb982555 -> aa305dcaf [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation Adding in Pipeline Import and Export Documentation. Author: anabranchAuthor: Bill Chambers Closes #10179 from anabranch/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa305dca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa305dca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa305dca Branch: refs/heads/master Commit: aa305dcaf5b4148aba9e669e081d0b9235f50857 Parents: 0fb9825 Author: anabranch Authored: Fri Dec 11 12:55:56 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 12:55:56 2015 -0800 -- docs/ml-guide.md | 13 + 1 file changed, 13 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aa305dca/docs/ml-guide.md -- diff --git a/docs/ml-guide.md b/docs/ml-guide.md index 5c96c2b..44a316a 100644 --- a/docs/ml-guide.md +++ b/docs/ml-guide.md @@ -192,6 +192,10 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s. For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`. This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`. +## Saving and Loading Pipelines + +Often times it is worth it to save a model or a pipeline to disk for later use. In Spark 1.6, a model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML models. Please refer to the algorithm's API documentation to see if saving and loading is supported. + # Code examples This section gives code examples illustrating the functionality discussed above. @@ -455,6 +459,15 @@ val pipeline = new Pipeline() // Fit the pipeline to training documents. val model = pipeline.fit(training) +// now we can optionally save the fitted pipeline to disk +model.save("/tmp/spark-logistic-regression-model") + +// we can also save this unfit pipeline to disk +pipeline.save("/tmp/unfit-lr-model") + +// and load it back in during production +val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model") + // Prepare test documents, which are unlabeled (id, text) tuples. val test = sqlContext.createDataFrame(Seq( (4L, "spark i j k"), - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
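An end-to-end Scala sketch of the 1.6 persistence API described above. Paths and data are illustrative, and `PipelineModel.load` is used here to read the fitted model back; this is a sketch, not part of the commit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext

// Build a small text-classification pipeline, fit it, save both the unfit
// Pipeline and the fitted PipelineModel, then reload the model and use it.
object PipelinePersistenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-io").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val training = sc.parallelize(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0))).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)

    // Save the unfit pipeline and the fitted model (overwrite keeps reruns simple).
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // Reload the fitted model and use it directly for prediction.
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
    sameModel.transform(sc.parallelize(Seq((4L, "spark i j k"))).toDF("id", "text"))
      .select("id", "prediction").show()

    sc.stop()
  }
}
```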