spark git commit: [SPARK-12273][STREAMING] Make Spark Streaming web UI list Receivers in order
Repository: spark Updated Branches: refs/heads/master aa305dcaf -> 713e6959d [SPARK-12273][STREAMING] Make Spark Streaming web UI list Receivers in order Currently the Streaming web UI does NOT list Receivers in order; however, it seems more convenient for the users if Receivers are listed in order. ![spark-12273](https://cloud.githubusercontent.com/assets/15843379/11736602/0bb7f7a8-a00b-11e5-8e86-96ba9297fb12.png) Author: proflinCloses #10264 from proflin/Spark-12273. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/713e6959 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/713e6959 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/713e6959 Branch: refs/heads/master Commit: 713e6959d21d24382ef99bbd7e9da751a7ed388c Parents: aa305dc Author: proflin Authored: Fri Dec 11 13:50:36 2015 -0800 Committer: Shixiong Zhu Committed: Fri Dec 11 13:50:36 2015 -0800 -- .../scala/org/apache/spark/streaming/ui/StreamingPage.scala | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/713e6959/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala index 4588b21..88a4483 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala @@ -392,8 +392,9 @@ private[ui] class StreamingPage(parent: StreamingTab) maxX: Long, minY: Double, maxY: Double): Seq[Node] = { -val content = listener.receivedEventRateWithBatchTime.map { case (streamId, eventRates) => - generateInputDStreamRow(jsCollector, streamId, eventRates, minX, maxX, minY, maxY) +val content = listener.receivedEventRateWithBatchTime.toList.sortBy(_._1).map { + case (streamId, eventRates) => +generateInputDStreamRow(jsCollector, streamId, eventRates, minX, maxX, minY, maxY) }.foldLeft[Seq[Node]](Nil)(_ ++ _) // scalastyle:off - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
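To see the ordering change in isolation, here is a minimal plain-Scala sketch; the map and names are illustrative stand-ins for `listener.receivedEventRateWithBatchTime`, not the actual StreamingPage code.

```scala
// Standalone sketch of the ordering change: iterate a map keyed by stream id
// in sorted-key order instead of relying on the map's arbitrary iteration order.
object SortedReceiverRowsSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for listener.receivedEventRateWithBatchTime: stream id -> (batch time, rate)
    val eventRates: Map[Int, Seq[(Long, Double)]] = Map(
      2 -> Seq((1000L, 5.0)),
      0 -> Seq((1000L, 3.0)),
      1 -> Seq((1000L, 4.0)))

    // Sorting by the key (the stream id) makes the receiver listing deterministic.
    val rows = eventRates.toList.sortBy(_._1).map { case (streamId, rates) =>
      s"receiver $streamId: avg rate ${rates.map(_._2).sum / rates.size}"
    }
    rows.foreach(println) // receiver 0, then 1, then 2
  }
}
```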
spark git commit: [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
Repository: spark Updated Branches: refs/heads/branch-1.6 2ddd10486 -> bfcc8cfee [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatri x` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike DusenberryCloses #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfcc8cfe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfcc8cfe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfcc8cfe Branch: refs/heads/branch-1.6 Commit: bfcc8cfee7219e63d2f53fc36627f95dc60428eb Parents: 2ddd104 Author: Mike Dusenberry Authored: Fri Dec 11 14:21:33 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 14:21:48 2015 -0800 -- .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bfcc8cfe/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 2aa6aec..8d546e3 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1143,7 +1143,7 @@ private[python] class PythonMLLibAPI extends Serializable { * Wrapper around RowMatrix constructor. */ def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = { -new RowMatrix(rows.rdd, numRows, numCols) +new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
Repository: spark Updated Branches: refs/heads/branch-1.5 5e603a51c -> e4cf12118 [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatri x` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike DusenberryCloses #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4cf1211 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4cf1211 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4cf1211 Branch: refs/heads/branch-1.5 Commit: e4cf1211803097eb3cdd93deccb7eb996e12da3d Parents: 5e603a5 Author: Mike Dusenberry Authored: Fri Dec 11 14:21:33 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 14:22:37 2015 -0800 -- .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e4cf1211/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 06e13b7..2f8b5e5 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1101,7 +1101,7 @@ private[python] class PythonMLLibAPI extends Serializable { * Wrapper around RowMatrix constructor. */ def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = { -new RowMatrix(rows.rdd, numRows, numCols) +new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue
Repository: spark Updated Branches: refs/heads/master 713e6959d -> 1b8220387 [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatri x` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike DusenberryCloses #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b822038 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b822038 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b822038 Branch: refs/heads/master Commit: 1b8220387e6903564f765fabb54be0420c3e99d7 Parents: 713e695 Author: Mike Dusenberry Authored: Fri Dec 11 14:21:33 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 14:21:33 2015 -0800 -- .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1b822038/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index 2aa6aec..8d546e3 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1143,7 +1143,7 @@ private[python] class PythonMLLibAPI extends Serializable { * Wrapper around RowMatrix constructor. */ def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = { -new RowMatrix(rows.rdd, numRows, numCols) +new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols) } /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
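For readers who want the erasure issue and workaround in isolation, here is a rough sketch using plain String elements instead of MLlib Vectors. `retag` itself is package-private to Spark, so the sketch rebuilds the ClassTag the same way `retag` does internally; all names here are illustrative, not the PythonMLLibAPI code.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

// Minimal sketch of the erasure issue the patch works around. An RDD that
// crosses the Java/Py4J boundary comes back carrying an Object ClassTag, so
// Scala code that later needs the element tag (for example to build a
// Vector array) can fail with ClassCastException. Spark's internal fix is
// rows.rdd.retag(classOf[Vector]); this sketch re-attaches the tag with an
// identity map evaluated against an explicit ClassTag, which is what retag
// does under the hood.
object RetagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("retag-sketch").setMaster("local[1]"))

    // Simulate an RDD handed back from the Java API.
    val fromJava: RDD[String] = JavaRDD.fromRDD(sc.parallelize(Seq("a", "b", "c"))).rdd

    // Re-attach the intended element type before passing it to tag-sensitive code.
    val retagged: RDD[String] = fromJava.map(x => x)(ClassTag(classOf[String]))

    println(retagged.collect().mkString(",")) // a,b,c
    sc.stop()
  }
}
```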
spark git commit: [SPARK-12217][ML] Document invalid handling for StringIndexer
Repository: spark Updated Branches: refs/heads/branch-1.6 bfcc8cfee -> 75531c77e [SPARK-12217][ML] Document invalid handling for StringIndexer Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome. Author: BenFradetCloses #10257 from BenFradet/SPARK-12217. (cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75531c77 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75531c77 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75531c77 Branch: refs/heads/branch-1.6 Commit: 75531c77e85073c7be18985a54c623710894d861 Parents: bfcc8cf Author: BenFradet Authored: Fri Dec 11 15:43:00 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 15:43:09 2015 -0800 -- docs/ml-features.md | 36 1 file changed, 36 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/75531c77/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6494fed..8b00cc6 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -459,6 +459,42 @@ column, we should get the following: "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with index `2`. +Additionaly, there are two strategies regarding how `StringIndexer` will handle +unseen labels when you have fit a `StringIndexer` on one dataset and then use it +to transform another: + +- throw an exception (which is the default) +- skip the row containing the unseen label entirely + +**Examples** + +Let's go back to our previous example but this time reuse our previously defined +`StringIndexer` on the following dataset: + + + id | category +|-- + 0 | a + 1 | b + 2 | c + 3 | d + + +If you've not set how `StringIndexer` handles unseen labels or set it to +"error", an exception will be thrown. +However, if you had called `setHandleInvalid("skip")`, the following dataset +will be generated: + + + id | category | categoryIndex +|--|--- + 0 | a| 0.0 + 1 | b| 2.0 + 2 | c| 1.0 + + +Notice that the row containing "d" does not appear. + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12217][ML] Document invalid handling for StringIndexer
Repository: spark Updated Branches: refs/heads/master 1b8220387 -> aea676ca2 [SPARK-12217][ML] Document invalid handling for StringIndexer Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome. Author: BenFradetCloses #10257 from BenFradet/SPARK-12217. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aea676ca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aea676ca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aea676ca Branch: refs/heads/master Commit: aea676ca2d07c72b1a752e9308c961118e5bfc3c Parents: 1b82203 Author: BenFradet Authored: Fri Dec 11 15:43:00 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 15:43:00 2015 -0800 -- docs/ml-features.md | 36 1 file changed, 36 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aea676ca/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6494fed..8b00cc6 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -459,6 +459,42 @@ column, we should get the following: "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with index `2`. +Additionaly, there are two strategies regarding how `StringIndexer` will handle +unseen labels when you have fit a `StringIndexer` on one dataset and then use it +to transform another: + +- throw an exception (which is the default) +- skip the row containing the unseen label entirely + +**Examples** + +Let's go back to our previous example but this time reuse our previously defined +`StringIndexer` on the following dataset: + + + id | category +|-- + 0 | a + 1 | b + 2 | c + 3 | d + + +If you've not set how `StringIndexer` handles unseen labels or set it to +"error", an exception will be thrown. +However, if you had called `setHandleInvalid("skip")`, the following dataset +will be generated: + + + id | category | categoryIndex +|--|--- + 0 | a| 0.0 + 1 | b| 2.0 + 2 | c| 1.0 + + +Notice that the row containing "d" does not appear. + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
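A minimal Scala sketch of the documented behavior, assuming Spark 1.6's `ml` API; column names, app name, and data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SQLContext

// Fit a StringIndexer on one dataset, then transform another that contains
// an unseen label ("d"). With the default handleInvalid ("error") the
// transform would throw; with "skip" the offending row is dropped.
object HandleInvalidSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("string-indexer").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val train = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")))
      .toDF("id", "category")
    val test = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"))).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // default is "error": throw on unseen labels

    val model = indexer.fit(train)
    model.transform(test).show() // the row containing "d" does not appear

    sc.stop()
  }
}
```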
spark git commit: [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
Repository: spark Updated Branches: refs/heads/branch-1.6 03d801587 -> 47461fea7 [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases The existing sample functions miss the parameter `seed`, however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value. This could cause SparkR unit tests failed. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull Author: gatorsmileCloses #10160 from gatorsmile/sampleR. (cherry picked from commit 1e3526c2d3de723225024fedd45753b556e18fc6) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/47461fea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/47461fea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/47461fea Branch: refs/heads/branch-1.6 Commit: 47461fea7c079819de6add308f823c7a8294f891 Parents: 03d8015 Author: gatorsmile Authored: Fri Dec 11 20:55:16 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 20:55:24 2015 -0800 -- R/pkg/R/DataFrame.R | 17 +++-- R/pkg/inst/tests/testthat/test_sparkSQL.R | 4 2 files changed, 15 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/47461fea/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 975b058..764597d 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -662,6 +662,7 @@ setMethod("unique", #' @param x A SparkSQL DataFrame #' @param withReplacement Sampling with replacement or not #' @param fraction The (rough) sample target fraction +#' @param seed Randomness seed value #' #' @family DataFrame functions #' @rdname sample @@ -677,13 +678,17 @@ setMethod("unique", #' collect(sample(df, TRUE, 0.5)) #'} setMethod("sample", - # TODO : Figure out how to send integer as java.lang.Long to JVM so - # we can send seed as an argument through callJMethod signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { + function(x, withReplacement, fraction, seed) { if (fraction < 0.0) stop(cat("Negative fraction value:", fraction)) -sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +if (!missing(seed)) { + # TODO : Figure out how to send integer as java.lang.Long to JVM so + # we can send seed as an argument through callJMethod + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, as.integer(seed)) +} else { + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +} dataFrame(sdf) }) @@ -692,8 +697,8 @@ setMethod("sample", setMethod("sample_frac", signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { -sample(x, withReplacement, fraction) + function(x, withReplacement, fraction, seed) { +sample(x, withReplacement, fraction, seed) }) #' nrow http://git-wip-us.apache.org/repos/asf/spark/blob/47461fea/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index ed9b2c9..071fd31 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -724,6 +724,10 @@ test_that("sample on a DataFrame", { sampled2 <- sample(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled2) < 3) + count1 <- count(sample(df, 
FALSE, 0.1, 0)) + count2 <- count(sample(df, FALSE, 0.1, 0)) + expect_equal(count1, count2) + # Also test sample_frac sampled3 <- sample_frac(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled3) < 3) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases
Repository: spark Updated Branches: refs/heads/master 1e799d617 -> 1e3526c2d [SPARK-12158][SPARKR][SQL] Fix 'sample' functions that break R unit test cases The existing sample functions miss the parameter `seed`, however, the corresponding function interface in `generics` has such a parameter. Thus, although the function caller can call the function with the 'seed', we are not using the value. This could cause SparkR unit tests failed. For example, I hit it in another PR: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull Author: gatorsmileCloses #10160 from gatorsmile/sampleR. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e3526c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e3526c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e3526c2 Branch: refs/heads/master Commit: 1e3526c2d3de723225024fedd45753b556e18fc6 Parents: 1e799d6 Author: gatorsmile Authored: Fri Dec 11 20:55:16 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 20:55:16 2015 -0800 -- R/pkg/R/DataFrame.R | 17 +++-- R/pkg/inst/tests/testthat/test_sparkSQL.R | 4 2 files changed, 15 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e3526c2/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 975b058..764597d 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -662,6 +662,7 @@ setMethod("unique", #' @param x A SparkSQL DataFrame #' @param withReplacement Sampling with replacement or not #' @param fraction The (rough) sample target fraction +#' @param seed Randomness seed value #' #' @family DataFrame functions #' @rdname sample @@ -677,13 +678,17 @@ setMethod("unique", #' collect(sample(df, TRUE, 0.5)) #'} setMethod("sample", - # TODO : Figure out how to send integer as java.lang.Long to JVM so - # we can send seed as an argument through callJMethod signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { + function(x, withReplacement, fraction, seed) { if (fraction < 0.0) stop(cat("Negative fraction value:", fraction)) -sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +if (!missing(seed)) { + # TODO : Figure out how to send integer as java.lang.Long to JVM so + # we can send seed as an argument through callJMethod + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction, as.integer(seed)) +} else { + sdf <- callJMethod(x@sdf, "sample", withReplacement, fraction) +} dataFrame(sdf) }) @@ -692,8 +697,8 @@ setMethod("sample", setMethod("sample_frac", signature(x = "DataFrame", withReplacement = "logical", fraction = "numeric"), - function(x, withReplacement, fraction) { -sample(x, withReplacement, fraction) + function(x, withReplacement, fraction, seed) { +sample(x, withReplacement, fraction, seed) }) #' nrow http://git-wip-us.apache.org/repos/asf/spark/blob/1e3526c2/R/pkg/inst/tests/testthat/test_sparkSQL.R -- diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R index ed9b2c9..071fd31 100644 --- a/R/pkg/inst/tests/testthat/test_sparkSQL.R +++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R @@ -724,6 +724,10 @@ test_that("sample on a DataFrame", { sampled2 <- sample(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled2) < 3) + count1 <- count(sample(df, FALSE, 0.1, 0)) + count2 <- count(sample(df, FALSE, 0.1, 0)) + expect_equal(count1, count2) + # Also test 
sample_frac sampled3 <- sample_frac(df, FALSE, 0.1, 0) # set seed for predictable result expect_true(count(sampled3) < 3) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
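For context, a small Scala sketch of the JVM-side `DataFrame.sample` overloads that the SparkR wrapper now reaches through `callJMethod`: one overload without a seed and one with, where a fixed seed makes the sampled rows reproducible. App name and data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch: the seeded overload yields the same sample for the same seed.
object SeededSampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seeded-sample").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000).toDF("id")

    val unseeded = df.sample(false, 0.1)     // may differ from run to run
    val a = df.sample(false, 0.1, seed = 0L) // deterministic for a fixed seed
    val b = df.sample(false, 0.1, seed = 0L)

    println(s"unseeded=${unseeded.count()} a=${a.count()} b=${b.count()} equal=${a.count() == b.count()}")
    sc.stop()
  }
}
```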
spark git commit: [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py
Repository: spark Updated Branches: refs/heads/branch-1.6 75531c77e -> c2f20469d [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion. #9873 finished the work of Scala example, here we focus on the Python one. Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```. BTW, fix minor missing issues of #9873. cc mengxr Author: Yanbo LiangCloses #9957 from yanboliang/SPARK-11978. (cherry picked from commit a0ff6d16ef4bcc1b6ff7282e82a9b345d8449454) Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c2f20469 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c2f20469 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c2f20469 Branch: refs/heads/branch-1.6 Commit: c2f20469d5b53a027b022e3c4a9bea57452c5ba6 Parents: 75531c7 Author: Yanbo Liang Authored: Fri Dec 11 18:02:24 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 18:02:37 2015 -0800 -- .../src/main/python/ml/dataframe_example.py | 75 .../src/main/python/mllib/dataset_example.py| 63 .../spark/examples/ml/DataFrameExample.scala| 8 +-- 3 files changed, 79 insertions(+), 67 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c2f20469/examples/src/main/python/ml/dataframe_example.py -- diff --git a/examples/src/main/python/ml/dataframe_example.py b/examples/src/main/python/ml/dataframe_example.py new file mode 100644 index 000..d2644ca --- /dev/null +++ b/examples/src/main/python/ml/dataframe_example.py @@ -0,0 +1,75 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +An example of how to use DataFrame for ML. Run with:: +bin/spark-submit examples/src/main/python/ml/dataframe_example.py +""" +from __future__ import print_function + +import os +import sys +import tempfile +import shutil + +from pyspark import SparkContext +from pyspark.sql import SQLContext +from pyspark.mllib.stat import Statistics + +if __name__ == "__main__": +if len(sys.argv) > 2: +print("Usage: dataframe_example.py ", file=sys.stderr) +exit(-1) +sc = SparkContext(appName="DataFrameExample") +sqlContext = SQLContext(sc) +if len(sys.argv) == 2: +input = sys.argv[1] +else: +input = "data/mllib/sample_libsvm_data.txt" + +# Load input data +print("Loading LIBSVM file with UDT from " + input + ".") +df = sqlContext.read.format("libsvm").load(input).cache() +print("Schema from LIBSVM:") +df.printSchema() +print("Loaded training data as a DataFrame with " + + str(df.count()) + " records.") + +# Show statistical summary of labels. +labelSummary = df.describe("label") +labelSummary.show() + +# Convert features column to an RDD of vectors. 
+features = df.select("features").map(lambda r: r.features) +summary = Statistics.colStats(features) +print("Selected features column with average values:\n" + + str(summary.mean())) + +# Save the records in a parquet file. +tempdir = tempfile.NamedTemporaryFile(delete=False).name +os.unlink(tempdir) +print("Saving to " + tempdir + " as Parquet file.") +df.write.parquet(tempdir) + +# Load the records back. +print("Loading Parquet file with UDT from " + tempdir) +newDF = sqlContext.read.parquet(tempdir) +print("Schema from Parquet:") +newDF.printSchema() +shutil.rmtree(tempdir) + +sc.stop() http://git-wip-us.apache.org/repos/asf/spark/blob/c2f20469/examples/src/main/python/mllib/dataset_example.py -- diff --git a/examples/src/main/python/mllib/dataset_example.py b/examples/src/main/python/mllib/dataset_example.py deleted file
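A rough Scala sketch of the same flow as the relocated Python example: load LIBSVM data through the `libsvm` data source, summarize the label column, and round-trip the records through Parquet. The input path and the temp-dir handling are assumptions for illustration, not the commit's code.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameExampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dataframe-example").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)

    // Load LIBSVM data as a DataFrame with a vector-typed "features" column.
    val input = "data/mllib/sample_libsvm_data.txt" // assumed path, as in the Python example
    val df = sqlContext.read.format("libsvm").load(input).cache()
    println(s"Loaded ${df.count()} records; schema:")
    df.printSchema()

    // Statistical summary of the label column.
    df.describe("label").show()

    // Round-trip the records through Parquet.
    val parquetDir = Files.createTempDirectory("dataframe-example").resolve("parquet").toString
    df.write.parquet(parquetDir)
    sqlContext.read.parquet(parquetDir).printSchema()

    sc.stop()
  }
}
```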
spark git commit: [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
Repository: spark Updated Branches: refs/heads/master c119a34d1 -> 0fb982555 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c(âpath1â, âpath2â)) # character vector as arguments jsonFile(sqlContext, âpath1,path2â) ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side. * Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo LiangCloses #10145 from yanboliang/spark-12146. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0fb98255 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0fb98255 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0fb98255 Branch: refs/heads/master Commit: 0fb9825556dbbcc98d7eafe9ddea8676301e09bb Parents: c119a34 Author: Yanbo Liang Authored: Fri Dec 11 11:47:35 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 11:47:35 2015 -0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/DataFrame.R | 102 ++--- R/pkg/R/SQLContext.R | 29 +++--- R/pkg/inst/tests/testthat/test_sparkSQL.R | 120 ++--- examples/src/main/r/dataframe.R | 2 +- 5 files changed, 138 insertions(+), 116 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0fb98255/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index ba64bc5..cab39d6 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -267,6 +267,7 @@ export("as.DataFrame", "createExternalTable", "dropTempTable", "jsonFile", + "read.json", "loadDF", "parquetFile", "read.df", http://git-wip-us.apache.org/repos/asf/spark/blob/0fb98255/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f4c4a25..975b058 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -24,14 +24,14 @@ setOldClass("jobj") #' @title S4 class that represents a DataFrame #' @description DataFrames can be created using functions like \link{createDataFrame}, -#' \link{jsonFile}, \link{table} etc. +#' \link{read.json}, \link{table} etc. 
#' @family DataFrame functions #' @rdname DataFrame #' @docType class #' #' @slot env An R environment that stores bookkeeping states of the DataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame -#' @seealso \link{createDataFrame}, \link{jsonFile}, \link{table} +#' @seealso \link{createDataFrame}, \link{read.json}, \link{table} #' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} #' @export #' @examples @@ -77,7 +77,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' printSchema(df) #'} setMethod("printSchema", @@ -102,7 +102,7 @@ setMethod("printSchema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' dfSchema <- schema(df) #'} setMethod("schema", @@ -126,7 +126,7 @@ setMethod("schema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' explain(df, TRUE) #'} setMethod("explain", @@ -157,7 +157,7 @@ setMethod("explain", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' isLocal(df) #'} setMethod("isLocal", @@ -182,7 +182,7 @@ setMethod("isLocal", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' showDF(df) #'} setMethod("showDF", @@ -207,7 +207,7 @@ setMethod("showDF", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' df #'} setMethod("show", "DataFrame", @@ -234,7 +234,7 @@ setMethod("show",
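Purely for illustration, a conceptual Scala-side sketch of combining several JSON inputs into one DataFrame. The SparkR wrapper itself dispatches through `callJMethod`, and the exact JVM entry point it hits is not shown in this diff, so this union-based helper is only a stand-in.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object MultiPathJsonSketch {
  // Read each JSON path separately and union the results
  // (assumes the inputs share a schema).
  def readJsonPaths(sqlContext: SQLContext, paths: Seq[String]): DataFrame =
    paths.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-json").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)

    // Placeholder paths for illustration only.
    val df = readJsonPaths(sqlContext, Seq("path/to/part1.json", "path/to/part2.json"))
    df.printSchema()
    sc.stop()
  }
}
```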
spark git commit: [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files
Repository: spark Updated Branches: refs/heads/branch-1.6 2e4523161 -> f05bae4a3 [SPARK-12146][SPARKR] SparkR jsonFile should support multiple input files * ```jsonFile``` should support multiple input files, such as: ```R jsonFile(sqlContext, c(âpath1â, âpath2â)) # character vector as arguments jsonFile(sqlContext, âpath1,path2â) ``` * Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed at Spark 2.0. So we mark ```jsonFile``` deprecated and use ```read.json``` at SparkR side. * Replace all ```jsonFile``` with ```read.json``` at test_sparkSQL.R, but still keep jsonFile test case. * If this PR is accepted, we should also make almost the same change for ```parquetFile```. cc felixcheung sun-rui shivaram Author: Yanbo LiangCloses #10145 from yanboliang/spark-12146. (cherry picked from commit 0fb9825556dbbcc98d7eafe9ddea8676301e09bb) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f05bae4a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f05bae4a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f05bae4a Branch: refs/heads/branch-1.6 Commit: f05bae4a30c422f0d0b2ab1e41d32e9d483fa675 Parents: 2e45231 Author: Yanbo Liang Authored: Fri Dec 11 11:47:35 2015 -0800 Committer: Shivaram Venkataraman Committed: Fri Dec 11 11:47:43 2015 -0800 -- R/pkg/NAMESPACE | 1 + R/pkg/R/DataFrame.R | 102 ++--- R/pkg/R/SQLContext.R | 29 +++--- R/pkg/inst/tests/testthat/test_sparkSQL.R | 120 ++--- examples/src/main/r/dataframe.R | 2 +- 5 files changed, 138 insertions(+), 116 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f05bae4a/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index ba64bc5..cab39d6 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -267,6 +267,7 @@ export("as.DataFrame", "createExternalTable", "dropTempTable", "jsonFile", + "read.json", "loadDF", "parquetFile", "read.df", http://git-wip-us.apache.org/repos/asf/spark/blob/f05bae4a/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index f4c4a25..975b058 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -24,14 +24,14 @@ setOldClass("jobj") #' @title S4 class that represents a DataFrame #' @description DataFrames can be created using functions like \link{createDataFrame}, -#' \link{jsonFile}, \link{table} etc. +#' \link{read.json}, \link{table} etc. 
#' @family DataFrame functions #' @rdname DataFrame #' @docType class #' #' @slot env An R environment that stores bookkeeping states of the DataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame -#' @seealso \link{createDataFrame}, \link{jsonFile}, \link{table} +#' @seealso \link{createDataFrame}, \link{read.json}, \link{table} #' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} #' @export #' @examples @@ -77,7 +77,7 @@ dataFrame <- function(sdf, isCached = FALSE) { #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' printSchema(df) #'} setMethod("printSchema", @@ -102,7 +102,7 @@ setMethod("printSchema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' dfSchema <- schema(df) #'} setMethod("schema", @@ -126,7 +126,7 @@ setMethod("schema", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' explain(df, TRUE) #'} setMethod("explain", @@ -157,7 +157,7 @@ setMethod("explain", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' isLocal(df) #'} setMethod("isLocal", @@ -182,7 +182,7 @@ setMethod("isLocal", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <- jsonFile(sqlContext, path) +#' df <- read.json(sqlContext, path) #' showDF(df) #'} setMethod("showDF", @@ -207,7 +207,7 @@ setMethod("showDF", #' sc <- sparkR.init() #' sqlContext <- sparkRSQL.init(sc) #' path <- "path/to/file.json" -#' df <-
spark git commit: [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up)
Repository: spark Updated Branches: refs/heads/master 518ab5101 -> c119a34d1 [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up) This is a follow-up PR for #10259 Author: Davies LiuCloses #10266 from davies/null_udf2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c119a34d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c119a34d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c119a34d Branch: refs/heads/master Commit: c119a34d1e9e599e302acfda92e5de681086a19f Parents: 518ab51 Author: Davies Liu Authored: Fri Dec 11 11:15:53 2015 -0800 Committer: Davies Liu Committed: Fri Dec 11 11:15:53 2015 -0800 -- .../sql/catalyst/expressions/ScalaUDF.scala | 31 +++- .../org/apache/spark/sql/DataFrameSuite.scala | 8 +++-- 2 files changed, 23 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c119a34d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala index 5deb2f8..85faa19 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala @@ -1029,24 +1029,27 @@ case class ScalaUDF( // such as IntegerType, its javaType is `int` and the returned type of user-defined // function is Object. Trying to convert an Object to `int` will cause casting exception. val evalCode = evals.map(_.code).mkString -val funcArguments = converterTerms.zipWithIndex.map { - case (converter, i) => -val eval = evals(i) -val dt = children(i).dataType -s"$converter.apply(${eval.isNull} ? null : (${ctx.boxedType(dt)}) ${eval.value})" -}.mkString(",") -val callFunc = s"${ctx.boxedType(ctx.javaType(dataType))} $resultTerm = " + - s"(${ctx.boxedType(ctx.javaType(dataType))})${catalystConverterTerm}" + -s".apply($funcTerm.apply($funcArguments));" +val (converters, funcArguments) = converterTerms.zipWithIndex.map { case (converter, i) => + val eval = evals(i) + val argTerm = ctx.freshName("arg") + val convert = s"Object $argTerm = ${eval.isNull} ? 
null : $converter.apply(${eval.value});" + (convert, argTerm) +}.unzip -evalCode + s""" - ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; - Boolean ${ev.isNull}; +val callFunc = s"${ctx.boxedType(dataType)} $resultTerm = " + + s"(${ctx.boxedType(dataType)})${catalystConverterTerm}" + +s".apply($funcTerm.apply(${funcArguments.mkString(", ")}));" +s""" + $evalCode + ${converters.mkString("\n")} $callFunc - ${ev.value} = $resultTerm; - ${ev.isNull} = $resultTerm == null; + boolean ${ev.isNull} = $resultTerm == null; + ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; + if (!${ev.isNull}) { +${ev.value} = $resultTerm; + } """ } http://git-wip-us.apache.org/repos/asf/spark/blob/c119a34d/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 8887dc6..5353fef 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -1144,9 +1144,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { // passing null into the UDF that could handle it val boxedUDF = udf[java.lang.Integer, java.lang.Integer] { - (i: java.lang.Integer) => if (i == null) -10 else i * 2 + (i: java.lang.Integer) => if (i == null) -10 else null } -checkAnswer(df.select(boxedUDF($"age")), Row(44) :: Row(-10) :: Nil) +checkAnswer(df.select(boxedUDF($"age")), Row(null) :: Row(-10) :: Nil) + +sqlContext.udf.register("boxedUDF", + (i: java.lang.Integer) => (if (i == null) -10 else null): java.lang.Integer) +checkAnswer(sql("select boxedUDF(null), boxedUDF(-1)"), Row(-10, null) :: Nil) val primitiveUDF = udf((i: Int) => i * 2) checkAnswer(df.select(primitiveUDF($"age")), Row(44) :: Row(null) :: Nil) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For
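A user-level Scala sketch of the behavior the generated code has to support: a UDF over boxed `java.lang.Integer` that can both receive and return null, so the generated wrapper cannot assume a primitive value on either side. Data and names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object BoxedUdfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("boxed-udf").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      ("alice", 22: java.lang.Integer),
      ("bob", null: java.lang.Integer)
    )).toDF("name", "age")

    // Returns -10 for a null input and null for a non-null input,
    // exercising null handling in both directions.
    val boxed = udf[java.lang.Integer, java.lang.Integer] {
      (i: java.lang.Integer) => if (i == null) -10 else null
    }

    df.select(boxed($"age")).show() // null for alice, -10 for bob's null age
    sc.stop()
  }
}
```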
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc2 [deleted] 3e39925f9
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.0-rc2 [created] 23f8dfd45
[2/2] spark git commit: Preparing development version 1.6.0-SNAPSHOT
Preparing development version 1.6.0-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2e452316 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2e452316 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2e452316 Branch: refs/heads/branch-1.6 Commit: 2e4523161ddf2417f2570bb75cc2d6694813adf5 Parents: 23f8dfd Author: Patrick WendellAuthored: Fri Dec 11 11:25:09 2015 -0800 Committer: Patrick Wendell Committed: Fri Dec 11 11:25:09 2015 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index fbabaa5..4b60ee0 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 1b3e417..672e946 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 15b8d75..61744bb 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index d579879..39d3f34 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 37b15bb..f5ab2a7 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 295455a..dceedcf 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0 +1.6.0-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/2e452316/external/flume-sink/pom.xml -- diff --git 
a/external/flume-sink/pom.xml
[1/2] spark git commit: Preparing Spark release v1.6.0-rc2
Repository: spark Updated Branches: refs/heads/branch-1.6 eec36607f -> 2e4523161 Preparing Spark release v1.6.0-rc2 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/23f8dfd4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/23f8dfd4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/23f8dfd4 Branch: refs/heads/branch-1.6 Commit: 23f8dfd45187cb8f2216328ab907ddb5fbdffd0b Parents: eec3660 Author: Patrick WendellAuthored: Fri Dec 11 11:25:03 2015 -0800 Committer: Patrick Wendell Committed: Fri Dec 11 11:25:03 2015 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 4b60ee0..fbabaa5 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 672e946..1b3e417 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 61744bb..15b8d75 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 39d3f34..d579879 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index f5ab2a7..37b15bb 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index dceedcf..295455a 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.0-SNAPSHOT +1.6.0 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/23f8dfd4/external/flume-sink/pom.xml
spark git commit: [SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisions
Repository: spark Updated Branches: refs/heads/branch-1.5 cb0246c93 -> 5e603a51c [SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisions In the current implementation of named expressions' `ExprIds`, we rely on a per-JVM AtomicLong to ensure that expression ids are unique within a JVM. However, these expression ids will not be _globally_ unique. This opens the potential for id collisions if new expression ids happen to be created inside of tasks rather than on the driver. There are currently a few cases where tasks allocate expression ids, which happen to be safe because those expressions are never compared to expressions created on the driver. In order to guard against the introduction of invalid comparisons between driver-created and executor-created expression ids, this patch extends `ExprId` to incorporate a UUID to identify the JVM that created the id, which prevents collisions. Author: Josh RosenCloses #9093 from JoshRosen/SPARK-11080. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5e603a51 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5e603a51 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5e603a51 Branch: refs/heads/branch-1.5 Commit: 5e603a51c09a94280c346bee12def0c49479d069 Parents: cb0246c Author: Josh Rosen Authored: Tue Oct 13 15:09:31 2015 -0700 Committer: Davies Liu Committed: Fri Dec 11 12:37:54 2015 -0800 -- .../sql/catalyst/expressions/namedExpressions.scala | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5e603a51/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala index 5768c60..8957df0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala @@ -17,6 +17,8 @@ package org.apache.spark.sql.catalyst.expressions +import java.util.UUID + import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions.codegen._ @@ -24,16 +26,23 @@ import org.apache.spark.sql.types._ object NamedExpression { private val curId = new java.util.concurrent.atomic.AtomicLong() - def newExprId: ExprId = ExprId(curId.getAndIncrement()) + private[expressions] val jvmId = UUID.randomUUID() + def newExprId: ExprId = ExprId(curId.getAndIncrement(), jvmId) def unapply(expr: NamedExpression): Option[(String, DataType)] = Some(expr.name, expr.dataType) } /** - * A globally unique (within this JVM) id for a given named expression. + * A globally unique id for a given named expression. * Used to identify which attribute output by a relation is being * referenced in a subsequent computation. + * + * The `id` field is unique within a given JVM, while the `uuid` is used to uniquely identify JVMs. */ -case class ExprId(id: Long) +case class ExprId(id: Long, jvmId: UUID) + +object ExprId { + def apply(id: Long): ExprId = ExprId(id, NamedExpression.jvmId) +} /** * An [[Expression]] that is named. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
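A standalone sketch of the id scheme, separate from the Spark classes themselves; names are illustrative.

```scala
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

// A counter that is only unique within one JVM, paired with a UUID minted
// once per JVM. Two JVMs (say, driver and executor) can both hand out
// counter value 0, but their UUIDs differ, so the combined ids never collide.
object JvmScopedIds {
  final case class Id(counter: Long, jvmId: UUID)

  private val counter = new AtomicLong()
  private val jvmId = UUID.randomUUID()

  def newId(): Id = Id(counter.getAndIncrement(), jvmId)

  def main(args: Array[String]): Unit = {
    val a = newId()
    val b = newId()
    println(s"$a and $b share a jvmId but differ in counter: ${a != b}")

    // An id minted in another JVM could carry the same counter value but a
    // different jvmId, so equality comparisons remain safe across JVMs.
    val fromOtherJvm = Id(a.counter, UUID.randomUUID())
    println(s"cross-JVM ids with equal counters are still distinct: ${a != fromOtherJvm}")
  }
}
```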
spark git commit: [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions
Repository: spark Updated Branches: refs/heads/master a0ff6d16e -> 1e799d617 [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur DaveCloses #10271 from ankurdave/SPARK-12298. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e799d61 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e799d61 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e799d61 Branch: refs/heads/master Commit: 1e799d617a28cd0eaa8f22d103ea8248c4655ae5 Parents: a0ff6d1 Author: Ankur Dave Authored: Fri Dec 11 19:07:48 2015 -0800 Committer: Yin Huai Committed: Fri Dec 11 19:07:48 2015 -0800 -- sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 2 +- .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e799d61/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index da180a2..497bd48 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -609,7 +609,7 @@ class DataFrame private[sql]( */ @scala.annotation.varargs def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame = { -sortWithinPartitions(sortCol, sortCols : _*) +sortWithinPartitions((sortCol +: sortCols).map(Column(_)) : _*) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/1e799d61/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 5353fef..c0bbf73 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -1090,8 +1090,8 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { } // Distribute into one partition and order by. This partition should contain all the values. -val df6 = data.repartition(1, $"a").sortWithinPartitions($"b".asc) -// Walk each partition and verify that it is sorted descending and not globally sorted. +val df6 = data.repartition(1, $"a").sortWithinPartitions("b") +// Walk each partition and verify that it is sorted ascending and not globally sorted. df6.rdd.foreachPartition { p => var previousValue: Int = -1 var allSequential: Boolean = true - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
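The bug pattern is easy to reproduce outside Spark; here is a standalone sketch with illustrative `Sorter` and `Col` types, not the DataFrame API itself.

```scala
// A String-varargs convenience overload that re-invokes itself instead of
// delegating to the typed overload recurses until StackOverflowError.
object OverloadRecursionSketch {
  final case class Col(name: String)

  class Sorter {
    // Buggy shape: `sort(sortCol, sortCols: _*)` resolves back to this very
    // String overload, never reaching the Col overload below.
    // def sort(sortCol: String, sortCols: String*): Sorter =
    //   sort(sortCol, sortCols: _*)

    // Fixed shape, mirroring the patch: convert to Col first, so the call
    // can only resolve to the Col-varargs overload.
    def sort(sortCol: String, sortCols: String*): Sorter =
      sort((sortCol +: sortCols).map(Col): _*)

    def sort(sortCols: Col*): Sorter = {
      println(s"sorting by ${sortCols.map(_.name).mkString(", ")}")
      this
    }
  }

  def main(args: Array[String]): Unit = {
    new Sorter().sort("b")      // prints: sorting by b
    new Sorter().sort("a", "b") // prints: sorting by a, b
  }
}
```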
spark git commit: [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions
Repository: spark Updated Branches: refs/heads/branch-1.6 c2f20469d -> 03d801587 [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur DaveCloses #10271 from ankurdave/SPARK-12298. (cherry picked from commit 1e799d617a28cd0eaa8f22d103ea8248c4655ae5) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03d80158 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03d80158 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03d80158 Branch: refs/heads/branch-1.6 Commit: 03d801587936fe92d4e7541711f1f41965e64956 Parents: c2f2046 Author: Ankur Dave Authored: Fri Dec 11 19:07:48 2015 -0800 Committer: Yin Huai Committed: Fri Dec 11 19:08:03 2015 -0800 -- sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala | 2 +- .../src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/03d80158/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala index 1acfe84..cc8b70b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala @@ -609,7 +609,7 @@ class DataFrame private[sql]( */ @scala.annotation.varargs def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame = { -sortWithinPartitions(sortCol, sortCols : _*) +sortWithinPartitions((sortCol +: sortCols).map(Column(_)) : _*) } /** http://git-wip-us.apache.org/repos/asf/spark/blob/03d80158/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala index 1763eb5..854dec0 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala @@ -1083,8 +1083,8 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { } // Distribute into one partition and order by. This partition should contain all the values. -val df6 = data.repartition(1, $"a").sortWithinPartitions($"b".asc) -// Walk each partition and verify that it is sorted descending and not globally sorted. +val df6 = data.repartition(1, $"a").sortWithinPartitions("b") +// Walk each partition and verify that it is sorted ascending and not globally sorted. df6.rdd.foreachPartition { p => var previousValue: Int = -1 var allSequential: Boolean = true - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation
Repository: spark Updated Branches: refs/heads/master 0fb982555 -> aa305dcaf [SPARK-11964][DOCS][ML] Add in Pipeline Import/Export Documentation Adding in Pipeline Import and Export Documentation. Author: anabranchAuthor: Bill Chambers Closes #10179 from anabranch/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aa305dca Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aa305dca Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aa305dca Branch: refs/heads/master Commit: aa305dcaf5b4148aba9e669e081d0b9235f50857 Parents: 0fb9825 Author: anabranch Authored: Fri Dec 11 12:55:56 2015 -0800 Committer: Joseph K. Bradley Committed: Fri Dec 11 12:55:56 2015 -0800 -- docs/ml-guide.md | 13 + 1 file changed, 13 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aa305dca/docs/ml-guide.md -- diff --git a/docs/ml-guide.md b/docs/ml-guide.md index 5c96c2b..44a316a 100644 --- a/docs/ml-guide.md +++ b/docs/ml-guide.md @@ -192,6 +192,10 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s. For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`. This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`. +## Saving and Loading Pipelines + +Often times it is worth it to save a model or a pipeline to disk for later use. In Spark 1.6, a model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML models. Please refer to the algorithm's API documentation to see if saving and loading is supported. + # Code examples This section gives code examples illustrating the functionality discussed above. @@ -455,6 +459,15 @@ val pipeline = new Pipeline() // Fit the pipeline to training documents. val model = pipeline.fit(training) +// now we can optionally save the fitted pipeline to disk +model.save("/tmp/spark-logistic-regression-model") + +// we can also save this unfit pipeline to disk +pipeline.save("/tmp/unfit-lr-model") + +// and load it back in during production +val sameModel = Pipeline.load("/tmp/spark-logistic-regression-model") + // Prepare test documents, which are unlabeled (id, text) tuples. val test = sqlContext.createDataFrame(Seq( (4L, "spark i j k"), - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
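An end-to-end Scala sketch of the 1.6 persistence API described above. Paths and data are illustrative, and `PipelineModel.load` is used here to read the fitted model back; this is a sketch, not part of the commit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SQLContext

// Build a small text-classification pipeline, fit it, save both the unfit
// Pipeline and the fitted PipelineModel, then reload the model and use it.
object PipelinePersistenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-io").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val training = sc.parallelize(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0))).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)

    // Save the unfit pipeline and the fitted model (overwrite keeps reruns simple).
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // Reload the fitted model and use it directly for prediction.
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
    sameModel.transform(sc.parallelize(Seq((4L, "spark i j k"))).toDF("id", "text"))
      .select("id", "prediction").show()

    sc.stop()
  }
}
```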