spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 b632e775c -> 085f731ad

[SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

## What changes were proposed in this pull request?

The map view returned by Scala's mapValues is not serializable. To avoid the serialization issue while running PageRank, use map (followed by toMap) instead of mapValues.

Closes #22271 from shahidki31/master_latest.

Authored-by: Shahid
Signed-off-by: Joseph K. Bradley
(cherry picked from commit 3b6591b0b064b13a411e5b8f8ee4883a69c39e2d)
Signed-off-by: Joseph K. Bradley

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/085f731a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/085f731a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/085f731a

Branch: refs/heads/branch-2.4
Commit: 085f731adb9b8c82a2bf4bbcae6d889a967fbd53
Parents: b632e77
Author: Shahid
Authored: Thu Sep 6 09:52:58 2018 -0700
Committer: Joseph K. Bradley
Committed: Thu Sep 6 09:53:07 2018 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/graphx/lib/PageRank.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/085f731a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
index 96b635f..1305c05 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
@@ -198,9 +198,11 @@ object PageRank extends Logging {
     val zero = Vectors.sparse(sources.size, List()).asBreeze
     // map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0
-    val sourcesInitMap = sources.zipWithIndex.toMap.mapValues { i =>
-      Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
-    }
+    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
+      val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
+      (vid, v)
+    }.toMap
+
     val sc = graph.vertices.sparkContext
     val sourcesInitMapBC = sc.broadcast(sourcesInitMap)
     // Initialize the PageRank graph with each edge attribute having
spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception
Repository: spark
Updated Branches:
  refs/heads/master 7ef6d1daf -> 3b6591b0b

[SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

## What changes were proposed in this pull request?

The map view returned by Scala's mapValues is not serializable. To avoid the serialization issue while running PageRank, use map (followed by toMap) instead of mapValues.

Closes #22271 from shahidki31/master_latest.

Authored-by: Shahid
Signed-off-by: Joseph K. Bradley

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b6591b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b6591b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b6591b0

Branch: refs/heads/master
Commit: 3b6591b0b064b13a411e5b8f8ee4883a69c39e2d
Parents: 7ef6d1d
Author: Shahid
Authored: Thu Sep 6 09:52:58 2018 -0700
Committer: Joseph K. Bradley
Committed: Thu Sep 6 09:52:58 2018 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/graphx/lib/PageRank.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/3b6591b0/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
index 96b635f..1305c05 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
@@ -198,9 +198,11 @@ object PageRank extends Logging {
     val zero = Vectors.sparse(sources.size, List()).asBreeze
     // map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0
-    val sourcesInitMap = sources.zipWithIndex.toMap.mapValues { i =>
-      Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
-    }
+    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
+      val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
+      (vid, v)
+    }.toMap
+
     val sc = graph.vertices.sparkContext
     val sourcesInitMapBC = sc.broadcast(sourcesInitMap)
     // Initialize the PageRank graph with each edge attribute having
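The root cause called out above is a long-standing Scala collections quirk: Map#mapValues returns a lazy view that wraps the original map and the mapping function, and that view class is not Serializable, so broadcasting it fails. The following standalone sketch (no Spark needed; the object name and toy values are illustrative only) contrasts the two patterns from the diff; the expected results reflect the Scala 2.11/2.12 behavior described in the JIRA.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object MapValuesSerializationSketch {

  /** Returns true if Java serialization of `obj` succeeds. */
  private def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val sources = Array(1L, 2L, 3L)

    // Pattern removed by this commit: mapValues produces a lazy, non-serializable view.
    val viaMapValues = sources.zipWithIndex.toMap.mapValues(i => Array(i.toDouble))

    // Pattern introduced by this commit: build the pairs eagerly and materialize a plain Map.
    val viaMap = sources.zipWithIndex.map { case (vid, i) => (vid, Array(i.toDouble)) }.toMap

    println(s"mapValues view serializable: ${isSerializable(viaMapValues)}") // expected: false
    println(s"map(...).toMap serializable: ${isSerializable(viaMap)}")       // expected: true
  }
}
```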
[GitHub] spark issue #22271: [SPARK-25268][GraphX]run Parallel Personalized PageRank ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22271 test this please
[GitHub] spark issue #22271: [SPARK-25268][GraphX]run Parallel Personalized PageRank ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22271 LGTM I tested this locally and confirmed it fixes the serialization issue. Thank you @shahidki31 ! Merging with master after fresh tests finish
spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3
Repository: spark Updated Branches: refs/heads/branch-2.3 42c1fdd22 -> f5983823e [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3 ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. (Please fill in changes proposed in this fix) ## How was this patch tested? Unit Test added Closes #8 from huaxingao/spark-25124-2.3. Authored-by: Huaxin Gao Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5983823 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5983823 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5983823 Branch: refs/heads/branch-2.3 Commit: f5983823e9b4a3b4762481306ea071a73f5742fc Parents: 42c1fdd Author: Huaxin Gao Authored: Fri Aug 24 15:41:18 2018 -0700 Committer: Joseph K. Bradley Committed: Fri Aug 24 15:41:18 2018 -0700 -- python/pyspark/ml/feature.py | 4 ++-- python/pyspark/ml/tests.py | 17 + 2 files changed, 19 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5983823/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 04b07e6..a444fe0 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -3673,12 +3673,12 @@ class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReada @since("2.3.0") def getSize(self): """ Gets size param, the size of vectors in `inputCol`.""" -self.getOrDefault(self.size) +return self.getOrDefault(self.size) @since("2.3.0") def setSize(self, value): """ Sets size param, the size of vectors in `inputCol`.""" -self._set(size=value) +return self._set(size=value) if __name__ == "__main__": http://git-wip-us.apache.org/repos/asf/spark/blob/f5983823/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 1af2b91..49912d2 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -678,6 +678,23 @@ class FeatureTests(SparkSessionTestCase): expected2 = [Row(id=0, indexed=0.0), Row(id=1, indexed=1.0)] self.assertEqual(actual2, expected2) +def test_vector_size_hint(self): +df = self.spark.createDataFrame( +[(0, Vectors.dense([0.0, 10.0, 0.5])), + (1, Vectors.dense([1.0, 11.0, 0.5, 0.6])), + (2, Vectors.dense([2.0, 12.0]))], +["id", "vector"]) + +sizeHint = VectorSizeHint( +inputCol="vector", +handleInvalid="skip") +sizeHint.setSize(3) +self.assertEqual(sizeHint.getSize(), 3) + +output = sizeHint.transform(df).head().vector +expected = DenseVector([0.0, 10.0, 0.5]) +self.assertEqual(output, expected) + class HasInducedError(Params): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
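For comparison, the chained-setter style that the added return statements restore on the Python side mirrors the Scala VectorSizeHint API, whose setters return the transformer itself. A rough Scala sketch of the equivalent workflow (the local master URL, column names, and sample rows are assumptions for illustration, not taken from the patch):

```scala
import org.apache.spark.ml.feature.VectorSizeHint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object VectorSizeHintChainingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("size-hint").getOrCreate()

    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 10.0, 0.5)),
      (1, Vectors.dense(1.0, 11.0, 0.5, 0.6)), // wrong size; dropped by handleInvalid = "skip"
      (2, Vectors.dense(2.0, 12.0))            // wrong size; dropped by handleInvalid = "skip"
    )).toDF("id", "vector")

    // Each setter returns this.type, which is what makes chaining possible. The Python bug
    // fixed here was that setSize/getSize forgot to `return`, so the analogous chain broke.
    val sizeHint = new VectorSizeHint()
      .setInputCol("vector")
      .setHandleInvalid("skip")
      .setSize(3)

    println(sizeHint.getSize) // 3
    sizeHint.transform(df).show(truncate = false)
    spark.stop()
  }
}
```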
[GitHub] spark issue #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/8 Awesome, thank you! LGTM Merging with branch-2.3
[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22136 Well, this merged successfully with master but not with 2.3; it seemed to pull in code from another PR, strangely. Would you mind sending a backport PR against branch-2.3? Thank you!
spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values
Repository: spark Updated Branches: refs/heads/master 8ed044928 -> b5e118808 [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. ## How was this patch tested? I tested the changes on my local. Closes #22136 from huaxingao/spark-25124. Authored-by: Huaxin Gao Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5e11880 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5e11880 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5e11880 Branch: refs/heads/master Commit: b5e11880871d6ef31efe3ec42b3caa0fc403e71b Parents: 8ed0449 Author: Huaxin Gao Authored: Thu Aug 23 16:17:27 2018 -0700 Committer: Joseph K. Bradley Committed: Thu Aug 23 16:17:27 2018 -0700 -- python/pyspark/ml/feature.py | 4 ++-- python/pyspark/ml/tests.py | 17 + 2 files changed, 19 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b5e11880/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index ddba738..760aa82 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -3843,12 +3843,12 @@ class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReada @since("2.3.0") def getSize(self): """ Gets size param, the size of vectors in `inputCol`.""" -self.getOrDefault(self.size) +return self.getOrDefault(self.size) @since("2.3.0") def setSize(self, value): """ Sets size param, the size of vectors in `inputCol`.""" -self._set(size=value) +return self._set(size=value) if __name__ == "__main__": http://git-wip-us.apache.org/repos/asf/spark/blob/b5e11880/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index a770bad..5c87d1d 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -844,6 +844,23 @@ class FeatureTests(SparkSessionTestCase): .select(model_default.getOrDefault(model_default.outputCol)).collect() self.assertEqual(len(transformed_list), 5) +def test_vector_size_hint(self): +df = self.spark.createDataFrame( +[(0, Vectors.dense([0.0, 10.0, 0.5])), + (1, Vectors.dense([1.0, 11.0, 0.5, 0.6])), + (2, Vectors.dense([2.0, 12.0]))], +["id", "vector"]) + +sizeHint = VectorSizeHint( +inputCol="vector", +handleInvalid="skip") +sizeHint.setSize(3) +self.assertEqual(sizeHint.getSize(), 3) + +output = sizeHint.transform(df).head().vector +expected = DenseVector([0.0, 10.0, 0.5]) +self.assertEqual(output, expected) + class HasInducedError(Params): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22136 LGTM Merging with master. I'll try to backport it to 2.3 too. Thanks!
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/22136#discussion_r212483950 --- Diff: python/pyspark/ml/tests.py --- @@ -844,6 +844,28 @@ def test_string_indexer_from_labels(self): .select(model_default.getOrDefault(model_default.outputCol)).collect() self.assertEqual(len(transformed_list), 5) +def test_vector_size_hint(self): --- End diff -- Thanks! FYI this still isn't really testing the return value of setSize, but I think it's OK since we don't really do that anywhere else : P and I'm confident in the above change.
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/22136#discussion_r212069129 --- Diff: python/pyspark/ml/tests.py --- @@ -844,6 +844,28 @@ def test_string_indexer_from_labels(self): .select(model_default.getOrDefault(model_default.outputCol)).collect() self.assertEqual(len(transformed_list), 5) +def test_vector_size_hint(self): --- End diff -- This test doesn't test the 2 functions which were buggy. Could you please test those (and simplify the test if possible)? Thanks!
spark git commit: [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds
Repository: spark Updated Branches: refs/heads/master 99d2e4e00 -> 72ecfd095 [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds ## What changes were proposed in this pull request? runParallelPersonalizedPageRank in graphx checks that `sources` are <= Int.MaxValue.toLong, but this is not actually required. This check seems to have been added because we use sparse vectors in the implementation and sparse vectors cannot be indexed by values > MAX_INT. However we do not ever index the sparse vector by the source vertexIds so this isn't an issue. I've added a test with large vertexIds to confirm this works as expected. ## How was this patch tested? Unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22139 from MrBago/remove-veretexId-check-pppr. Authored-by: Bago Amirbekian Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72ecfd09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72ecfd09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72ecfd09 Branch: refs/heads/master Commit: 72ecfd095062ad61c073f9b97bf3c47644575d60 Parents: 99d2e4e Author: Bago Amirbekian Authored: Tue Aug 21 15:21:55 2018 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 21 15:21:55 2018 -0700 -- .../org/apache/spark/graphx/lib/PageRank.scala | 28 ++--- .../apache/spark/graphx/lib/PageRankSuite.scala | 32 +++- 2 files changed, 35 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/72ecfd09/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala -- diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala index ebd65e8..96b635f 100644 --- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala +++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala @@ -184,9 +184,11 @@ object PageRank extends Logging { * indexed by the position of nodes in the sources list) and * edge attributes the normalized edge weight */ - def runParallelPersonalizedPageRank[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], -numIter: Int, resetProb: Double = 0.15, -sources: Array[VertexId]): Graph[Vector, Double] = { + def runParallelPersonalizedPageRank[VD: ClassTag, ED: ClassTag]( + graph: Graph[VD, ED], + numIter: Int, + resetProb: Double = 0.15, + sources: Array[VertexId]): Graph[Vector, Double] = { require(numIter > 0, s"Number of iterations must be greater than 0," + s" but got ${numIter}") require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" + @@ -194,15 +196,11 @@ object PageRank extends Logging { require(sources.nonEmpty, s"The list of sources must be non-empty," + s" but got ${sources.mkString("[", ",", "]")}") -// TODO if one sources vertex id is outside of the int range -// we won't be able to store its activations in a sparse vector -require(sources.max <= Int.MaxValue.toLong, - s"This implementation currently only works for source vertex ids at most ${Int.MaxValue}") val zero = Vectors.sparse(sources.size, List()).asBreeze -val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) => - val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze - (vid, v) -}.toMap +// map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0 +val sourcesInitMap = sources.zipWithIndex.toMap.mapValues { i => + 
Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze +} val sc = graph.vertices.sparkContext val sourcesInitMapBC = sc.broadcast(sourcesInitMap) // Initialize the PageRank graph with each edge attribute having @@ -212,13 +210,7 @@ object PageRank extends Logging { .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) } // Set the weight on the edges based on the degree .mapTriplets(e => 1.0 / e.srcAttr, TripletFields.Src) - .mapVertices { (vid, attr) => -if (sourcesInitMapBC.value contains vid) { - sourcesInitMapBC.value(vid) -} else { - zero -} - } + .mapVertices((vid, _) => sourcesInitMapBC.value.getOrElse(vid, zero)) var i = 0 while (i < numIter) { http://git-wip-us.apache.org/repos/asf/spark/blob/72ecfd09/graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala -- diff --git
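The argument in the description is easier to see in isolation: each source's one-hot vector is indexed by the source's position in the sources array (always a small Int), and the possibly huge vertex id is only ever used as a map key. A minimal sketch using plain ml.linalg vectors (the real code converts to Breeze vectors, and the object name here is illustrative):

```scala
import org.apache.spark.ml.linalg.Vectors

object LargeVertexIdSketch {
  def main(args: Array[String]): Unit = {
    // The second vertex id does not fit in an Int, which is exactly the case the
    // removed require(...) used to reject.
    val sources: Array[Long] = Array(4L, Int.MaxValue.toLong + 100L)

    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
      // `i` is the position of vid in `sources` (0 or 1 here), so it is always a valid
      // sparse-vector index no matter how large `vid` itself is.
      (vid, Vectors.sparse(sources.length, Array(i), Array(1.0)))
    }.toMap

    sourcesInitMap.foreach { case (vid, vec) => println(s"$vid -> $vec") }
  }
}
```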
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22139 LGTM I'll merge this with master Thanks @MrBago !
spark git commit: [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented.
Repository: spark Updated Branches: refs/heads/master 244bcff19 -> 3cb1b5780 [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented. ## What changes were proposed in this pull request? Followup for #21719. Update spark.ml training code to fully wrap instrumented methods and remove old instrumentation APIs. ## How was this patch tested? existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian Closes #21799 from MrBago/new-instrumentation-apis2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3cb1b578 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3cb1b578 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3cb1b578 Branch: refs/heads/master Commit: 3cb1b57809d0b4a93223669f5c10cea8fc53eff6 Parents: 244bcff Author: Bago Amirbekian Authored: Fri Jul 20 12:13:15 2018 -0700 Committer: Joseph K. Bradley Committed: Fri Jul 20 12:13:15 2018 -0700 -- .../classification/DecisionTreeClassifier.scala | 24 +- .../spark/ml/classification/GBTClassifier.scala | 14 +++--- .../spark/ml/classification/LinearSVC.scala | 12 ++--- .../ml/classification/LogisticRegression.scala | 2 +- .../MultilayerPerceptronClassifier.scala| 14 +++--- .../spark/ml/classification/NaiveBayes.scala| 12 ++--- .../spark/ml/classification/OneVsRest.scala | 9 ++-- .../classification/RandomForestClassifier.scala | 13 +++--- .../spark/ml/clustering/BisectingKMeans.scala | 12 ++--- .../spark/ml/clustering/GaussianMixture.scala | 12 ++--- .../org/apache/spark/ml/clustering/KMeans.scala | 9 ++-- .../org/apache/spark/ml/clustering/LDA.scala| 12 ++--- .../org/apache/spark/ml/fpm/FPGrowth.scala | 12 ++--- .../apache/spark/ml/recommendation/ALS.scala| 9 ++-- .../ml/regression/AFTSurvivalRegression.scala | 13 +++--- .../ml/regression/DecisionTreeRegressor.scala | 24 +- .../spark/ml/regression/GBTRegressor.scala | 12 ++--- .../GeneralizedLinearRegression.scala | 12 ++--- .../ml/regression/IsotonicRegression.scala | 12 ++--- .../spark/ml/regression/LinearRegression.scala | 21 - .../ml/regression/RandomForestRegressor.scala | 14 +++--- .../apache/spark/ml/tuning/CrossValidator.scala | 9 ++-- .../spark/ml/tuning/TrainValidationSplit.scala | 9 ++-- .../apache/spark/ml/util/Instrumentation.scala | 47 ++-- 24 files changed, 153 insertions(+), 186 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3cb1b578/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index c9786f1..8a57bfc 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -29,6 +29,7 @@ import org.apache.spark.ml.tree._ import org.apache.spark.ml.tree.DecisionTreeModelReadWrite._ import org.apache.spark.ml.tree.impl.RandomForest import org.apache.spark.ml.util._ +import org.apache.spark.ml.util.Instrumentation.instrumented import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy} import org.apache.spark.mllib.tree.model.{DecisionTreeModel => OldDecisionTreeModel} import org.apache.spark.rdd.RDD @@ -96,8 +97,10 @@ class DecisionTreeClassifier @Since("1.4.0") ( @Since("1.6.0") override def setSeed(value: Long): this.type = 
set(seed, value) - override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { -val instr = Instrumentation.create(this, dataset) + override protected def train( + dataset: Dataset[_]): DecisionTreeClassificationModel = instrumented { instr => +instr.logPipelineStage(this) +instr.logDataset(dataset) val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) val numClasses: Int = getNumClasses(dataset) @@ -112,30 +115,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, numClasses) val strategy = getOldStrategy(categoricalFeatures, numClasses) -instr.logParams(maxDepth, maxBins, minInstancesPerNode, minInfoGain, maxMemoryInMB, +instr.logParams(this, maxDepth, maxBins, minInstancesPerNode, minInfoGain, maxMemoryInMB, cacheNodeIds, checkpointInterval, impurity, seed) val trees = RandomForest.run(oldDataset, strategy, numTrees = 1,
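To make the shape of the new API concrete, here is a heavily simplified, standalone sketch of the `instrumented { instr => ... }` pattern that the training methods above now follow. The names below are illustrative stand-ins; the real org.apache.spark.ml.util.Instrumentation is a private Spark API with richer logging.

```scala
import scala.util.control.NonFatal

// Toy stand-in for Spark's Instrumentation class.
final class Instr {
  def logParams(names: String*): Unit = println(s"params: ${names.mkString(", ")}")
  def logNumFeatures(n: Long): Unit = println(s"numFeatures: $n")
  def logSuccess(): Unit = println("training succeeded")
  def logFailure(e: Throwable): Unit = println(s"training failed: ${e.getMessage}")
}

object Instr {
  // The whole training body runs inside the block, so success/failure logging cannot be
  // forgotten by an individual train() implementation, which was easy to do with the old
  // create(...) / logSuccess(model) pair.
  def instrumented[T](body: Instr => T): T = {
    val instr = new Instr
    try {
      val result = body(instr)
      instr.logSuccess()
      result
    } catch {
      case NonFatal(e) =>
        instr.logFailure(e)
        throw e
    }
  }
}

object TrainExample {
  // Mirrors the new style of DecisionTreeClassifier.train shown in the diff above.
  def train(): String = Instr.instrumented { instr =>
    instr.logParams("maxDepth", "maxBins")
    instr.logNumFeatures(10)
    "model" // stand-in for the fitted model
  }
}
```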
[GitHub] spark issue #21799: [SPARK-24852][ML] Update spark.ml to use Instrumentation...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21799 LGTM Merging with master Thanks @MrBago !
spark git commit: [SPARK-24747][ML] Make Instrumentation class more flexible
Repository: spark Updated Branches: refs/heads/master 7688ce88b -> 912634b00 [SPARK-24747][ML] Make Instrumentation class more flexible ## What changes were proposed in this pull request? This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When these APIs are merged, I'll followup with a PR to update the training code to use these new APIs so we can remove the old APIs. These changes are all to private APIs so this PR doesn't make any user facing changes. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian Closes #21719 from MrBago/new-instrumentation-apis. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/912634b0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/912634b0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/912634b0 Branch: refs/heads/master Commit: 912634b004c2302533a8a8501b4ecb803d17e335 Parents: 7688ce8 Author: Bago Amirbekian Authored: Tue Jul 17 13:11:52 2018 -0700 Committer: Joseph K. Bradley Committed: Tue Jul 17 13:11:52 2018 -0700 -- .../ml/classification/LogisticRegression.scala | 8 +- .../spark/ml/tree/impl/RandomForest.scala | 2 +- .../spark/ml/tuning/ValidatorParams.scala | 2 +- .../apache/spark/ml/util/Instrumentation.scala | 128 --- .../apache/spark/mllib/clustering/KMeans.scala | 4 +- 5 files changed, 93 insertions(+), 51 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/912634b0/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 92e342e..25fb9c8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -35,6 +35,7 @@ import org.apache.spark.ml.optim.loss.{L2Regularization, RDDLossFunction} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ +import org.apache.spark.ml.util.Instrumentation.instrumented import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics} import org.apache.spark.mllib.linalg.VectorImplicits._ import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer @@ -490,7 +491,7 @@ class LogisticRegression @Since("1.2.0") ( protected[spark] def train( dataset: Dataset[_], - handlePersistence: Boolean): LogisticRegressionModel = { + handlePersistence: Boolean): LogisticRegressionModel = instrumented { instr => val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map { @@ -500,7 +501,8 @@ class LogisticRegression @Since("1.2.0") ( if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) -val instr = Instrumentation.create(this, dataset) +instr.logPipelineStage(this) +instr.logDataset(dataset) instr.logParams(regParam, elasticNetParam, standardization, threshold, maxIter, tol, fitIntercept) @@ -905,8 +907,6 @@ class LogisticRegression @Since("1.2.0") ( objectiveHistory) } model.setSummary(Some(logRegSummary)) -instr.logSuccess(model) -model } @Since("1.4.0") 
http://git-wip-us.apache.org/repos/asf/spark/blob/912634b0/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala index 9058701..bb3f3a0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala @@ -91,7 +91,7 @@ private[spark] object RandomForest extends Logging { numTrees: Int, featureSubsetStrategy: String, seed: Long, - instr: Option[Instrumentation[_]], + instr: Option[Instrumentation], prune: Boolean = true, // exposed for testing only, real trees are always pruned parentUID: Option[String] = None): Array[DecisionTreeModel] = { http://git-wip-us.apache.org/repos/asf/spark/blob/912634b0/mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala
[GitHub] spark issue #21719: [SPARK-24747][ML] Make Instrumentation class more flexib...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21719 LGTM Merging with master Thanks @MrBago !
[GitHub] spark pull request #21719: [SPARK-24747][ML] Make Instrumentation class more...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21719#discussion_r202800969 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -488,9 +488,10 @@ class LogisticRegression @Since("1.2.0") ( train(dataset, handlePersistence) } + import Instrumentation.instrumented --- End diff -- Put import at top of file with the other imports (just to make imports easier to track).
[GitHub] spark pull request #21719: [SPARK-24747][ML] Make Instrumentation class more...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21719#discussion_r202805710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -19,45 +19,60 @@ package org.apache.spark.ml.util import java.util.UUID -import scala.reflect.ClassTag +import scala.util.{Failure, Success, Try} +import scala.util.control.NonFatal import org.json4s._ import org.json4s.JsonDSL._ import org.json4s.jackson.JsonMethods._ import org.apache.spark.internal.Logging -import org.apache.spark.ml.{Estimator, Model} -import org.apache.spark.ml.param.Param +import org.apache.spark.ml.{Estimator, Model, PipelineStage} +import org.apache.spark.ml.param.{Param, Params} import org.apache.spark.rdd.RDD import org.apache.spark.sql.Dataset import org.apache.spark.util.Utils /** * A small wrapper that defines a training session for an estimator, and some methods to log * useful information during this session. - * - * A new instance is expected to be created within fit(). - * - * @param estimator the estimator that is being fit - * @param dataset the training dataset - * @tparam E the type of the estimator */ -private[spark] class Instrumentation[E <: Estimator[_]] private ( -val estimator: E, -val dataset: RDD[_]) extends Logging { +private[spark] class Instrumentation extends Logging { private val id = UUID.randomUUID() - private val prefix = { + private val shortId = id.toString.take(8) + private val prefix = s"[$shortId] " + + // TODO: update spark.ml to use new Instrumentation APIs and remove this constructor + var stage: Params = _ --- End diff -- I'd recommend we either plan to remove "stage" or change "logPipelineStage" so it only allows setting "stage" once. If we go with the former, how about leaving a note to remove "stage" once spark.ml code is migrated to use the new logParams() method? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21719: [SPARK-24747][ML] Make Instrumentation class more flexib...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21719 jenkins test this please
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20442 @huaxingao Thanks for this follow-up! I realized that https://github.com/apache/spark/pull/19715 introduced a breaking change which we missed in Spark 2.3 QA: In Spark 2.2, a user could set inputCol but not set outputCol (since outputCol has a default value). The new check causes such user code to start failing in Spark 2.3. Since you're already working on this follow-up, would you mind adding a unit test which checks this? (Setting inputCol but not outputCol, and making sure that works.)
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21129 I made a JIRA for the Python part of this: https://issues.apache.org/jira/browse/SPARK-24333
spark git commit: [SPARK-7132][ML] Add fit with validation set to spark.ml GBT
Repository: spark Updated Branches: refs/heads/master a33dcf4a0 -> ffaefe755 [SPARK-7132][ML] Add fit with validation set to spark.ml GBT ## What changes were proposed in this pull request? Add fit with validation set to spark.ml GBT ## How was this patch tested? Will add later. Author: WeichenXuCloses #21129 from WeichenXu123/gbt_fit_validation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ffaefe75 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ffaefe75 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ffaefe75 Branch: refs/heads/master Commit: ffaefe755e20cb94e27f07b233615a4bbb476679 Parents: a33dcf4 Author: WeichenXu Authored: Mon May 21 13:05:17 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 21 13:05:17 2018 -0700 -- .../spark/ml/classification/GBTClassifier.scala | 38 +--- .../ml/param/shared/SharedParamsCodeGen.scala | 5 +- .../spark/ml/param/shared/sharedParams.scala| 17 +++ .../spark/ml/regression/GBTRegressor.scala | 31 +++-- .../org/apache/spark/ml/tree/treeParams.scala | 41 - .../ml/classification/GBTClassifierSuite.scala | 46 +++ .../spark/ml/regression/GBTRegressorSuite.scala | 48 +++- project/MimaExcludes.scala | 13 +- 8 files changed, 213 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ffaefe75/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 3fb6d1e..337133a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala @@ -146,12 +146,21 @@ class GBTClassifier @Since("1.4.0") ( @Since("1.4.0") def setLossType(value: String): this.type = set(lossType, value) + /** @group setParam */ + @Since("2.4.0") + def setValidationIndicatorCol(value: String): this.type = { +set(validationIndicatorCol, value) + } + override protected def train(dataset: Dataset[_]): GBTClassificationModel = { val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) + +val withValidation = isDefined(validationIndicatorCol) && $(validationIndicatorCol).nonEmpty + // We copy and modify this from Classifier.extractLabeledPoints since GBT only supports // 2 classes now. This lets us provide a more precise error message. 
-val oldDataset: RDD[LabeledPoint] = +val convert2LabeledPoint = (dataset: Dataset[_]) => { dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { case Row(label: Double, features: Vector) => require(label == 0 || label == 1, s"GBTClassifier was given" + @@ -159,7 +168,18 @@ class GBTClassifier @Since("1.4.0") ( s" GBTClassifier currently only supports binary classification.") LabeledPoint(label, features) } -val numFeatures = oldDataset.first().features.size +} + +val (trainDataset, validationDataset) = if (withValidation) { + ( + convert2LabeledPoint(dataset.filter(not(col($(validationIndicatorCol), +convert2LabeledPoint(dataset.filter(col($(validationIndicatorCol + ) +} else { + (convert2LabeledPoint(dataset), null) +} + +val numFeatures = trainDataset.first().features.size val boostingStrategy = super.getOldBoostingStrategy(categoricalFeatures, OldAlgo.Classification) val numClasses = 2 @@ -169,15 +189,21 @@ class GBTClassifier @Since("1.4.0") ( s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}") } -val instr = Instrumentation.create(this, oldDataset) +val instr = Instrumentation.create(this, dataset) instr.logParams(labelCol, featuresCol, predictionCol, impurity, lossType, maxDepth, maxBins, maxIter, maxMemoryInMB, minInfoGain, minInstancesPerNode, - seed, stepSize, subsamplingRate, cacheNodeIds, checkpointInterval, featureSubsetStrategy) + seed, stepSize, subsamplingRate, cacheNodeIds, checkpointInterval, featureSubsetStrategy, + validationIndicatorCol) instr.logNumFeatures(numFeatures) instr.logNumClasses(numClasses) -val (baseLearners, learnerWeights) = GradientBoostedTrees.run(oldDataset, boostingStrategy, - $(seed), $(featureSubsetStrategy)) +val (baseLearners, learnerWeights) = if
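As a usage illustration (not part of the patch), a caller on Spark 2.4+ marks the held-out rows with a boolean column and points validationIndicatorCol at it; boosting then stops early once the validation error stops improving. The input path, split ratio, and column name below are assumptions:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object GBTWithValidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("gbt-validation").getOrCreate()

    // Assumes a libsvm-formatted file with the usual "label" and "features" columns.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Mark roughly 20% of the rows as the validation set.
    val Array(train, validation) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val withIndicator = train.withColumn("isVal", lit(false))
      .union(validation.withColumn("isVal", lit(true)))

    val gbt = new GBTClassifier()
      .setMaxIter(50)
      .setValidationIndicatorCol("isVal") // the new param added by this commit

    val model = gbt.fit(withIndicator)
    // Early stopping may train fewer than maxIter trees.
    println(s"trees actually trained: ${model.trees.length}")
    spark.stop()
  }
}
```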
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21129 LGTM Merging with master Thanks!
[GitHub] spark issue #21163: [SPARK-24097][ML] Instrumentation improvements - RandomF...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21163 jenkins test this please
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 Done! Here it is: https://github.com/apache/spark/pull/21358 @smurakozi Could you please close this issue and help review the new PR if you have time? Thanks!
[GitHub] spark pull request #21358: [SPARK-22884][ML] ML tests for StructuredStreamin...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21358

[SPARK-22884][ML] ML tests for StructuredStreaming: spark.ml.clustering

## What changes were proposed in this pull request?

Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.

This PR is a new version of https://github.com/apache/spark/pull/20319

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark smurakozi-SPARK-22884

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21358.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21358

commit 496bc7ba688260317fe5907a8057347a5b184490
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-18T19:03:33Z

    Converted all clustering tests to check streaming

commit 58239ba3edc0c60900d842caf3d38e30507d885f
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-18T20:12:46Z

    formatting, nits

commit eb66659e8b65de9d539584b935899838e1dd9ed8
Author: Joseph K. Bradley <joseph@...>
Date: 2018-05-17T21:26:17Z

    cleanups

commit 95f6f2919a5ab0853434592000f5b811915ca15f
Author: Joseph K. Bradley <joseph@...>
Date: 2018-05-17T22:11:20Z

    more cleanups
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 I'm going to take this over to get this done, but @smurakozi you'll be the primary author. I'll link the PR here in a minute
[GitHub] spark issue #21163: [SPARK-24097][ML] Instrumentation improvements - RandomF...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21163 LGTM pending fresh tests Thanks!
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21344 Whoops! This was merged linked to the wrong JIRA. It should have been: https://issues.apache.org/jira/browse/SPARK-24310 for the record.
spark git commit: [SPARK-24114] Add instrumentation to FPGrowth.
Repository: spark
Updated Branches:
  refs/heads/master a7a9b1837 -> 439c69511

[SPARK-24114] Add instrumentation to FPGrowth.

## What changes were proposed in this pull request?

Have FPGrowth keep track of model training using the Instrumentation class.

## How was this patch tested?

manually

Author: Bago Amirbekian

Closes #21344 from MrBago/fpgrowth-instr.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/439c6951
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/439c6951
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/439c6951

Branch: refs/heads/master
Commit: 439c69511812776cb4b82956547ce958d0669c52
Parents: a7a9b18
Author: Bago Amirbekian
Authored: Thu May 17 13:42:10 2018 -0700
Committer: Joseph K. Bradley
Committed: Thu May 17 13:42:10 2018 -0700

----------------------------------------------------------------------
 mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/439c6951/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
index 0bf405d..d7fbe28 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
@@ -161,6 +161,8 @@ class FPGrowth @Since("2.2.0") (
   private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
     val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+    val instr = Instrumentation.create(this, dataset)
+    instr.logParams(params: _*)
     val data = dataset.select($(itemsCol))
     val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
     val mllibFP = new MLlibFPGrowth().setMinSupport($(minSupport))
@@ -183,7 +185,9 @@ class FPGrowth @Since("2.2.0") (
       items.unpersist()
     }
-    copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this)
+    val model = copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this)
+    instr.logSuccess(model)
+    model
   }

   @Since("2.2.0")
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21344 LGTM It'll be nice when we move the implementation to spark.ml in the future so that we can log more info, but this is good for now. Thanks @MrBago and @ludatabricks ! Merging with master
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21344 I'll take a look now
[GitHub] spark pull request #21347: [SPARK-24290][ML] add support for Array input for...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21347#discussion_r189057734 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -125,6 +125,19 @@ private[spark] class Instrumentation[E <: Estimator[_]] private ( log(compact(render(name -> value))) } + def logNamedValue(name: String, value: Array[String]): Unit = { +log(compact(render(name -> value.toSeq))) --- End diff -- I see, so you're pointing out that our current approach is inconsistent: JSONifying array values (for clustering) but not JSONifying scalar values (Long, etc.)? I don't have a good sense of what's best, but if we just have to pick something, I'd suggest not JSONifying values since doing so creates slightly longer strings to log (with extra quotes) and since that maintains a stable format from Spark 2.3.
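For reference, this is roughly what the two styles under discussion produce once rendered with json4s (the param names are made up for illustration): scalars are logged bare, while an array rendered as JSON gains brackets and per-element quotes.

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object LogNamedValueSketch {
  def main(args: Array[String]): Unit = {
    // Scalar value, as logged today for Long/Int params.
    println(compact(render("numClasses" -> 2)))          // {"numClasses":2}
    // Array value rendered as a JSON array, the new overload being discussed.
    println(compact(render("sources" -> Seq("a", "b")))) // {"sources":["a","b"]}
  }
}
```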
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r189052646 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,37 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, --- End diff -- This doc is not quite accurate. Can you please update it to: ``` Threshold for stopping early when fit with validation is used. (This parameter is ignored when fit without validation is used.) The decision to stop early is decided based on this logic: If the current loss on the validation set is greater than 0.01, the diff of validation error is compared to relative tolerance which is validationTol * (current loss on the validation set). If the current loss on the validation set is less than or equal to 0.01, the diff of validation error is compared to absolute tolerance which is validationTol * 0.01. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
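In code form, the stopping rule spelled out in that suggested doc is roughly the following predicate (a paraphrase of the check in GradientBoostedTrees.runWithValidation, shown as a sketch rather than a verbatim excerpt):

```scala
// Stop once the improvement in validation error drops below a tolerance that is
// relative (validationTol * currentError) while the loss is still above 0.01, and
// absolute (validationTol * 0.01) once the loss is at or below 0.01.
def shouldStopEarly(previousError: Double, currentError: Double, validationTol: Double): Boolean = {
  val tolerance = validationTol * math.max(currentError, 0.01)
  previousError - currentError < tolerance
}
```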
[GitHub] spark pull request #21090: [SPARK-24026][ML] Add Power Iteration Clustering ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21090#discussion_r188813735 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types._ + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be 1. Default: 2. + * @group param + */ + @Since("2.4.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.4.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use a normalized sum of similarities with other vertices. + * Default: random. + * @group expertParam + */ + @Since("2.4.0") + final val initMode = { +val allowedParams = ParamValidators.inArray(Array("random", "degree")) +new Param[String](this, "initMode", "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum " + + "of similarities with other vertices. Supported options: 'random' and 'degree'.", + allowedParams) + } + + /** @group expertGetParam */ + @Since("2.4.0") + def getInitMode: String = $(initMode) + + /** + * Param for the name of the input column for vertex IDs. + * Default: "id" + * @group param + */ + @Since("2.4.0") + val idCol = new Param[String](this, "idCol", "Name of the input column for vertex IDs.", +(value: String) => value.nonEmpty) + + setDefault(idCol, "id") + + /** @group getParam */ + @Since("2.4.0") + def getIdCol: String = getOrDefault(idCol) + + /** + * Param for the name of the input column for neighbors in the adjacency list representation. 
+ * Default: "neighbors" + * @group param + */ + @Since("2.4.0") + val neighborsCol = new Param[String](this, "neighborsCol", +"Name of the input column for neighbors in the adjacency list representation.", +(value: String) => value.nonEmpty) + + setDefault(neighborsCol, "neighbors") + + /** @group getParam */ + @Since("2.4.0") + def getNeighborsCol: String = $(neighborsCol) + + /** + * Param for the name of the input column for neighbors in the adjacency list representation. + * Default: "similarities" + * @group param + */ + @Since("2.4.0") + val similaritiesCol = new Param[String](this, "similaritiesCol", +"Name of the input column for neighbors in the adjacency list representation.", +(value: String) => value.nonEmpty) + --- End diff -- No, it's meant to
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188813405 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- @WeichenXu123 Do you have time to send a PR to update this API? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188813297 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- Oh, I think you're right @mengxr . That approach sounds good. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
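For orientation, the spark.mllib implementation that this new spark.ml wrapper delegates to already exposes the setter-based style being debated here; a condensed sketch of its documented usage (the toy sequences are arbitrary):

```scala
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

object PrefixSpanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("prefixspan").getOrCreate()
    val sc = spark.sparkContext

    // Each sequence is an array of itemsets; each itemset is an array of items.
    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()

    val model = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .run(sequences)

    model.freqSequences.collect().foreach { fs =>
      println(fs.sequence.map(_.mkString("[", ",", "]")).mkString("<", ",", ">") + " freq: " + fs.freq)
    }
    spark.stop()
  }
}
```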
spark git commit: [SPARK-22210][ML] Add seed for LDA variationalTopicInference
Repository: spark Updated Branches: refs/heads/master 991726f31 -> bfd75cdfb [SPARK-22210][ML] Add seed for LDA variationalTopicInference ## What changes were proposed in this pull request? - Add seed parameter for variationalTopicInference - Add seed for calling variationalTopicInference in submitMiniBatch - Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution. ## How was this patch tested? Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANGCloses #21183 from ludatabricks/SPARK-22210. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfd75cdf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfd75cdf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfd75cdf Branch: refs/heads/master Commit: bfd75cdfb22a8c2fb005da597621e1ccd3990e82 Parents: 991726f Author: Lu WANG Authored: Wed May 16 17:54:06 2018 -0700 Committer: Joseph K. Bradley Committed: Wed May 16 17:54:06 2018 -0700 -- .../org/apache/spark/ml/clustering/LDA.scala| 6 ++- .../spark/mllib/clustering/LDAModel.scala | 34 +--- .../spark/mllib/clustering/LDAOptimizer.scala | 42 +++- .../apache/spark/ml/clustering/LDASuite.scala | 6 +++ 4 files changed, 64 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bfd75cdf/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala index afe599c..fed42c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala @@ -569,10 +569,14 @@ abstract class LDAModel private[ml] ( class LocalLDAModel private[ml] ( uid: String, vocabSize: Int, -@Since("1.6.0") override private[clustering] val oldLocalModel: OldLocalLDAModel, +private[clustering] val oldLocalModel_ : OldLocalLDAModel, sparkSession: SparkSession) extends LDAModel(uid, vocabSize, sparkSession) { + override private[clustering] def oldLocalModel: OldLocalLDAModel = { +oldLocalModel_.setSeed(getSeed) + } + @Since("1.6.0") override def copy(extra: ParamMap): LocalLDAModel = { val copied = new LocalLDAModel(uid, vocabSize, oldLocalModel, sparkSession) http://git-wip-us.apache.org/repos/asf/spark/blob/bfd75cdf/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala index b8a6e94..f915062 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala @@ -32,7 +32,7 @@ import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors} import org.apache.spark.mllib.util.{Loader, Saveable} import org.apache.spark.rdd.RDD import org.apache.spark.sql.{Row, SparkSession} -import org.apache.spark.util.BoundedPriorityQueue +import org.apache.spark.util.{BoundedPriorityQueue, Utils} /** * Latent Dirichlet Allocation (LDA) model. 
@@ -194,6 +194,8 @@ class LocalLDAModel private[spark] ( override protected[spark] val gammaShape: Double = 100) extends LDAModel with Serializable { + private var seed: Long = Utils.random.nextLong() + @Since("1.3.0") override def k: Int = topics.numCols @@ -216,6 +218,21 @@ class LocalLDAModel private[spark] ( override protected def formatVersion = "1.0" + /** + * Random seed for cluster initialization. + */ + @Since("2.4.0") + def getSeed: Long = seed + + /** + * Set the random seed for cluster initialization. + */ + @Since("2.4.0") + def setSeed(seed: Long): this.type = { +this.seed = seed +this + } + @Since("1.5.0") override def save(sc: SparkContext, path: String): Unit = { LocalLDAModel.SaveLoadV1_0.save(sc, path, topicsMatrix, docConcentration, topicConcentration, @@ -298,6 +315,7 @@ class LocalLDAModel private[spark] ( // by topic (columns of lambda) val Elogbeta = LDAUtils.dirichletExpectation(lambda.t).t val ElogbetaBc = documents.sparkContext.broadcast(Elogbeta) +val
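For context, the user-visible effect of threading the seed through variationalTopicInference is that topic inference becomes repeatable. A minimal sketch of that effect via the spark.ml API, assuming an active SparkSession `spark`; the toy data and column names are illustrative and not taken from the patch:

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.linalg.Vectors

// Tiny term-count "corpus"; "features" is the spark.ml default features column name.
val corpus = spark.createDataFrame(Seq(
  (0L, Vectors.dense(1.0, 0.0, 3.0)),
  (1L, Vectors.dense(0.0, 2.0, 1.0)),
  (2L, Vectors.dense(4.0, 0.0, 0.0))
)).toDF("id", "features")

val model = new LDA().setK(2).setMaxIter(5).setSeed(42L).fit(corpus)
// With the seed now reused by variationalTopicInference, repeated transforms agree.
val first = model.transform(corpus).select("topicDistribution").collect()
val second = model.transform(corpus).select("topicDistribution").collect()
assert(first.toSeq == second.toSeq)
```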
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21183 Thanks for checking this manually. Since the test sometimes fails, let's leave it. LGTM. Merging with master. Thanks @ludatabricks and @mengxr! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API
Repository: spark Updated Branches: refs/heads/master 6b94420f6 -> 8a13c5096 [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API ## What changes were proposed in this pull request? See SPARK-23455 for reference. Now default params in ML are saved separately in metadata file in Scala. We must change it for Python for Spark 2.4.0 as well in order to keep them in sync. ## How was this patch tested? Added test. Author: Liang-Chi HsiehCloses #21153 from viirya/SPARK-24058. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a13c509 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a13c509 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a13c509 Branch: refs/heads/master Commit: 8a13c5096898f95d1dfcedaf5d31205a1cbf0a19 Parents: 6b94420 Author: Liang-Chi Hsieh Authored: Tue May 15 16:50:09 2018 -0700 Committer: Joseph K. Bradley Committed: Tue May 15 16:50:09 2018 -0700 -- python/pyspark/ml/tests.py | 38 ++ python/pyspark/ml/util.py | 30 -- 2 files changed, 66 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8a13c509/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 0935931..0dde0db 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1595,6 +1595,44 @@ class PersistenceTest(SparkSessionTestCase): self.assertEqual(lr.uid, lr3.uid) self.assertEqual(lr.extractParamMap(), lr3.extractParamMap()) +def test_default_read_write_default_params(self): +lr = LogisticRegression() +self.assertFalse(lr.isSet(lr.getParam("threshold"))) + +lr.setMaxIter(50) +lr.setThreshold(.75) + +# `threshold` is set by user, default param `predictionCol` is not set by user. +self.assertTrue(lr.isSet(lr.getParam("threshold"))) +self.assertFalse(lr.isSet(lr.getParam("predictionCol"))) +self.assertTrue(lr.hasDefault(lr.getParam("predictionCol"))) + +writer = DefaultParamsWriter(lr) +metadata = json.loads(writer._get_metadata_to_save(lr, self.sc)) +self.assertTrue("defaultParamMap" in metadata) + +reader = DefaultParamsReadable.read() +metadataStr = json.dumps(metadata, separators=[',', ':']) +loadedMetadata = reader._parseMetaData(metadataStr, ) +reader.getAndSetParams(lr, loadedMetadata) + +self.assertTrue(lr.isSet(lr.getParam("threshold"))) +self.assertFalse(lr.isSet(lr.getParam("predictionCol"))) +self.assertTrue(lr.hasDefault(lr.getParam("predictionCol"))) + +# manually create metadata without `defaultParamMap` section. +del metadata['defaultParamMap'] +metadataStr = json.dumps(metadata, separators=[',', ':']) +loadedMetadata = reader._parseMetaData(metadataStr, ) +with self.assertRaisesRegexp(AssertionError, "`defaultParamMap` section not found"): +reader.getAndSetParams(lr, loadedMetadata) + +# Prior to 2.4.0, metadata doesn't have `defaultParamMap`. 
+metadata['sparkVersion'] = '2.3.0' +metadataStr = json.dumps(metadata, separators=[',', ':']) +loadedMetadata = reader._parseMetaData(metadataStr, ) +reader.getAndSetParams(lr, loadedMetadata) + class LDATest(SparkSessionTestCase): http://git-wip-us.apache.org/repos/asf/spark/blob/8a13c509/python/pyspark/ml/util.py -- diff --git a/python/pyspark/ml/util.py b/python/pyspark/ml/util.py index a486c6a..9fa8566 100644 --- a/python/pyspark/ml/util.py +++ b/python/pyspark/ml/util.py @@ -30,6 +30,7 @@ if sys.version > '3': from pyspark import SparkContext, since from pyspark.ml.common import inherit_doc from pyspark.sql import SparkSession +from pyspark.util import VersionUtils def _jvm(): @@ -396,6 +397,7 @@ class DefaultParamsWriter(MLWriter): - sparkVersion - uid - paramMap +- defaultParamMap (since 2.4.0) - (optionally, extra metadata) :param extraMetadata: Extra metadata to be saved at same level as uid, paramMap, etc. :param paramMap: If given, this is saved in the "paramMap" field. @@ -417,15 +419,24 @@ class DefaultParamsWriter(MLWriter): """ uid = instance.uid cls = instance.__module__ + '.' + instance.__class__.__name__ -params = instance.extractParamMap() + +# User-supplied param values +params = instance._paramMap jsonParams = {} if paramMap
[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21153 OK thanks @viirya ! Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r187129517 --- Diff: python/pyspark/ml/util.py --- @@ -396,6 +397,7 @@ def saveMetadata(instance, path, sc, extraMetadata=None, paramMap=None): - sparkVersion - uid - paramMap +- defalutParamMap (since 2.4.0) --- End diff -- typo: default --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r188081089 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging { None } -val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex --- End diff -- fix scala style: ``` val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex { (index, docs) => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) val logphatPartOption = logphatPartOptionBase() var nonEmptyDocCount: Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k, seed + index) stat(::, ids) := stat(::, ids) + sstats logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } Iterator((stat, logphatPartOption, nonEmptyDocCount)) } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21153 Would you mind rebasing this off of the upstream master branch? I'm having trouble running the tests for this PR locally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21274: [SPARK-24213][ML] Fix for Int id type for PowerIt...
Github user jkbradley closed the pull request at: https://github.com/apache/spark/pull/21274 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21274: [SPARK-24213][ML] Fix for Int id type for PowerIteration...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21274 This issue actually brings up a problem with the Transformer approach for PIC. Just commented more here: https://issues.apache.org/jira/browse/SPARK-15784 Thank you for pushing back! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21119 I think we messed up with the original PIC API. Could you please check out my comment here https://issues.apache.org/jira/browse/SPARK-15784 ? If others agree, I'll revert the Scala API and we can work on adding a modified version. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r187217859 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -622,11 +623,11 @@ object LocalLDAModel extends MLReadable[LocalLDAModel] { val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration") val matrixConverted = MLUtils.convertMatrixColumnsToML(vectorConverted, "topicsMatrix") val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector, - topicConcentration: Double, gammaShape: Double) = + topicConcentration: Double, gammaShape: Double, seed: Long) = --- End diff -- This will break backwards compatibility of ML persistence (when users try to load LDAModels saved using past versions of Spark). Could you please test this manually by saving a LocalLDAModel using Spark 2.3 and loading it with a build of your PR? You can fix this by checking for the Spark version (in the `metadata`) and loading the seed for Spark >= 2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
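A sketch of the version gate being suggested here, written as a hypothetical standalone helper; the "seed" field name, the argument names, and the Option return are illustrative, and the final patch may wire this differently inside the reader:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.util.VersionUtils

// Return the persisted seed only for models written by Spark 2.4+, where the field exists;
// older model files simply keep the default (random) seed.
def loadedSeed(writerSparkVersion: String, dataHead: Row): Option[Long] = {
  val (major, minor) = VersionUtils.majorMinorVersion(writerSparkVersion)
  if (major > 2 || (major == 2 && minor >= 4)) Some(dataHead.getAs[Long]("seed")) else None
}
```

The reader would then call `setSeed` on the rebuilt mllib model only when a seed was found, so files saved by Spark 2.3 keep loading unchanged.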
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r187216371 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala --- @@ -252,6 +252,15 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead val lda = new LDA() testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings, LDASuite.allParamSettings, checkModelData) + +def checkModelDataWithDataset(model: LDAModel, model2: LDAModel, --- End diff -- style: Please fix this to match other multi-line method headers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21265 Let's wait on this until we make the decision in the last thread in https://github.com/apache/spark/pull/20973 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r187216080 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- I agree in general, but I don’t think it’s a big deal for PrefixSpan. I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions. This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan. 
Once we have a Model, I recommend we deprecate the current static method. I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on. I'd suggest "PrefixSpanBuilder." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
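Purely for illustration, a setter-based wrapper along the lines floated above might look like the following sketch; the `PrefixSpanBuilder` name, its setters, and the defaults are hypothetical (nothing like this exists in Spark at this point) and simply delegate to the existing static helper:

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical builder-style alternative to the static method; names and defaults are illustrative.
class PrefixSpanBuilder {
  private var sequenceCol: String = "sequence"
  private var minSupport: Double = 0.1
  private var maxPatternLength: Int = 10
  private var maxLocalProjDBSize: Long = 32000000L

  def setSequenceCol(value: String): this.type = { sequenceCol = value; this }
  def setMinSupport(value: Double): this.type = { minSupport = value; this }
  def setMaxPatternLength(value: Int): this.type = { maxPatternLength = value; this }
  def setMaxLocalProjDBSize(value: Long): this.type = { maxLocalProjDBSize = value; this }

  // Delegating keeps the door open to swapping in an Estimator/Model pair later.
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame =
    PrefixSpan.findFrequentSequentialPatterns(
      dataset, sequenceCol, minSupport, maxPatternLength, maxLocalProjDBSize)
}
```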
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r187187330 --- Diff: python/pyspark/ml/fpm.py --- @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items", def _create_model(self, java_model): return FPGrowthModel(java_model) + + +class PrefixSpan(object): +""" +.. note:: Experimental + +A parallel PrefixSpan algorithm to mine frequent sequential patterns. +The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns +Efficiently by Prefix-Projected Pattern Growth +(see http://doi.org/10.1109/ICDE.2001.914830;>here). + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def findFrequentSequentialPatterns(dataset, + sequenceCol, + minSupport, + maxPatternLength, + maxLocalProjDBSize): +""" +.. note:: Experimental +Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + +:param dataset: A dataset or a dataframe containing a sequence column which is +`Seq[Seq[_]]` type. +:param sequenceCol: The name of the sequence column in dataset, rows with nulls in this +column are ignored. +:param minSupport: The minimal support level of the sequential pattern, any pattern that + appears more than (minSupport * size-of-the-dataset) times will be + output (recommended value: `0.1`). +:param maxPatternLength: The maximal length of the sequential pattern + (recommended value: `10`). +:param maxLocalProjDBSize: The maximum number of items (including delimiters used in the + internal storage format) allowed in a projected database before + local processing. If a projected database exceeds this size, + another iteration of distributed prefix growth is run + (recommended value: `3200`). +:return: A `DataFrame` that contains columns of sequence and corresponding frequency. + The schema of it will be: + - `sequence: Seq[Seq[T]]` (T is the item type) + - `freq: Long` + +>>> from pyspark.ml.fpm import PrefixSpan +>>> from pyspark.sql import Row +>>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]), --- End diff -- My 2 cents: That sounds like a judgement call: If it's to explain the behavior more clearly, then that sounds reasonable. I feel like it's pretty clear how nulls are treated from the doc. maxPatternLength might benefit from an example. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21203: [SPARK-24131][PySpark] Add majorMinorVersion API to PySp...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21203 It's OK but would you mind fixing it @viirya before we use it in https://github.com/apache/spark/pull/21153 ? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs
Repository: spark Updated Branches: refs/heads/master 628c7b517 -> 7aaa148f5 [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs ## What changes were proposed in this pull request? Provide evaluateEachIteration method or equivalent for spark.ml GBTs. ## How was this patch tested? UT. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: WeichenXuCloses #21097 from WeichenXu123/GBTeval. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7aaa148f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7aaa148f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7aaa148f Branch: refs/heads/master Commit: 7aaa148f593470b2c32221b69097b8b54524eb74 Parents: 628c7b5 Author: WeichenXu Authored: Wed May 9 11:09:19 2018 -0700 Committer: Joseph K. Bradley Committed: Wed May 9 11:09:19 2018 -0700 -- .../spark/ml/classification/GBTClassifier.scala | 15 + .../spark/ml/regression/GBTRegressor.scala | 17 ++- .../org/apache/spark/ml/tree/treeParams.scala | 6 +++- .../ml/classification/GBTClassifierSuite.scala | 29 +- .../spark/ml/regression/GBTRegressorSuite.scala | 32 ++-- 5 files changed, 94 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 0aa24f0..3fb6d1e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala @@ -334,6 +334,21 @@ class GBTClassificationModel private[ml]( // hard coded loss, which is not meant to be changed in the model private val loss = getOldLossType + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. 
+ */ + @Since("2.4.0") + def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = { +val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { + case Row(label: Double, features: Vector) => LabeledPoint(label, features) +} +GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, loss, + OldAlgo.Classification +) + } + @Since("2.0.0") override def write: MLWriter = new GBTClassificationModel.GBTClassificationModelWriter(this) } http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala index 8598e80..d7e054b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala @@ -34,7 +34,7 @@ import org.apache.spark.ml.util.DefaultParamsReader.Metadata import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} import org.apache.spark.rdd.RDD -import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ /** @@ -269,6 +269,21 @@ class GBTRegressionModel private[ml]( new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights) } + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. + * @param loss The loss function used to compute error. Supported options: squared, absolute + */ + @Since("2.4.0") + def evaluateEachIteration(dataset: Dataset[_], loss: String): Array[Double] = { +val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { + case Row(label: Double, features: Vector) => LabeledPoint(label, features) +} +GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, + convertToOldLossType(loss), OldAlgo.Regression) + } + @Since("2.0.0") override def write: MLWriter = new GBTRegressionModel.GBTRegressionModelWriter(this) } http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
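A small usage sketch of the new method, assuming an active SparkSession `spark` and toy data; in practice a held-out validation DataFrame would be passed rather than the training set:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors

val train = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(0.1, 0.9)),
  (1.0, Vectors.dense(0.9, 0.2))
)).toDF("label", "features")

val model = new GBTClassifier().setMaxIter(10).setMaxDepth(2).fit(train)
// One loss value per boosting iteration; useful for picking a smaller ensemble after training.
val lossPerIteration: Array[Double] = model.evaluateEachIteration(train)
val bestNumTrees = lossPerIteration.zipWithIndex.minBy(_._1)._2 + 1
```

For `GBTRegressionModel`, per the diff above, the loss ("squared" or "absolute") is passed explicitly as a second argument.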
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21097 LGTM. Merging with master. Thanks @WeichenXu123! Would you mind creating & linking a JIRA for the Python API update? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21274: [SPARK-24213][ML] Fix for Int id type for PowerIteration...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21274 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [MINOR][ML][DOC] Improved Naive Bayes user guide explanation
Repository: spark Updated Branches: refs/heads/master 6ea582e36 -> 94155d039 [MINOR][ML][DOC] Improved Naive Bayes user guide explanation ## What changes were proposed in this pull request? This copies the material from the spark.mllib user guide page for Naive Bayes to the spark.ml user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. Author: Joseph K. Bradley <jos...@databricks.com> Closes #21272 from jkbradley/nb-doc-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94155d03 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94155d03 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94155d03 Branch: refs/heads/master Commit: 94155d0395324a012db2fc8a57edb3cd90b61e96 Parents: 6ea582e Author: Joseph K. Bradley <jos...@databricks.com> Authored: Wed May 9 10:34:57 2018 -0700 Committer: Joseph K. Bradley <jos...@databricks.com> Committed: Wed May 9 10:34:57 2018 -0700 -- docs/ml-classification-regression.md | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/94155d03/docs/ml-classification-regression.md -- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index d660655..b3d1090 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -455,11 +455,29 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat ## Naive Bayes [Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple -probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence -assumptions between the features. The `spark.ml` implementation currently supports both [multinomial -naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html) +probabilistic, multiclass classifiers based on applying Bayes' theorem with strong (naive) independence +assumptions between every pair of features. + +Naive Bayes can be trained very efficiently. With a single pass over the training data, +it computes the conditional probability distribution of each feature given each label. +For prediction, it applies Bayes' theorem to compute the conditional probability distribution +of each label given an observation. + +MLlib supports both [multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes) and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html). -More information can be found in the section on [Naive Bayes in MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib). + +*Input data*: +These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html). +Within that context, each observation is a document and each feature represents a term. +A feature's value is the frequency of the term (in multinomial Naive Bayes) or +a zero or one indicating whether the term was found in the document (in Bernoulli Naive Bayes). +Feature values must be *non-negative*. The model type is selected with an optional parameter +"multinomial" or "bernoulli" with "multinomial" as the default. +For document classification, the input feature vectors should usually be sparse vectors. +Since the training data is only used once, it is not necessary to cache it. 
+ +[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by +setting the parameter $\lambda$ (default to $1.0$). **Examples** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
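To make the guide text above concrete, here is a minimal spark.ml sketch (toy term-count features, assuming an active SparkSession `spark`; this is not the example snippet the guide itself links to):

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors

// Term-frequency features; values must be non-negative, as the guide notes.
val docs = spark.createDataFrame(Seq(
  (0.0, Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0))),
  (1.0, Vectors.sparse(4, Array(1, 2), Array(3.0, 1.0))),
  (0.0, Vectors.sparse(4, Array(0, 2), Array(2.0, 1.0)))
)).toDF("label", "features")

val nb = new NaiveBayes()
  .setModelType("multinomial") // "bernoulli" expects 0/1 term-presence features instead
  .setSmoothing(1.0)           // additive (Lidstone) smoothing parameter lambda
val model = nb.fit(docs)
model.transform(docs).select("label", "prediction", "probability").show()
```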
[GitHub] spark issue #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide explana...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21272 Thanks for the LGTM! Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187114172 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, --- End diff -- Let's add the more precise description from the old BoostingStrategy for this Param. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187113953 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, - * then learning will stop early (before [[numIterations]]). - * This parameter is ignored when run is used. + * then learning will stop early (before [[maxIter]]). + * This parameter is ignored when fit without validation is used. * (default = 1e-5) * @group param --- End diff -- Let's add a `@see` validationIndicatorCol --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187112582 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, - * then learning will stop early (before [[numIterations]]). - * This parameter is ignored when run is used. + * then learning will stop early (before [[maxIter]]). + * This parameter is ignored when fit without validation is used. * (default = 1e-5) --- End diff -- I forget why we chose 1e-5 (which is different from spark.mllib). What do you think about using 0.01 to match the sklearn docs here? http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html (I also checked xgboost, but they use a different approach based on x number of steps without improvement. We may want to add that at some point since it sounds more robust.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21218#discussion_r187115704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -278,6 +279,7 @@ class BisectingKMeans @Since("2.0.0") ( val summary = new BisectingKMeansSummary( model.transform(dataset), $(predictionCol), $(featuresCol), $(k)) model.setSummary(Some(summary)) +instr.logNamedValue("clusterSizes", summary.clusterSizes.mkString(", ")) --- End diff -- We log Params in JSON format, but we don't have a precedent for logging Array values. I'd be OK with ducking the issue by not logging this value, or with logging it as JSON. If the latter, then we could add a method Instrumentation.logNamedValue taking Array types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
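If we go the JSON route, a hedged sketch of what such an Array overload could look like; this is not the existing Instrumentation API, and logInfo is replaced with println so the snippet stands alone:

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical Array overload: encode the values as a JSON array, matching how Params are logged.
def logNamedValue(name: String, value: Array[Long]): Unit = {
  val json = compact(render(name -> value.toSeq)) // e.g. {"clusterSizes":[3,5,2]}
  println(json)
}

logNamedValue("clusterSizes", Array(3L, 5L, 2L))
```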
[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21270 Thanks for the patch! I just commented on https://issues.apache.org/jira/browse/SPARK-24213 though and would like to replace this with https://github.com/apache/spark/pull/21274. Could you please close this issue and help with reviewing the other PR? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21274: [SPARK-24213][ML] Fix for Int id type for PowerIt...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21274 [SPARK-24213][ML] Fix for Int id type for PowerIterationClustering in spark.ml ## What changes were proposed in this pull request? PIC in spark.ml has tests for "id" type IntegerType, but those tests did not fully check the result. This patch: * fixes the unit test to break for the existing implementation * fixes the implementation ## How was this patch tested? Existing unit tests, with fixes You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark SPARK-24213 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21274.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21274 commit cd02df5a995ac7daa2e76dcc13c54b987265b6e4 Author: Joseph K. Bradley <joseph@...> Date: 2018-05-08T22:02:33Z fix for Int id type for PowerIterationClustering in spark.ml --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186865863 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -367,11 +367,31 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { test("model evaluateEachIteration") { val gbt = new GBTClassifier() + .setSeed(1L) .setMaxDepth(2) - .setMaxIter(2) -val model = gbt.fit(trainData.toDF) -val eval = model.evaluateEachIteration(validationData.toDF) -assert(Vectors.dense(eval) ~== Vectors.dense(1.7641, 1.8209) relTol 1E-3) + .setMaxIter(3) + .setLossType("logistic") +val model3 = gbt.fit(trainData.toDF) +val model1 = new GBTClassificationModel("gbt-cls-model-test1", + model3.trees.take(1), model3.treeWeights.take(1), model3.numFeatures, model3.numClasses) +val model2 = new GBTClassificationModel("gbt-cls-model-test2", + model3.trees.take(2), model3.treeWeights.take(2), model3.numFeatures, model3.numClasses) + +for (evalLossType <- GBTClassifier.supportedLossTypes) { --- End diff -- evalLossType is not used, so I'd remove this loop. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide explana...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21272 ![screen shot 2018-05-08 at 2 03 13 pm](https://user-images.githubusercontent.com/5084283/39783013-a1650846-52c8-11e8-8f15-42b93dd51168.png) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide ...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21272 [MINOR][ML][DOC] Improved Naive Bayes user guide explanation ## What changes were proposed in this pull request? This copies the material from the spark.mllib user guide page for Naive Bayes to the spark.ml user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark nb-doc-update Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21272.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21272 commit 987448d58e58748d8f52b9b6b2036964e8111a15 Author: Joseph K. Bradley <joseph@...> Date: 2018-05-08T20:38:37Z improved Naive Bayes user guide --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572477 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -497,6 +498,9 @@ private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasS @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setStepSize(value: Double): this.type = set(stepSize, value) + /** @group setParam */ + def setValidationIndicatorCol(value: String): this.type = set(validationIndicatorCol, value) --- End diff -- Since we need this setter to be in the concrete classes for Java compatibility, don't bother putting it here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572374 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { assert(mostImportantFeature !== mostIF) } + test("runWithValidation stops early and performs better on a validation dataset") { +val validationIndicatorCol = "validationIndicator" +val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false)) +val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true)) + +val numIter = 20 +for (lossType <- GBTClassifier.supportedLossTypes) { + val gbt = new GBTClassifier() +.setSeed(123) +.setMaxDepth(2) +.setLossType(lossType) +.setMaxIter(numIter) + val modelWithoutValidation = gbt.fit(trainDF) + + gbt.setValidationIndicatorCol(validationIndicatorCol) + val modelWithValidation = gbt.fit(trainDF.union(validationDF)) + + // early stop + assert(modelWithValidation.numTrees < numIter) + + val (errorWithoutValidation, errorWithValidation) = { +val remappedRdd = validationData.map(x => new LabeledPoint(2 * x.label - 1, x.features)) +(GradientBoostedTrees.computeError(remappedRdd, modelWithoutValidation.trees, + modelWithoutValidation.treeWeights, modelWithoutValidation.getOldLossType), + GradientBoostedTrees.computeError(remappedRdd, modelWithValidation.trees, +modelWithValidation.treeWeights, modelWithValidation.getOldLossType)) + } + assert(errorWithValidation <= errorWithoutValidation) --- End diff -- It'd be nice to have this be strictly true. Is it not? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186573136 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") ( @Since("1.4.0") def setLossType(value: String): this.type = set(lossType, value) + /** @group setParam */ + @Since("2.4.0") + override def setValidationIndicatorCol(value: String): this.type = { +set(validationIndicatorCol, value) + } + override protected def train(dataset: Dataset[_]): GBTClassificationModel = { val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) + +val withValidation = isDefined(validationIndicatorCol) + // We copy and modify this from Classifier.extractLabeledPoints since GBT only supports // 2 classes now. This lets us provide a more precise error message. -val oldDataset: RDD[LabeledPoint] = +val convert2LabelPoint = (dataset: Dataset[_]) => { --- End diff -- nit: "LabelPoint" -> "LabeledPoint" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186569928 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { assert(mostImportantFeature !== mostIF) } + test("runWithValidation stops early and performs better on a validation dataset") { +val validationIndicatorCol = "validationIndicator" +val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false)) +val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true)) + +val numIter = 20 +for (lossType <- GBTClassifier.supportedLossTypes) { + val gbt = new GBTClassifier() +.setSeed(123) +.setMaxDepth(2) +.setLossType(lossType) +.setMaxIter(numIter) + val modelWithoutValidation = gbt.fit(trainDF) + + gbt.setValidationIndicatorCol(validationIndicatorCol) + val modelWithValidation = gbt.fit(trainDF.union(validationDF)) + + // early stop + assert(modelWithValidation.numTrees < numIter) --- End diff -- Let's also assert `modelWithoutValidation.numTrees == numIter`. That's true now, but I could imagine it changing later on if we add a convergence tolerance to the algorithm. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572853 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") ( @Since("1.4.0") def setLossType(value: String): this.type = set(lossType, value) + /** @group setParam */ + @Since("2.4.0") + override def setValidationIndicatorCol(value: String): this.type = { +set(validationIndicatorCol, value) + } + override protected def train(dataset: Dataset[_]): GBTClassificationModel = { val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) + +val withValidation = isDefined(validationIndicatorCol) --- End diff -- Let's do: `isDefined(validationIndicatorCol) and $(validationIndicatorCol).nonEmpty` (so we count "" as not defined). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
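For reference, end-user usage of the new Param looks roughly like this sketch; the `train`/`valid` DataFrames with "label"/"features" columns and the "isVal" column name are assumptions, not part of the PR:

```scala
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def fitWithEarlyStopping(train: DataFrame, valid: DataFrame): GBTClassificationModel = {
  // Rows flagged true in the indicator column form the validation set used for early stopping;
  // with early stopping the returned model's numTrees can be smaller than maxIter.
  val tagged = train.withColumn("isVal", lit(false)).union(valid.withColumn("isVal", lit(true)))
  new GBTClassifier()
    .setMaxIter(100)
    .setValidationIndicatorCol("isVal")
    .fit(tagged)
}
```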
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186565756 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -269,6 +269,20 @@ class GBTRegressionModel private[ml]( new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights) } + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. --- End diff -- Add doc for "loss" arg, including what the options are --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan
Repository: spark Updated Branches: refs/heads/master f48bd6bdc -> 76ecd0950 [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan ## What changes were proposed in this pull request? PrefixSpan API for spark.ml. New implementation instead of #20810 ## How was this patch tested? TestSuite added. Author: WeichenXuCloses #20973 from WeichenXu123/prefixSpan2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76ecd095 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76ecd095 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76ecd095 Branch: refs/heads/master Commit: 76ecd095024a658bf68e5db658e4416565b30c17 Parents: f48bd6b Author: WeichenXu Authored: Mon May 7 14:57:14 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 7 14:57:14 2018 -0700 -- .../org/apache/spark/ml/fpm/PrefixSpan.scala| 96 + .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 3 +- .../apache/spark/ml/fpm/PrefixSpanSuite.scala | 136 +++ 3 files changed, 233 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76ecd095/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala new file mode 100644 index 000..02168fe --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. 
+ * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset:
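A minimal usage sketch of the static API added by this commit, assuming an active SparkSession `spark`; the toy sequences mirror the Python doctest elsewhere in this thread, and the maxLocalProjDBSize value is just an illustrative choice:

```scala
import org.apache.spark.ml.fpm.PrefixSpan

val sequences = spark.createDataFrame(Seq(
  Tuple1(Seq(Seq(1, 2), Seq(3))),
  Tuple1(Seq(Seq(1), Seq(3, 2), Seq(1, 2))),
  Tuple1(Seq(Seq(1, 2), Seq(5))),
  Tuple1(Seq(Seq(6)))
)).toDF("sequence")

val patterns = PrefixSpan.findFrequentSequentialPatterns(
  sequences, "sequence", minSupport = 0.5, maxPatternLength = 5, maxLocalProjDBSize = 32000000L)
patterns.show() // columns: sequence, freq
```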
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20973 Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20261: [SPARK-22885][ML][TEST] ML test for StructuredStreaming:...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20261 Merging with master. Thanks @WeichenXu123! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning
Repository: spark Updated Branches: refs/heads/master 1c9c5de95 -> f48bd6bdc [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning ## What changes were proposed in this pull request? ML test for StructuredStreaming: spark.ml.tuning ## How was this patch tested? N/A Author: WeichenXuCloses #20261 from WeichenXu123/ml_stream_tuning_test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f48bd6bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f48bd6bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f48bd6bd Branch: refs/heads/master Commit: f48bd6bdc5aefd9ec43e2d0ee648d17add7ef554 Parents: 1c9c5de Author: WeichenXu Authored: Mon May 7 14:55:41 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 7 14:55:41 2018 -0700 -- .../apache/spark/ml/tuning/CrossValidatorSuite.scala | 15 +++ .../spark/ml/tuning/TrainValidationSplitSuite.scala | 15 +++ 2 files changed, 22 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f48bd6bd/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala index 15dade2..e6ee722 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala @@ -25,17 +25,17 @@ import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressio import org.apache.spark.ml.classification.LogisticRegressionSuite.generateLogisticInput import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, Evaluator, MulticlassClassificationEvaluator, RegressionEvaluator} import org.apache.spark.ml.feature.HashingTF -import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap import org.apache.spark.ml.param.shared.HasInputCol import org.apache.spark.ml.regression.LinearRegression -import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils} -import org.apache.spark.mllib.util.{LinearDataGenerator, MLlibTestSparkContext} +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils} +import org.apache.spark.mllib.util.LinearDataGenerator import org.apache.spark.sql.Dataset import org.apache.spark.sql.types.StructType class CrossValidatorSuite - extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + extends MLTest with DefaultReadWriteTest { import testImplicits._ @@ -66,6 +66,13 @@ class CrossValidatorSuite assert(parent.getRegParam === 0.001) assert(parent.getMaxIter === 10) assert(cvModel.avgMetrics.length === lrParamMaps.length) + +val result = cvModel.transform(dataset).select("prediction").as[Double].collect() +testTransformerByGlobalCheckFunc[(Double, Vector)](dataset.toDF(), cvModel, "prediction") { + rows => +val result2 = rows.map(_.getDouble(0)) +assert(result === result2) +} } test("cross validation with linear regression") { http://git-wip-us.apache.org/repos/asf/spark/blob/f48bd6bd/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala index 9024342..cd76acf 100644 --- 
a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala @@ -24,17 +24,17 @@ import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest} import org.apache.spark.ml.classification.LogisticRegressionSuite.generateLogisticInput import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, Evaluator, RegressionEvaluator} -import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap import org.apache.spark.ml.param.shared.HasInputCol import org.apache.spark.ml.regression.LinearRegression -import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils} -import org.apache.spark.mllib.util.{LinearDataGenerator, MLlibTestSparkContext} +import org.apache.spark.ml.util.{DefaultReadWriteTest,
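The pattern this commit adds is a batch-versus-streaming equivalence check: score the fitted CrossValidatorModel on the same rows once as a static DataFrame and once through a streaming source, then assert the predictions agree. Below is a self-contained sketch of that check written against public-ish APIs instead of the internal MLTest/testTransformerByGlobalCheckFunc helper the suite uses; the object name and the toy dataset are made up, and MemoryStream lives in Spark's internal org.apache.spark.sql.execution.streaming package, so treat this purely as a test-style sketch.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

object CrossValidatorStreamingCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cv-streaming-check").getOrCreate()
    import spark.implicits._

    // Tiny regression dataset, used both as a batch DataFrame and as streaming input.
    val rows: Seq[(Vector, Double)] = (0 until 20).map { i =>
      (Vectors.dense(i.toDouble, (i % 3).toDouble), 2.0 * i + 1.0)
    }
    val batchDF = rows.toDF("features", "label")

    // Fit a CrossValidatorModel, roughly as in the suite's linear regression test.
    val lr = new LinearRegression().setMaxIter(10)
    val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.001, 0.1)).build()
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new RegressionEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(2)
    val cvModel = cv.fit(batchDF)

    // Predictions from the batch path.
    val batchPreds = cvModel.transform(batchDF).select("prediction").as[Double].collect()

    // Same rows pushed through a streaming source and scored by the same fitted model.
    implicit val sqlContext = spark.sqlContext
    val stream = MemoryStream[(Vector, Double)]
    stream.addData(rows)
    val query = cvModel.transform(stream.toDF().toDF("features", "label"))
      .select("prediction")
      .writeStream.format("memory").queryName("cv_preds").outputMode("append").start()
    query.processAllAvailable()
    val streamPreds = spark.table("cv_preds").as[Double].collect()

    // The streaming path must reproduce the batch predictions.
    assert(batchPreds.sorted.sameElements(streamPreds.sorted))
    query.stop()
    spark.stop()
  }
}
```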
spark git commit: [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark
Repository: spark Updated Branches: refs/heads/master d83e96372 -> 56a52e0a5 [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark ## What changes were proposed in this pull request? Change FPGrowth from private to private[spark]. If no numPartitions is specified, then the default value -1 is used. But -1 is only valid in the constructor of FPGrowth, not in setNumPartitions. So I make this change and use the constructor directly rather than using the set method. ## How was this patch tested? Unit test is added. Author: Jeff Zhang Closes #13493 from zjffdu/SPARK-15750. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56a52e0a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56a52e0a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56a52e0a Branch: refs/heads/master Commit: 56a52e0a58fc82ea69e47d0d8c4f905565be7c8b Parents: d83e963 Author: Jeff Zhang Authored: Mon May 7 14:47:58 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 7 14:47:58 2018 -0700 -- .../apache/spark/mllib/api/python/PythonMLLibAPI.scala | 5 + .../scala/org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- python/pyspark/mllib/tests.py | 12 3 files changed, 14 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56a52e0a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index b32d3f2..db3f074 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -572,10 +572,7 @@ private[python] class PythonMLLibAPI extends Serializable { data: JavaRDD[java.lang.Iterable[Any]], minSupport: Double, numPartitions: Int): FPGrowthModel[Any] = { -val fpg = new FPGrowth() - .setMinSupport(minSupport) - .setNumPartitions(numPartitions) - +val fpg = new FPGrowth(minSupport, numPartitions) val model = fpg.run(data.rdd.map(_.asScala.toArray)) new FPGrowthModelWrapper(model) } http://git-wip-us.apache.org/repos/asf/spark/blob/56a52e0a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala index f6b1143..4f2b7e6 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala @@ -162,7 +162,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] { * */ @Since("1.3.0") -class FPGrowth private ( +class FPGrowth private[spark] ( private var minSupport: Double, private var numPartitions: Int) extends Logging with Serializable { http://git-wip-us.apache.org/repos/asf/spark/blob/56a52e0a/python/pyspark/mllib/tests.py -- diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py index 14d788b..4c2ce13 100644 --- a/python/pyspark/mllib/tests.py +++ b/python/pyspark/mllib/tests.py @@ -57,6 +57,7 @@ from pyspark.mllib.linalg import Vector, SparseVector, DenseVector, VectorUDT, _ DenseMatrix, SparseMatrix, Vectors, Matrices, MatrixUDT from pyspark.mllib.linalg.distributed import RowMatrix from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD +from pyspark.mllib.fpm import FPGrowth from pyspark.mllib.recommendation 
import Rating from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD from pyspark.mllib.random import RandomRDDs @@ -1762,6 +1763,17 @@ class DimensionalityReductionTests(MLlibTestCase): self.assertEqualUpToSign(pcs.toArray()[:, k - 1], expected_pcs[:, k - 1]) +class FPGrowthTest(MLlibTestCase): + +def test_fpgrowth(self): +data = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]] +rdd = self.sc.parallelize(data, 2) +model1 = FPGrowth.train(rdd, 0.6, 2) +# use default data partition number when numPartitions is not specified +model2 = FPGrowth.train(rdd, 0.6) +self.assertEqual(sorted(model1.freqItemsets().collect()), + sorted(model2.freqItemsets().collect())) + if __name__ == "__main__": from pyspark.mllib.tests
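The root cause is an asymmetry in the mllib FPGrowth API: the default numPartitions is -1 (meaning "reuse the input RDD's partitioning"), which the constructor accepts, while setNumPartitions rejects non-positive values, so the Python wrapper failed whenever it forwarded that default through the setter. Below is a minimal illustrative sketch of the asymmetry and of the before/after call pattern; it uses a simplified stand-in class, not the actual Spark source, and the validation messages are made up.

```scala
// Simplified stand-in for mllib's FPGrowth, only to illustrate the asymmetry described above.
class FPGrowthLike(
    private var minSupport: Double,
    private var numPartitions: Int) {

  def this() = this(0.3, -1) // -1 means "use the input data's partition count"

  def setMinSupport(value: Double): this.type = {
    require(value >= 0.0 && value <= 1.0, s"minSupport must be in [0, 1] but got $value")
    minSupport = value
    this
  }

  def setNumPartitions(value: Int): this.type = {
    // The setter insists on a positive value, so the wrapper's default -1 cannot pass through it.
    require(value > 0, s"numPartitions must be positive but got $value")
    numPartitions = value
    this
  }

  override def toString: String = s"FPGrowthLike(minSupport=$minSupport, numPartitions=$numPartitions)"
}

object FPGrowthLike {
  def main(args: Array[String]): Unit = {
    // Post-fix pattern: call the constructor directly, which tolerates -1.
    println(new FPGrowthLike(0.6, -1))

    // Pre-fix pattern: forwarding the default -1 through the setter throws.
    try {
      new FPGrowthLike().setMinSupport(0.6).setNumPartitions(-1)
    } catch {
      case e: IllegalArgumentException => println(s"setter path failed as expected: ${e.getMessage}")
    }
  }
}
```

In Spark itself the real constructor was private, which is why the patch also widens it to private[spark] so that PythonMLLibAPI can call it directly.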
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/13493 Merging with master. Thanks all! @zjffdu Did you want to backport this to branch-2.3 too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21163: [SPARK-24097][ML] Instruments improvements - Rand...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21163#discussion_r185628460 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -22,12 +22,11 @@ import java.io.IOException import scala.collection.mutable import scala.util.Random -import org.apache.spark.internal.Logging import org.apache.spark.ml.classification.DecisionTreeClassificationModel import org.apache.spark.ml.feature.LabeledPoint import org.apache.spark.ml.regression.DecisionTreeRegressionModel import org.apache.spark.ml.tree._ -import org.apache.spark.ml.util.Instrumentation +import org.apache.spark.ml.util.{Instrumentation, OptionalInstrumentation} --- End diff -- Instrumentation is no longer needed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21163: [SPARK-24097][ML] Instruments improvements - RandomFores...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21163 Regarding logging on executors, are you OK with the proposed plan here? https://issues.apache.org/jira/browse/SPARK-23686 Basically, we'll keep using regular Logging on executors, rather than using Instrumentation on them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
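For context, the split being agreed on is: Instrumentation stays a driver-side, per-fit() construct, while code running inside executor tasks keeps using ordinary logging. The following rough sketch shows that shape; it uses slf4j directly because both Spark's Logging trait and Instrumentation are private[spark], and the object name and toy aggregation are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

// Rough sketch of the driver/executor logging split discussed above. In Spark ML the driver
// side would use Instrumentation (tied to one fit() call); plain slf4j stands in here.
object LoggingSplitSketch {
  private val log = LoggerFactory.getLogger("LoggingSplitSketch")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("logging-split").getOrCreate()
    val data = spark.sparkContext.parallelize((1 to 100).map(_.toDouble), numSlices = 4)

    // Driver side: this is where Instrumentation would record params, numExamples, etc.
    log.info(s"training on ${data.getNumPartitions} partitions")

    val (sum, count) = data.mapPartitionsWithIndex { (index, iter) =>
      // Executor side: regular logging only; Instrumentation objects are never shipped here.
      val taskLog = LoggerFactory.getLogger("LoggingSplitSketch.task")
      val values = iter.toArray
      taskLog.info(s"partition $index processed ${values.length} rows")
      Iterator.single((values.sum, values.length.toLong))
    }.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

    log.info(s"mean over $count examples = ${sum / count}")
    spark.stop()
  }
}
```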
[GitHub] spark issue #20235: [Spark-22887][ML][TESTS][WIP] ML test for StructuredStre...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20235 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/13493 LGTM pending fresh tests. Sorry for the delay! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21185: [SPARK-23894][CORE][SQL] Defensively clear ActiveSession...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21185 There have been several of these R test failures. They may be from flakiness with CRAN; testing locally now (since I didn't see any recent bad commits in R). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21204#discussion_r185325228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -97,9 +97,10 @@ class DecisionTreeClassifier @Since("1.4.0") ( override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { +val instr = Instrumentation.create(this, dataset) val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) -val numClasses: Int = getNumClasses(dataset) +val numClasses: Int = getNumClasses(dataset, instr = OptionalInstrumentation.create(instr)) --- End diff -- Also see what else you can log easily (e.g., numFeatures) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21204#discussion_r185325179 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -97,9 +97,10 @@ class DecisionTreeClassifier @Since("1.4.0") ( override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { +val instr = Instrumentation.create(this, dataset) val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) -val numClasses: Int = getNumClasses(dataset) +val numClasses: Int = getNumClasses(dataset, instr = OptionalInstrumentation.create(instr)) --- End diff -- In cases like this, you can use instr.logNumClasses() instead of relying on logging within the getNumClasses() method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
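Concretely, the suggestion is to leave getNumClasses() with its plain logging and record the value from the driver through the Instrumentation instance instead. A fragment of how train() would look under that suggestion (not compilable standalone: Instrumentation is a private[spark] helper and the surrounding DecisionTreeClassifier members are assumed from the diff above):

```scala
// Fragment only: assumes the rest of DecisionTreeClassifier.train() stays as in the diff above.
override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = {
  val instr = Instrumentation.create(this, dataset)
  val categoricalFeatures: Map[Int, Int] =
    MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))

  // Keep the original getNumClasses(dataset) (plain logging inside it is fine) and log the
  // value through the per-fit Instrumentation here on the driver, per the review comment.
  val numClasses: Int = getNumClasses(dataset)
  instr.logNumClasses(numClasses)
  // Other easily available values (e.g. numFeatures) can be logged the same way.

  // ... rest of train() unchanged ...
}
```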
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21204#discussion_r185324974 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -103,7 +103,10 @@ abstract class Classifier[ * @throws IllegalArgumentException if metadata does not specify numClasses, and the * actual numClasses exceeds maxNumClasses */ - protected def getNumClasses(dataset: Dataset[_], maxNumClasses: Int = 100): Int = { + protected def getNumClasses(dataset: Dataset[_], --- End diff -- Since we don't have Instrumentation readily available, I recommend using the old logging here. (That will also avoid this API-breaking change which is causing the MiMA failure.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20973 Rerunning tests in case the R CRAN failure was from flakiness --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21195 Rerunning tests in case the R CRAN failure was from flakiness --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20261: [SPARK-22885][ML][TEST] ML test for StructuredStreaming:...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20261 LGTM. Will merge after fresh tests. Thanks @WeichenXu123 and @smurakozi! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 @smurakozi Do you have time to update this? I did a full review, though it now has a small merge conflict. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20973 LGTM pending Jenkins tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r185262870 --- Diff: python/pyspark/util.py --- @@ -61,6 +62,26 @@ def _get_argspec(f): return argspec +def majorMinorVersion(version): --- End diff -- Yes please, a few things:
* throw exception when unable to parse
* make it a static method of a class pyspark.util.VersionUtils
* add unit test
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
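For reference, the Scala-side behaviour the Python helper is being asked to mirror (org.apache.spark.util.VersionUtils.majorMinorVersion) boils down to parsing the "major.minor" prefix of a version string and throwing IllegalArgumentException on anything unparseable. A small self-contained sketch of that behaviour follows; the regex and error message are illustrative, not copied from the Spark source.

```scala
// Sketch of the parsing contract the Python VersionUtils port should satisfy.
object VersionUtilsSketch {
  private val versionRegex = """^(\d+)\.(\d+)(\..*)?$""".r

  def majorMinorVersion(version: String): (Int, Int) = version match {
    case versionRegex(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(
        s"Could not find the major and minor version numbers in version string '$version'.")
  }

  def main(args: Array[String]): Unit = {
    assert(majorMinorVersion("2.4.0") == (2, 4))
    assert(majorMinorVersion("2.3.1-SNAPSHOT") == (2, 3))
    // Unparseable input throws, which is what the review asks the Python port to do as well.
    try { majorMinorVersion("not-a-version"); assert(false, "expected an exception") }
    catch { case _: IllegalArgumentException => () }
  }
}
```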
[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21195 add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r185131834 --- Diff: python/pyspark/util.py --- @@ -61,6 +62,26 @@ def _get_argspec(f): return argspec +def majorMinorVersion(version): --- End diff -- Since this affects Spark core, it might be nice to put this in a separate PR. Also, shall we make this match the Scala API? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r185058005 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -44,26 +43,37 @@ object PrefixSpan { * * @param dataset A dataset or a dataframe containing a sequence column which is *{{{Seq[Seq[_]]}}} type - * @param sequenceCol the name of the sequence column in dataset + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored * @param minSupport the minimal support level of the sequential pattern, any pattern that * appears more than (minSupport * size-of-the-dataset) times will be output - * (default: `0.1`). - * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears - * less than maxPatternLength will be output (default: `10`). + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the * internal storage format) allowed in a projected database before * local processing. If a projected database exceeds this size, another - * iteration of distributed prefix growth is run (default: `3200`). - * @return A dataframe that contains columns of sequence and corresponding frequency. + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `frequency: Long` --- End diff -- I had asked for this change to "frequency" from "freq," but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq." It would be best to maintain consistency. Would you mind reverting to "freq?" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
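For context on the consistency argument: the existing spark.ml FPGrowth already exposes its frequent itemsets with an "items" column and a "freq" count column, and that is the naming the new PrefixSpan output ("sequence" plus "freq") is being asked to follow. A small runnable illustration of the existing schema (the transactions and app name are made up):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

// Shows the existing ml FPGrowth output columns ("items", "freq") that the review asks
// the new PrefixSpan output ("sequence", "freq") to stay consistent with.
object FreqColumnNamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("freq-naming").getOrCreate()
    import spark.implicits._

    val transactions = Seq(
      Seq("a", "b", "c"),
      Seq("a", "b"),
      Seq("a", "c")
    ).toDF("items")

    val model = new FPGrowth().setItemsCol("items").setMinSupport(0.5).fit(transactions)
    // Prints a schema with an `items` array column and a `freq` long column.
    model.freqItemsets.printSchema()
    model.freqItemsets.show()

    spark.stop()
  }
}
```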
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r184753197 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging { None } -val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndexInternal --- End diff -- Let's not use mapPartitionsWithIndexInternal; I don't think closure cleaning is expensive enough for us to worry about here. Use mapPartitionsWithIndex instead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
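The underlying pattern in this patch is deterministic per-partition seeding: each task builds its own RNG from the optimizer's seed combined with the partition index, so repeated runs over the same batch reproduce the same Gamma draws. Here is a small sketch of that pattern with mapPartitionsWithIndex, using scala.util.Random as a stand-in for the breeze RandBasis/MersenneTwister machinery in the actual change (object name and toy data are made up):

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object PartitionSeedingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("partition-seeding").getOrCreate()
    val data = spark.sparkContext.parallelize(1 to 8, numSlices = 4)

    // Each partition derives its own deterministic RNG from (base seed, partition index),
    // mirroring how the LDA patch seeds its per-partition Gamma sampler.
    def draw(baseSeed: Long): Array[Double] =
      data.mapPartitionsWithIndex { (index, docs) =>
        val rng = new Random(baseSeed + index)
        docs.map(_ => rng.nextGaussian())
      }.collect()

    // Same seed and same partitioning => identical draws across runs.
    assert(draw(42L).sameElements(draw(42L)))
    spark.stop()
  }
}
```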
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r185049467 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -605,14 +609,16 @@ private[clustering] object OnlineLDAOptimizer { expElogbeta: BDM[Double], alpha: breeze.linalg.Vector[Double], gammaShape: Double, - k: Int): (BDV[Double], BDM[Double], List[Int]) = { + k: Int, + seed: Long): (BDV[Double], BDM[Double], List[Int]) = { val (ids: List[Int], cts: Array[Double]) = termCounts match { case v: DenseVector => ((0 until v.size).toList, v.values) case v: SparseVector => (v.indices.toList, v.values) } // Initialize the variational distribution q(theta|gamma) for the mini-batch +val randBasis = new RandBasis(new org.apache.commons.math3.random.MersenneTwister(seed)) val gammad: BDV[Double] = - new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(k) // K + new Gamma(gammaShape, 1.0 / gammaShape)(randBasis).samplesVector(k) // K --- End diff -- nit: Note that the original spacing with the comment was intentional to match lines below. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r184768987 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala --- @@ -95,7 +95,9 @@ private[shared] object SharedParamsCodeGen { ParamDesc[String]("distanceMeasure", "The distance measure. Supported options: 'euclidean'" + " and 'cosine'", Some("org.apache.spark.mllib.clustering.DistanceMeasure.EUCLIDEAN"), isValid = "(value: String) => " + - "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)") + "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)"), + ParamDesc[String]("validationIndicatorCol", "the indicator column name for indicating " + --- End diff -- How about rephrasing the description: "name of the column that indicates whether each row is for training or for validation. False indicates training; true indicates validation." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
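The param under discussion is a boolean column on the training DataFrame telling fit() which rows to train on (false) and which to hold out for validation-based early stopping (true). A sketch of the intended caller-side usage, assuming the setValidationIndicatorCol setter this PR adds to the GBT estimators; the column name "isVal" and the synthetic data are made up for illustration:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object ValidationIndicatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("gbt-validation").getOrCreate()
    import spark.implicits._

    // Synthetic binary-classification data; "isVal" marks roughly 20% of rows as validation rows.
    val data = (0 until 200).map { i =>
      val label = if (i % 2 == 0) 0.0 else 1.0
      (Vectors.dense(i.toDouble, (i % 7).toDouble), label)
    }.toDF("features", "label")
      .withColumn("isVal", rand(0) > 0.8)

    val gbt = new GBTClassifier()
      .setMaxIter(20)
      .setValidationIndicatorCol("isVal") // false = train on the row, true = validate on it

    // fit() trains on the false rows and uses the true rows to decide when to stop adding trees.
    val model = gbt.fit(data)
    println(s"trees actually built: ${model.trees.length}")
    spark.stop()
  }
}
```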