spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception
Repository: spark
Updated Branches:
  refs/heads/branch-2.4 b632e775c -> 085f731ad

[SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

## What changes were proposed in this pull request?

The map view returned by Scala's mapValues is not serializable. To avoid the serialization issue while running PageRank, use map (followed by toMap) instead of mapValues.

Closes #22271 from shahidki31/master_latest.

Authored-by: Shahid
Signed-off-by: Joseph K. Bradley
(cherry picked from commit 3b6591b0b064b13a411e5b8f8ee4883a69c39e2d)
Signed-off-by: Joseph K. Bradley

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/085f731a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/085f731a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/085f731a

Branch: refs/heads/branch-2.4
Commit: 085f731adb9b8c82a2bf4bbcae6d889a967fbd53
Parents: b632e77
Author: Shahid
Authored: Thu Sep 6 09:52:58 2018 -0700
Committer: Joseph K. Bradley
Committed: Thu Sep 6 09:53:07 2018 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/graphx/lib/PageRank.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/085f731a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
index 96b635f..1305c05 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
@@ -198,9 +198,11 @@ object PageRank extends Logging {
     val zero = Vectors.sparse(sources.size, List()).asBreeze
     // map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0
-    val sourcesInitMap = sources.zipWithIndex.toMap.mapValues { i =>
-      Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
-    }
+    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
+      val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
+      (vid, v)
+    }.toMap
+
     val sc = graph.vertices.sparkContext
     val sourcesInitMapBC = sc.broadcast(sourcesInitMap)
     // Initialize the PageRank graph with each edge attribute having
spark git commit: [SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception
Repository: spark
Updated Branches:
  refs/heads/master 7ef6d1daf -> 3b6591b0b

[SPARK-25268][GRAPHX] run Parallel Personalized PageRank throws serialization Exception

## What changes were proposed in this pull request?

The map view returned by Scala's mapValues is not serializable. To avoid the serialization issue while running PageRank, use map (followed by toMap) instead of mapValues.

Closes #22271 from shahidki31/master_latest.

Authored-by: Shahid
Signed-off-by: Joseph K. Bradley

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b6591b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b6591b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b6591b0

Branch: refs/heads/master
Commit: 3b6591b0b064b13a411e5b8f8ee4883a69c39e2d
Parents: 7ef6d1d
Author: Shahid
Authored: Thu Sep 6 09:52:58 2018 -0700
Committer: Joseph K. Bradley
Committed: Thu Sep 6 09:52:58 2018 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/graphx/lib/PageRank.scala | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/3b6591b0/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
index 96b635f..1305c05 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
@@ -198,9 +198,11 @@ object PageRank extends Logging {
     val zero = Vectors.sparse(sources.size, List()).asBreeze
     // map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0
-    val sourcesInitMap = sources.zipWithIndex.toMap.mapValues { i =>
-      Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
-    }
+    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
+      val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
+      (vid, v)
+    }.toMap
+
     val sc = graph.vertices.sparkContext
     val sourcesInitMapBC = sc.broadcast(sourcesInitMap)
     // Initialize the PageRank graph with each edge attribute having
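The root cause called out above is a long-standing Scala collections quirk: Map#mapValues returns a lazy view that wraps the original map and the mapping function, and that view class is not Serializable, so broadcasting it fails. The following standalone sketch (no Spark needed; the object name and toy values are illustrative only) contrasts the two patterns from the diff; the expected results reflect the Scala 2.11/2.12 behavior described in the JIRA.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object MapValuesSerializationSketch {

  /** Returns true if Java serialization of `obj` succeeds. */
  private def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val sources = Array(1L, 2L, 3L)

    // Pattern removed by this commit: mapValues produces a lazy, non-serializable view.
    val viaMapValues = sources.zipWithIndex.toMap.mapValues(i => Array(i.toDouble))

    // Pattern introduced by this commit: build the pairs eagerly and materialize a plain Map.
    val viaMap = sources.zipWithIndex.map { case (vid, i) => (vid, Array(i.toDouble)) }.toMap

    println(s"mapValues view serializable: ${isSerializable(viaMapValues)}") // expected: false
    println(s"map(...).toMap serializable: ${isSerializable(viaMap)}")       // expected: true
  }
}
```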
[GitHub] spark issue #22271: [SPARK-25268][GraphX]run Parallel Personalized PageRank ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22271 test this please
[GitHub] spark issue #22271: [SPARK-25268][GraphX]run Parallel Personalized PageRank ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22271 LGTM I tested this locally and confirmed it fixes the serialization issue. Thank you @shahidki31 ! Merging with master after fresh tests finish
spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3
Repository: spark Updated Branches: refs/heads/branch-2.3 42c1fdd22 -> f5983823e [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values backport to 2.3 ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. (Please fill in changes proposed in this fix) ## How was this patch tested? Unit Test added Closes #8 from huaxingao/spark-25124-2.3. Authored-by: Huaxin Gao Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5983823 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5983823 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5983823 Branch: refs/heads/branch-2.3 Commit: f5983823e9b4a3b4762481306ea071a73f5742fc Parents: 42c1fdd Author: Huaxin Gao Authored: Fri Aug 24 15:41:18 2018 -0700 Committer: Joseph K. Bradley Committed: Fri Aug 24 15:41:18 2018 -0700 -- python/pyspark/ml/feature.py | 4 ++-- python/pyspark/ml/tests.py | 17 + 2 files changed, 19 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5983823/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 04b07e6..a444fe0 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -3673,12 +3673,12 @@ class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReada @since("2.3.0") def getSize(self): """ Gets size param, the size of vectors in `inputCol`.""" -self.getOrDefault(self.size) +return self.getOrDefault(self.size) @since("2.3.0") def setSize(self, value): """ Sets size param, the size of vectors in `inputCol`.""" -self._set(size=value) +return self._set(size=value) if __name__ == "__main__": http://git-wip-us.apache.org/repos/asf/spark/blob/f5983823/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 1af2b91..49912d2 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -678,6 +678,23 @@ class FeatureTests(SparkSessionTestCase): expected2 = [Row(id=0, indexed=0.0), Row(id=1, indexed=1.0)] self.assertEqual(actual2, expected2) +def test_vector_size_hint(self): +df = self.spark.createDataFrame( +[(0, Vectors.dense([0.0, 10.0, 0.5])), + (1, Vectors.dense([1.0, 11.0, 0.5, 0.6])), + (2, Vectors.dense([2.0, 12.0]))], +["id", "vector"]) + +sizeHint = VectorSizeHint( +inputCol="vector", +handleInvalid="skip") +sizeHint.setSize(3) +self.assertEqual(sizeHint.getSize(), 3) + +output = sizeHint.transform(df).head().vector +expected = DenseVector([0.0, 10.0, 0.5]) +self.assertEqual(output, expected) + class HasInducedError(Params): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
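For comparison, the chained-setter style that the added return statements restore on the Python side mirrors the Scala VectorSizeHint API, whose setters return the transformer itself. A rough Scala sketch of the equivalent workflow (the local master URL, column names, and sample rows are assumptions for illustration, not taken from the patch):

```scala
import org.apache.spark.ml.feature.VectorSizeHint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object VectorSizeHintChainingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("size-hint").getOrCreate()

    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 10.0, 0.5)),
      (1, Vectors.dense(1.0, 11.0, 0.5, 0.6)), // wrong size; dropped by handleInvalid = "skip"
      (2, Vectors.dense(2.0, 12.0))            // wrong size; dropped by handleInvalid = "skip"
    )).toDF("id", "vector")

    // Each setter returns this.type, which is what makes chaining possible. The Python bug
    // fixed here was that setSize/getSize forgot to `return`, so the analogous chain broke.
    val sizeHint = new VectorSizeHint()
      .setInputCol("vector")
      .setHandleInvalid("skip")
      .setSize(3)

    println(sizeHint.getSize) // 3
    sizeHint.transform(df).show(truncate = false)
    spark.stop()
  }
}
```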
[GitHub] spark issue #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/8 Awesome, thank you! LGTM Merging with branch-2.3
[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22136 Well, this merged successfully with master but not with 2.3; it seemed to pull in code from another PR, strangely. Would you mind sending a backport PR against branch-2.3? Thank you!
spark git commit: [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values
Repository: spark Updated Branches: refs/heads/master 8ed044928 -> b5e118808 [SPARK-25124][ML] VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return value. Add return. ## How was this patch tested? I tested the changes on my local. Closes #22136 from huaxingao/spark-25124. Authored-by: Huaxin Gao Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5e11880 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5e11880 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5e11880 Branch: refs/heads/master Commit: b5e11880871d6ef31efe3ec42b3caa0fc403e71b Parents: 8ed0449 Author: Huaxin Gao Authored: Thu Aug 23 16:17:27 2018 -0700 Committer: Joseph K. Bradley Committed: Thu Aug 23 16:17:27 2018 -0700 -- python/pyspark/ml/feature.py | 4 ++-- python/pyspark/ml/tests.py | 17 + 2 files changed, 19 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b5e11880/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index ddba738..760aa82 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -3843,12 +3843,12 @@ class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReada @since("2.3.0") def getSize(self): """ Gets size param, the size of vectors in `inputCol`.""" -self.getOrDefault(self.size) +return self.getOrDefault(self.size) @since("2.3.0") def setSize(self, value): """ Sets size param, the size of vectors in `inputCol`.""" -self._set(size=value) +return self._set(size=value) if __name__ == "__main__": http://git-wip-us.apache.org/repos/asf/spark/blob/b5e11880/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index a770bad..5c87d1d 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -844,6 +844,23 @@ class FeatureTests(SparkSessionTestCase): .select(model_default.getOrDefault(model_default.outputCol)).collect() self.assertEqual(len(transformed_list), 5) +def test_vector_size_hint(self): +df = self.spark.createDataFrame( +[(0, Vectors.dense([0.0, 10.0, 0.5])), + (1, Vectors.dense([1.0, 11.0, 0.5, 0.6])), + (2, Vectors.dense([2.0, 12.0]))], +["id", "vector"]) + +sizeHint = VectorSizeHint( +inputCol="vector", +handleInvalid="skip") +sizeHint.setSize(3) +self.assertEqual(sizeHint.getSize(), 3) + +output = sizeHint.transform(df).head().vector +expected = DenseVector([0.0, 10.0, 0.5]) +self.assertEqual(output, expected) + class HasInducedError(Params): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22136 LGTM Merging with master. I'll try to backport it to 2.3 too. Thanks!
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/22136#discussion_r212483950 --- Diff: python/pyspark/ml/tests.py --- @@ -844,6 +844,28 @@ def test_string_indexer_from_labels(self): .select(model_default.getOrDefault(model_default.outputCol)).collect() self.assertEqual(len(transformed_list), 5) +def test_vector_size_hint(self): --- End diff -- Thanks! FYI this still isn't really testing the return value of setSize, but I think it's OK since we don't really do that anywhere else : P and I'm confident in the above change.
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/22136#discussion_r212069129 --- Diff: python/pyspark/ml/tests.py --- @@ -844,6 +844,28 @@ def test_string_indexer_from_labels(self): .select(model_default.getOrDefault(model_default.outputCol)).collect() self.assertEqual(len(transformed_list), 5) +def test_vector_size_hint(self): --- End diff -- This test doesn't test the 2 functions which were buggy. Could you please test those (and simplify the test if possible)? Thanks!
spark git commit: [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds
Repository: spark Updated Branches: refs/heads/master 99d2e4e00 -> 72ecfd095 [SPARK-25149][GRAPHX] Update Parallel Personalized Page Rank to test with large vertexIds ## What changes were proposed in this pull request? runParallelPersonalizedPageRank in graphx checks that `sources` are <= Int.MaxValue.toLong, but this is not actually required. This check seems to have been added because we use sparse vectors in the implementation and sparse vectors cannot be indexed by values > MAX_INT. However we do not ever index the sparse vector by the source vertexIds so this isn't an issue. I've added a test with large vertexIds to confirm this works as expected. ## How was this patch tested? Unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22139 from MrBago/remove-veretexId-check-pppr. Authored-by: Bago Amirbekian Signed-off-by: Joseph K. Bradley Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/72ecfd09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/72ecfd09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/72ecfd09 Branch: refs/heads/master Commit: 72ecfd095062ad61c073f9b97bf3c47644575d60 Parents: 99d2e4e Author: Bago Amirbekian Authored: Tue Aug 21 15:21:55 2018 -0700 Committer: Joseph K. Bradley Committed: Tue Aug 21 15:21:55 2018 -0700 -- .../org/apache/spark/graphx/lib/PageRank.scala | 28 ++--- .../apache/spark/graphx/lib/PageRankSuite.scala | 32 +++- 2 files changed, 35 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/72ecfd09/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala -- diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala index ebd65e8..96b635f 100644 --- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala +++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala @@ -184,9 +184,11 @@ object PageRank extends Logging { * indexed by the position of nodes in the sources list) and * edge attributes the normalized edge weight */ - def runParallelPersonalizedPageRank[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], -numIter: Int, resetProb: Double = 0.15, -sources: Array[VertexId]): Graph[Vector, Double] = { + def runParallelPersonalizedPageRank[VD: ClassTag, ED: ClassTag]( + graph: Graph[VD, ED], + numIter: Int, + resetProb: Double = 0.15, + sources: Array[VertexId]): Graph[Vector, Double] = { require(numIter > 0, s"Number of iterations must be greater than 0," + s" but got ${numIter}") require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" + @@ -194,15 +196,11 @@ object PageRank extends Logging { require(sources.nonEmpty, s"The list of sources must be non-empty," + s" but got ${sources.mkString("[", ",", "]")}") -// TODO if one sources vertex id is outside of the int range -// we won't be able to store its activations in a sparse vector -require(sources.max <= Int.MaxValue.toLong, - s"This implementation currently only works for source vertex ids at most ${Int.MaxValue}") val zero = Vectors.sparse(sources.size, List()).asBreeze -val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) => - val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze - (vid, v) -}.toMap +// map of vid -> vector where for each vid, the _position of vid in source_ is set to 1.0 +val sourcesInitMap = sources.zipWithIndex.toMap.mapValues { i => + 
Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze +} val sc = graph.vertices.sparkContext val sourcesInitMapBC = sc.broadcast(sourcesInitMap) // Initialize the PageRank graph with each edge attribute having @@ -212,13 +210,7 @@ object PageRank extends Logging { .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) } // Set the weight on the edges based on the degree .mapTriplets(e => 1.0 / e.srcAttr, TripletFields.Src) - .mapVertices { (vid, attr) => -if (sourcesInitMapBC.value contains vid) { - sourcesInitMapBC.value(vid) -} else { - zero -} - } + .mapVertices((vid, _) => sourcesInitMapBC.value.getOrElse(vid, zero)) var i = 0 while (i < numIter) { http://git-wip-us.apache.org/repos/asf/spark/blob/72ecfd09/graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala -- diff --git
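The argument in the description is easier to see in isolation: each source's one-hot vector is indexed by the source's position in the sources array (always a small Int), and the possibly huge vertex id is only ever used as a map key. A minimal sketch using plain ml.linalg vectors (the real code converts to Breeze vectors, and the object name here is illustrative):

```scala
import org.apache.spark.ml.linalg.Vectors

object LargeVertexIdSketch {
  def main(args: Array[String]): Unit = {
    // The second vertex id does not fit in an Int, which is exactly the case the
    // removed require(...) used to reject.
    val sources: Array[Long] = Array(4L, Int.MaxValue.toLong + 100L)

    val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
      // `i` is the position of vid in `sources` (0 or 1 here), so it is always a valid
      // sparse-vector index no matter how large `vid` itself is.
      (vid, Vectors.sparse(sources.length, Array(i), Array(1.0)))
    }.toMap

    sourcesInitMap.foreach { case (vid, vec) => println(s"$vid -> $vec") }
  }
}
```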
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/22139 LGTM I'll merge this with master Thanks @MrBago !
spark git commit: [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented.
Repository: spark Updated Branches: refs/heads/master 244bcff19 -> 3cb1b5780 [SPARK-24852][ML] Update spark.ml to use Instrumentation.instrumented. ## What changes were proposed in this pull request? Followup for #21719. Update spark.ml training code to fully wrap instrumented methods and remove old instrumentation APIs. ## How was this patch tested? existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian Closes #21799 from MrBago/new-instrumentation-apis2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3cb1b578 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3cb1b578 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3cb1b578 Branch: refs/heads/master Commit: 3cb1b57809d0b4a93223669f5c10cea8fc53eff6 Parents: 244bcff Author: Bago Amirbekian Authored: Fri Jul 20 12:13:15 2018 -0700 Committer: Joseph K. Bradley Committed: Fri Jul 20 12:13:15 2018 -0700 -- .../classification/DecisionTreeClassifier.scala | 24 +- .../spark/ml/classification/GBTClassifier.scala | 14 +++--- .../spark/ml/classification/LinearSVC.scala | 12 ++--- .../ml/classification/LogisticRegression.scala | 2 +- .../MultilayerPerceptronClassifier.scala| 14 +++--- .../spark/ml/classification/NaiveBayes.scala| 12 ++--- .../spark/ml/classification/OneVsRest.scala | 9 ++-- .../classification/RandomForestClassifier.scala | 13 +++--- .../spark/ml/clustering/BisectingKMeans.scala | 12 ++--- .../spark/ml/clustering/GaussianMixture.scala | 12 ++--- .../org/apache/spark/ml/clustering/KMeans.scala | 9 ++-- .../org/apache/spark/ml/clustering/LDA.scala| 12 ++--- .../org/apache/spark/ml/fpm/FPGrowth.scala | 12 ++--- .../apache/spark/ml/recommendation/ALS.scala| 9 ++-- .../ml/regression/AFTSurvivalRegression.scala | 13 +++--- .../ml/regression/DecisionTreeRegressor.scala | 24 +- .../spark/ml/regression/GBTRegressor.scala | 12 ++--- .../GeneralizedLinearRegression.scala | 12 ++--- .../ml/regression/IsotonicRegression.scala | 12 ++--- .../spark/ml/regression/LinearRegression.scala | 21 - .../ml/regression/RandomForestRegressor.scala | 14 +++--- .../apache/spark/ml/tuning/CrossValidator.scala | 9 ++-- .../spark/ml/tuning/TrainValidationSplit.scala | 9 ++-- .../apache/spark/ml/util/Instrumentation.scala | 47 ++-- 24 files changed, 153 insertions(+), 186 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3cb1b578/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index c9786f1..8a57bfc 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -29,6 +29,7 @@ import org.apache.spark.ml.tree._ import org.apache.spark.ml.tree.DecisionTreeModelReadWrite._ import org.apache.spark.ml.tree.impl.RandomForest import org.apache.spark.ml.util._ +import org.apache.spark.ml.util.Instrumentation.instrumented import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, Strategy => OldStrategy} import org.apache.spark.mllib.tree.model.{DecisionTreeModel => OldDecisionTreeModel} import org.apache.spark.rdd.RDD @@ -96,8 +97,10 @@ class DecisionTreeClassifier @Since("1.4.0") ( @Since("1.6.0") override def setSeed(value: Long): this.type = 
set(seed, value) - override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { -val instr = Instrumentation.create(this, dataset) + override protected def train( + dataset: Dataset[_]): DecisionTreeClassificationModel = instrumented { instr => +instr.logPipelineStage(this) +instr.logDataset(dataset) val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) val numClasses: Int = getNumClasses(dataset) @@ -112,30 +115,27 @@ class DecisionTreeClassifier @Since("1.4.0") ( val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, numClasses) val strategy = getOldStrategy(categoricalFeatures, numClasses) -instr.logParams(maxDepth, maxBins, minInstancesPerNode, minInfoGain, maxMemoryInMB, +instr.logParams(this, maxDepth, maxBins, minInstancesPerNode, minInfoGain, maxMemoryInMB, cacheNodeIds, checkpointInterval, impurity, seed) val trees = RandomForest.run(oldDataset, strategy, numTrees = 1,
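To make the shape of the new API concrete, here is a heavily simplified, standalone sketch of the `instrumented { instr => ... }` pattern that the training methods above now follow. The names below are illustrative stand-ins; the real org.apache.spark.ml.util.Instrumentation is a private Spark API with richer logging.

```scala
import scala.util.control.NonFatal

// Toy stand-in for Spark's Instrumentation class.
final class Instr {
  def logParams(names: String*): Unit = println(s"params: ${names.mkString(", ")}")
  def logNumFeatures(n: Long): Unit = println(s"numFeatures: $n")
  def logSuccess(): Unit = println("training succeeded")
  def logFailure(e: Throwable): Unit = println(s"training failed: ${e.getMessage}")
}

object Instr {
  // The whole training body runs inside the block, so success/failure logging cannot be
  // forgotten by an individual train() implementation, which was easy to do with the old
  // create(...) / logSuccess(model) pair.
  def instrumented[T](body: Instr => T): T = {
    val instr = new Instr
    try {
      val result = body(instr)
      instr.logSuccess()
      result
    } catch {
      case NonFatal(e) =>
        instr.logFailure(e)
        throw e
    }
  }
}

object TrainExample {
  // Mirrors the new style of DecisionTreeClassifier.train shown in the diff above.
  def train(): String = Instr.instrumented { instr =>
    instr.logParams("maxDepth", "maxBins")
    instr.logNumFeatures(10)
    "model" // stand-in for the fitted model
  }
}
```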
[GitHub] spark issue #21799: [SPARK-24852][ML] Update spark.ml to use Instrumentation...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21799 LGTM Merging with master Thanks @MrBago !
spark git commit: [SPARK-24747][ML] Make Instrumentation class more flexible
Repository: spark Updated Branches: refs/heads/master 7688ce88b -> 912634b00 [SPARK-24747][ML] Make Instrumentation class more flexible ## What changes were proposed in this pull request? This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When these APIs are merged, I'll followup with a PR to update the training code to use these new APIs so we can remove the old APIs. These changes are all to private APIs so this PR doesn't make any user facing changes. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian Closes #21719 from MrBago/new-instrumentation-apis. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/912634b0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/912634b0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/912634b0 Branch: refs/heads/master Commit: 912634b004c2302533a8a8501b4ecb803d17e335 Parents: 7688ce8 Author: Bago Amirbekian Authored: Tue Jul 17 13:11:52 2018 -0700 Committer: Joseph K. Bradley Committed: Tue Jul 17 13:11:52 2018 -0700 -- .../ml/classification/LogisticRegression.scala | 8 +- .../spark/ml/tree/impl/RandomForest.scala | 2 +- .../spark/ml/tuning/ValidatorParams.scala | 2 +- .../apache/spark/ml/util/Instrumentation.scala | 128 --- .../apache/spark/mllib/clustering/KMeans.scala | 4 +- 5 files changed, 93 insertions(+), 51 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/912634b0/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 92e342e..25fb9c8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -35,6 +35,7 @@ import org.apache.spark.ml.optim.loss.{L2Regularization, RDDLossFunction} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util._ +import org.apache.spark.ml.util.Instrumentation.instrumented import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics} import org.apache.spark.mllib.linalg.VectorImplicits._ import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer @@ -490,7 +491,7 @@ class LogisticRegression @Since("1.2.0") ( protected[spark] def train( dataset: Dataset[_], - handlePersistence: Boolean): LogisticRegressionModel = { + handlePersistence: Boolean): LogisticRegressionModel = instrumented { instr => val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map { @@ -500,7 +501,8 @@ class LogisticRegression @Since("1.2.0") ( if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) -val instr = Instrumentation.create(this, dataset) +instr.logPipelineStage(this) +instr.logDataset(dataset) instr.logParams(regParam, elasticNetParam, standardization, threshold, maxIter, tol, fitIntercept) @@ -905,8 +907,6 @@ class LogisticRegression @Since("1.2.0") ( objectiveHistory) } model.setSummary(Some(logRegSummary)) -instr.logSuccess(model) -model } @Since("1.4.0") 
http://git-wip-us.apache.org/repos/asf/spark/blob/912634b0/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala index 9058701..bb3f3a0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala @@ -91,7 +91,7 @@ private[spark] object RandomForest extends Logging { numTrees: Int, featureSubsetStrategy: String, seed: Long, - instr: Option[Instrumentation[_]], + instr: Option[Instrumentation], prune: Boolean = true, // exposed for testing only, real trees are always pruned parentUID: Option[String] = None): Array[DecisionTreeModel] = { http://git-wip-us.apache.org/repos/asf/spark/blob/912634b0/mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala
[GitHub] spark issue #21719: [SPARK-24747][ML] Make Instrumentation class more flexib...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21719 LGTM Merging with master Thanks @MrBago !
[GitHub] spark pull request #21719: [SPARK-24747][ML] Make Instrumentation class more...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21719#discussion_r202800969 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -488,9 +488,10 @@ class LogisticRegression @Since("1.2.0") ( train(dataset, handlePersistence) } + import Instrumentation.instrumented --- End diff -- Put import at top of file with the other imports (just to make imports easier to track).
[GitHub] spark pull request #21719: [SPARK-24747][ML] Make Instrumentation class more...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21719#discussion_r202805710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -19,45 +19,60 @@ package org.apache.spark.ml.util import java.util.UUID -import scala.reflect.ClassTag +import scala.util.{Failure, Success, Try} +import scala.util.control.NonFatal import org.json4s._ import org.json4s.JsonDSL._ import org.json4s.jackson.JsonMethods._ import org.apache.spark.internal.Logging -import org.apache.spark.ml.{Estimator, Model} -import org.apache.spark.ml.param.Param +import org.apache.spark.ml.{Estimator, Model, PipelineStage} +import org.apache.spark.ml.param.{Param, Params} import org.apache.spark.rdd.RDD import org.apache.spark.sql.Dataset import org.apache.spark.util.Utils /** * A small wrapper that defines a training session for an estimator, and some methods to log * useful information during this session. - * - * A new instance is expected to be created within fit(). - * - * @param estimator the estimator that is being fit - * @param dataset the training dataset - * @tparam E the type of the estimator */ -private[spark] class Instrumentation[E <: Estimator[_]] private ( -val estimator: E, -val dataset: RDD[_]) extends Logging { +private[spark] class Instrumentation extends Logging { private val id = UUID.randomUUID() - private val prefix = { + private val shortId = id.toString.take(8) + private val prefix = s"[$shortId] " + + // TODO: update spark.ml to use new Instrumentation APIs and remove this constructor + var stage: Params = _ --- End diff -- I'd recommend we either plan to remove "stage" or change "logPipelineStage" so it only allows setting "stage" once. If we go with the former, how about leaving a note to remove "stage" once spark.ml code is migrated to use the new logParams() method? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21719: [SPARK-24747][ML] Make Instrumentation class more flexib...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21719 jenkins test this please
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20442 @huaxingao Thanks for this follow-up! I realized that https://github.com/apache/spark/pull/19715 introduced a breaking change which we missed in Spark 2.3 QA: In Spark 2.2, a user could set inputCol but not set outputCol (since outputCol has a default value). The new check causes such user code to start failing in Spark 2.3. Since you're already working on this follow-up, would you mind adding a unit test which checks this? (Setting inputCol but not outputCol, and making sure that works.)
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21129 I made a JIRA for the Python part of this: https://issues.apache.org/jira/browse/SPARK-24333
spark git commit: [SPARK-7132][ML] Add fit with validation set to spark.ml GBT
Repository: spark Updated Branches: refs/heads/master a33dcf4a0 -> ffaefe755 [SPARK-7132][ML] Add fit with validation set to spark.ml GBT ## What changes were proposed in this pull request? Add fit with validation set to spark.ml GBT ## How was this patch tested? Will add later. Author: WeichenXuCloses #21129 from WeichenXu123/gbt_fit_validation. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ffaefe75 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ffaefe75 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ffaefe75 Branch: refs/heads/master Commit: ffaefe755e20cb94e27f07b233615a4bbb476679 Parents: a33dcf4 Author: WeichenXu Authored: Mon May 21 13:05:17 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 21 13:05:17 2018 -0700 -- .../spark/ml/classification/GBTClassifier.scala | 38 +--- .../ml/param/shared/SharedParamsCodeGen.scala | 5 +- .../spark/ml/param/shared/sharedParams.scala| 17 +++ .../spark/ml/regression/GBTRegressor.scala | 31 +++-- .../org/apache/spark/ml/tree/treeParams.scala | 41 - .../ml/classification/GBTClassifierSuite.scala | 46 +++ .../spark/ml/regression/GBTRegressorSuite.scala | 48 +++- project/MimaExcludes.scala | 13 +- 8 files changed, 213 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ffaefe75/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 3fb6d1e..337133a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala @@ -146,12 +146,21 @@ class GBTClassifier @Since("1.4.0") ( @Since("1.4.0") def setLossType(value: String): this.type = set(lossType, value) + /** @group setParam */ + @Since("2.4.0") + def setValidationIndicatorCol(value: String): this.type = { +set(validationIndicatorCol, value) + } + override protected def train(dataset: Dataset[_]): GBTClassificationModel = { val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) + +val withValidation = isDefined(validationIndicatorCol) && $(validationIndicatorCol).nonEmpty + // We copy and modify this from Classifier.extractLabeledPoints since GBT only supports // 2 classes now. This lets us provide a more precise error message. 
-val oldDataset: RDD[LabeledPoint] = +val convert2LabeledPoint = (dataset: Dataset[_]) => { dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { case Row(label: Double, features: Vector) => require(label == 0 || label == 1, s"GBTClassifier was given" + @@ -159,7 +168,18 @@ class GBTClassifier @Since("1.4.0") ( s" GBTClassifier currently only supports binary classification.") LabeledPoint(label, features) } -val numFeatures = oldDataset.first().features.size +} + +val (trainDataset, validationDataset) = if (withValidation) { + ( + convert2LabeledPoint(dataset.filter(not(col($(validationIndicatorCol), +convert2LabeledPoint(dataset.filter(col($(validationIndicatorCol + ) +} else { + (convert2LabeledPoint(dataset), null) +} + +val numFeatures = trainDataset.first().features.size val boostingStrategy = super.getOldBoostingStrategy(categoricalFeatures, OldAlgo.Classification) val numClasses = 2 @@ -169,15 +189,21 @@ class GBTClassifier @Since("1.4.0") ( s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}") } -val instr = Instrumentation.create(this, oldDataset) +val instr = Instrumentation.create(this, dataset) instr.logParams(labelCol, featuresCol, predictionCol, impurity, lossType, maxDepth, maxBins, maxIter, maxMemoryInMB, minInfoGain, minInstancesPerNode, - seed, stepSize, subsamplingRate, cacheNodeIds, checkpointInterval, featureSubsetStrategy) + seed, stepSize, subsamplingRate, cacheNodeIds, checkpointInterval, featureSubsetStrategy, + validationIndicatorCol) instr.logNumFeatures(numFeatures) instr.logNumClasses(numClasses) -val (baseLearners, learnerWeights) = GradientBoostedTrees.run(oldDataset, boostingStrategy, - $(seed), $(featureSubsetStrategy)) +val (baseLearners, learnerWeights) = if
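As a usage illustration (not part of the patch), a caller on Spark 2.4+ marks the held-out rows with a boolean column and points validationIndicatorCol at it; boosting then stops early once the validation error stops improving. The input path, split ratio, and column name below are assumptions:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object GBTWithValidationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("gbt-validation").getOrCreate()

    // Assumes a libsvm-formatted file with the usual "label" and "features" columns.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Mark roughly 20% of the rows as the validation set.
    val Array(train, validation) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val withIndicator = train.withColumn("isVal", lit(false))
      .union(validation.withColumn("isVal", lit(true)))

    val gbt = new GBTClassifier()
      .setMaxIter(50)
      .setValidationIndicatorCol("isVal") // the new param added by this commit

    val model = gbt.fit(withIndicator)
    // Early stopping may train fewer than maxIter trees.
    println(s"trees actually trained: ${model.trees.length}")
    spark.stop()
  }
}
```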
[GitHub] spark issue #21129: [SPARK-7132][ML] Add fit with validation set to spark.ml...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21129 LGTM Merging with master Thanks!
[GitHub] spark issue #21163: [SPARK-24097][ML] Instrumentation improvements - RandomF...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21163 jenkins test this please
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 Done! Here it is: https://github.com/apache/spark/pull/21358 @smurakozi Could you please close this issue and help review the new PR if you have time? Thanks!
[GitHub] spark pull request #21358: [SPARK-22884][ML] ML tests for StructuredStreamin...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21358

[SPARK-22884][ML] ML tests for StructuredStreaming: spark.ml.clustering

## What changes were proposed in this pull request?

Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.

This PR is a new version of https://github.com/apache/spark/pull/20319

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark smurakozi-SPARK-22884

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21358.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21358

commit 496bc7ba688260317fe5907a8057347a5b184490
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-18T19:03:33Z

    Converted all clustering tests to check streaming

commit 58239ba3edc0c60900d842caf3d38e30507d885f
Author: Sandor Murakozi <smurakozi@...>
Date: 2018-01-18T20:12:46Z

    formatting, nits

commit eb66659e8b65de9d539584b935899838e1dd9ed8
Author: Joseph K. Bradley <joseph@...>
Date: 2018-05-17T21:26:17Z

    cleanups

commit 95f6f2919a5ab0853434592000f5b811915ca15f
Author: Joseph K. Bradley <joseph@...>
Date: 2018-05-17T22:11:20Z

    more cleanups
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 I'm going to take this over to get this done, but @smurakozi you'll be the primary author. I'll link the PR here in a minute
[GitHub] spark issue #21163: [SPARK-24097][ML] Instrumentation improvements - RandomF...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21163 LGTM pending fresh tests Thanks!
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21344 Whoops! This was merged linked to the wrong JIRA. It should have been: https://issues.apache.org/jira/browse/SPARK-24310 for the record.
spark git commit: [SPARK-24114] Add instrumentation to FPGrowth.
Repository: spark
Updated Branches:
  refs/heads/master a7a9b1837 -> 439c69511

[SPARK-24114] Add instrumentation to FPGrowth.

## What changes were proposed in this pull request?

Have FPGrowth keep track of model training using the Instrumentation class.

## How was this patch tested?

manually

Author: Bago Amirbekian

Closes #21344 from MrBago/fpgrowth-instr.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/439c6951
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/439c6951
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/439c6951

Branch: refs/heads/master
Commit: 439c69511812776cb4b82956547ce958d0669c52
Parents: a7a9b18
Author: Bago Amirbekian
Authored: Thu May 17 13:42:10 2018 -0700
Committer: Joseph K. Bradley
Committed: Thu May 17 13:42:10 2018 -0700

----------------------------------------------------------------------
 mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/439c6951/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
index 0bf405d..d7fbe28 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala
@@ -161,6 +161,8 @@ class FPGrowth @Since("2.2.0") (
   private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
     val handlePersistence = dataset.storageLevel == StorageLevel.NONE
+    val instr = Instrumentation.create(this, dataset)
+    instr.logParams(params: _*)
     val data = dataset.select($(itemsCol))
     val items = data.where(col($(itemsCol)).isNotNull).rdd.map(r => r.getSeq[Any](0).toArray)
     val mllibFP = new MLlibFPGrowth().setMinSupport($(minSupport))
@@ -183,7 +185,9 @@ class FPGrowth @Since("2.2.0") (
       items.unpersist()
     }
-    copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this)
+    val model = copyValues(new FPGrowthModel(uid, frequentItems)).setParent(this)
+    instr.logSuccess(model)
+    model
   }

   @Since("2.2.0")
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21344 LGTM It'll be nice when we move the implementation to spark.ml in the future so that we can log more info, but this is good for now. Thanks @MrBago and @ludatabricks ! Merging with master
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21344 I'll take a look now
[GitHub] spark pull request #21347: [SPARK-24290][ML] add support for Array input for...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21347#discussion_r189057734 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala --- @@ -125,6 +125,19 @@ private[spark] class Instrumentation[E <: Estimator[_]] private ( log(compact(render(name -> value))) } + def logNamedValue(name: String, value: Array[String]): Unit = { +log(compact(render(name -> value.toSeq))) --- End diff -- I see, so you're pointing out that our current approach is inconsistent: JSONifying array values (for clustering) but not JSONifying scalar values (Long, etc.)? I don't have a good sense of what's best, but if we just have to pick something, I'd suggest not JSONifying values since doing so creates slightly longer strings to log (with extra quotes) and since that maintains a stable format from Spark 2.3.
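For reference, this is roughly what the two styles under discussion produce once rendered with json4s (the param names are made up for illustration): scalars are logged bare, while an array rendered as JSON gains brackets and per-element quotes.

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object LogNamedValueSketch {
  def main(args: Array[String]): Unit = {
    // Scalar value, as logged today for Long/Int params.
    println(compact(render("numClasses" -> 2)))          // {"numClasses":2}
    // Array value rendered as a JSON array, the new overload being discussed.
    println(compact(render("sources" -> Seq("a", "b")))) // {"sources":["a","b"]}
  }
}
```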
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r189052646 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,37 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, --- End diff -- This doc is not quite accurate. Can you please update it to: ``` Threshold for stopping early when fit with validation is used. (This parameter is ignored when fit without validation is used.) The decision to stop early is decided based on this logic: If the current loss on the validation set is greater than 0.01, the diff of validation error is compared to relative tolerance which is validationTol * (current loss on the validation set). If the current loss on the validation set is less than or equal to 0.01, the diff of validation error is compared to absolute tolerance which is validationTol * 0.01. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
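In code form, the stopping rule spelled out in that suggested doc is roughly the following predicate (a paraphrase of the check in GradientBoostedTrees.runWithValidation, shown as a sketch rather than a verbatim excerpt):

```scala
// Stop once the improvement in validation error drops below a tolerance that is
// relative (validationTol * currentError) while the loss is still above 0.01, and
// absolute (validationTol * 0.01) once the loss is at or below 0.01.
def shouldStopEarly(previousError: Double, currentError: Double, validationTol: Double): Boolean = {
  val tolerance = validationTol * math.max(currentError, 0.01)
  previousError - currentError < tolerance
}
```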
[GitHub] spark pull request #21090: [SPARK-24026][ML] Add Power Iteration Clustering ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21090#discussion_r188813735 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.clustering + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.clustering.{PowerIterationClustering => MLlibPowerIterationClustering} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types._ + +/** + * Common params for PowerIterationClustering + */ +private[clustering] trait PowerIterationClusteringParams extends Params with HasMaxIter + with HasPredictionCol { + + /** + * The number of clusters to create (k). Must be 1. Default: 2. + * @group param + */ + @Since("2.4.0") + final val k = new IntParam(this, "k", "The number of clusters to create. " + +"Must be > 1.", ParamValidators.gt(1)) + + /** @group getParam */ + @Since("2.4.0") + def getK: Int = $(k) + + /** + * Param for the initialization algorithm. This can be either "random" to use a random vector + * as vertex properties, or "degree" to use a normalized sum of similarities with other vertices. + * Default: random. + * @group expertParam + */ + @Since("2.4.0") + final val initMode = { +val allowedParams = ParamValidators.inArray(Array("random", "degree")) +new Param[String](this, "initMode", "The initialization algorithm. This can be either " + + "'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum " + + "of similarities with other vertices. Supported options: 'random' and 'degree'.", + allowedParams) + } + + /** @group expertGetParam */ + @Since("2.4.0") + def getInitMode: String = $(initMode) + + /** + * Param for the name of the input column for vertex IDs. + * Default: "id" + * @group param + */ + @Since("2.4.0") + val idCol = new Param[String](this, "idCol", "Name of the input column for vertex IDs.", +(value: String) => value.nonEmpty) + + setDefault(idCol, "id") + + /** @group getParam */ + @Since("2.4.0") + def getIdCol: String = getOrDefault(idCol) + + /** + * Param for the name of the input column for neighbors in the adjacency list representation. 
+ * Default: "neighbors" + * @group param + */ + @Since("2.4.0") + val neighborsCol = new Param[String](this, "neighborsCol", +"Name of the input column for neighbors in the adjacency list representation.", +(value: String) => value.nonEmpty) + + setDefault(neighborsCol, "neighbors") + + /** @group getParam */ + @Since("2.4.0") + def getNeighborsCol: String = $(neighborsCol) + + /** + * Param for the name of the input column for neighbors in the adjacency list representation. + * Default: "similarities" + * @group param + */ + @Since("2.4.0") + val similaritiesCol = new Param[String](this, "similaritiesCol", +"Name of the input column for neighbors in the adjacency list representation.", +(value: String) => value.nonEmpty) + --- End diff -- No, it's meant to
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188813405 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- @WeichenXu123 Do you have time to send a PR to update this API? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r188813297 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- Oh, I think you're right @mengxr . That approach sounds good. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
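For orientation, the spark.mllib implementation that this new spark.ml wrapper delegates to already exposes the setter-based style being debated here; a condensed sketch of its documented usage (the toy sequences are arbitrary):

```scala
import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

object PrefixSpanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("prefixspan").getOrCreate()
    val sc = spark.sparkContext

    // Each sequence is an array of itemsets; each itemset is an array of items.
    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()

    val model = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .run(sequences)

    model.freqSequences.collect().foreach { fs =>
      println(fs.sequence.map(_.mkString("[", ",", "]")).mkString("<", ",", ">") + " freq: " + fs.freq)
    }
    spark.stop()
  }
}
```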
spark git commit: [SPARK-22210][ML] Add seed for LDA variationalTopicInference
Repository: spark Updated Branches: refs/heads/master 991726f31 -> bfd75cdfb [SPARK-22210][ML] Add seed for LDA variationalTopicInference ## What changes were proposed in this pull request? - Add seed parameter for variationalTopicInference - Add seed for calling variationalTopicInference in submitMiniBatch - Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution. ## How was this patch tested? Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANGCloses #21183 from ludatabricks/SPARK-22210. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfd75cdf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfd75cdf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfd75cdf Branch: refs/heads/master Commit: bfd75cdfb22a8c2fb005da597621e1ccd3990e82 Parents: 991726f Author: Lu WANG Authored: Wed May 16 17:54:06 2018 -0700 Committer: Joseph K. Bradley Committed: Wed May 16 17:54:06 2018 -0700 -- .../org/apache/spark/ml/clustering/LDA.scala| 6 ++- .../spark/mllib/clustering/LDAModel.scala | 34 +--- .../spark/mllib/clustering/LDAOptimizer.scala | 42 +++- .../apache/spark/ml/clustering/LDASuite.scala | 6 +++ 4 files changed, 64 insertions(+), 24 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bfd75cdf/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala index afe599c..fed42c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala @@ -569,10 +569,14 @@ abstract class LDAModel private[ml] ( class LocalLDAModel private[ml] ( uid: String, vocabSize: Int, -@Since("1.6.0") override private[clustering] val oldLocalModel: OldLocalLDAModel, +private[clustering] val oldLocalModel_ : OldLocalLDAModel, sparkSession: SparkSession) extends LDAModel(uid, vocabSize, sparkSession) { + override private[clustering] def oldLocalModel: OldLocalLDAModel = { +oldLocalModel_.setSeed(getSeed) + } + @Since("1.6.0") override def copy(extra: ParamMap): LocalLDAModel = { val copied = new LocalLDAModel(uid, vocabSize, oldLocalModel, sparkSession) http://git-wip-us.apache.org/repos/asf/spark/blob/bfd75cdf/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala index b8a6e94..f915062 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala @@ -32,7 +32,7 @@ import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors} import org.apache.spark.mllib.util.{Loader, Saveable} import org.apache.spark.rdd.RDD import org.apache.spark.sql.{Row, SparkSession} -import org.apache.spark.util.BoundedPriorityQueue +import org.apache.spark.util.{BoundedPriorityQueue, Utils} /** * Latent Dirichlet Allocation (LDA) model. 
@@ -194,6 +194,8 @@ class LocalLDAModel private[spark] ( override protected[spark] val gammaShape: Double = 100) extends LDAModel with Serializable { + private var seed: Long = Utils.random.nextLong() + @Since("1.3.0") override def k: Int = topics.numCols @@ -216,6 +218,21 @@ class LocalLDAModel private[spark] ( override protected def formatVersion = "1.0" + /** + * Random seed for cluster initialization. + */ + @Since("2.4.0") + def getSeed: Long = seed + + /** + * Set the random seed for cluster initialization. + */ + @Since("2.4.0") + def setSeed(seed: Long): this.type = { +this.seed = seed +this + } + @Since("1.5.0") override def save(sc: SparkContext, path: String): Unit = { LocalLDAModel.SaveLoadV1_0.save(sc, path, topicsMatrix, docConcentration, topicConcentration, @@ -298,6 +315,7 @@ class LocalLDAModel private[spark] ( // by topic (columns of lambda) val Elogbeta = LDAUtils.dirichletExpectation(lambda.t).t val ElogbetaBc = documents.sparkContext.broadcast(Elogbeta) +val
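For context, the user-visible effect of threading the seed through variationalTopicInference is that topic inference becomes repeatable. A minimal sketch of that effect via the spark.ml API, assuming an active SparkSession `spark`; the toy data and column names are illustrative and not taken from the patch:

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.linalg.Vectors

// Tiny term-count "corpus"; "features" is the spark.ml default features column name.
val corpus = spark.createDataFrame(Seq(
  (0L, Vectors.dense(1.0, 0.0, 3.0)),
  (1L, Vectors.dense(0.0, 2.0, 1.0)),
  (2L, Vectors.dense(4.0, 0.0, 0.0))
)).toDF("id", "features")

val model = new LDA().setK(2).setMaxIter(5).setSeed(42L).fit(corpus)
// With the seed now reused by variationalTopicInference, repeated transforms agree.
val first = model.transform(corpus).select("topicDistribution").collect()
val second = model.transform(corpus).select("topicDistribution").collect()
assert(first.toSeq == second.toSeq)
```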
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21183 Thanks for checking this manually. Since the test sometimes fails, let's leave it. LGTM. Merging with master. Thanks @ludatabricks and @mengxr! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API
Repository: spark Updated Branches: refs/heads/master 6b94420f6 -> 8a13c5096 [SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API ## What changes were proposed in this pull request? See SPARK-23455 for reference. Now default params in ML are saved separately in metadata file in Scala. We must change it for Python for Spark 2.4.0 as well in order to keep them in sync. ## How was this patch tested? Added test. Author: Liang-Chi HsiehCloses #21153 from viirya/SPARK-24058. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a13c509 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a13c509 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a13c509 Branch: refs/heads/master Commit: 8a13c5096898f95d1dfcedaf5d31205a1cbf0a19 Parents: 6b94420 Author: Liang-Chi Hsieh Authored: Tue May 15 16:50:09 2018 -0700 Committer: Joseph K. Bradley Committed: Tue May 15 16:50:09 2018 -0700 -- python/pyspark/ml/tests.py | 38 ++ python/pyspark/ml/util.py | 30 -- 2 files changed, 66 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8a13c509/python/pyspark/ml/tests.py -- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 0935931..0dde0db 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -1595,6 +1595,44 @@ class PersistenceTest(SparkSessionTestCase): self.assertEqual(lr.uid, lr3.uid) self.assertEqual(lr.extractParamMap(), lr3.extractParamMap()) +def test_default_read_write_default_params(self): +lr = LogisticRegression() +self.assertFalse(lr.isSet(lr.getParam("threshold"))) + +lr.setMaxIter(50) +lr.setThreshold(.75) + +# `threshold` is set by user, default param `predictionCol` is not set by user. +self.assertTrue(lr.isSet(lr.getParam("threshold"))) +self.assertFalse(lr.isSet(lr.getParam("predictionCol"))) +self.assertTrue(lr.hasDefault(lr.getParam("predictionCol"))) + +writer = DefaultParamsWriter(lr) +metadata = json.loads(writer._get_metadata_to_save(lr, self.sc)) +self.assertTrue("defaultParamMap" in metadata) + +reader = DefaultParamsReadable.read() +metadataStr = json.dumps(metadata, separators=[',', ':']) +loadedMetadata = reader._parseMetaData(metadataStr, ) +reader.getAndSetParams(lr, loadedMetadata) + +self.assertTrue(lr.isSet(lr.getParam("threshold"))) +self.assertFalse(lr.isSet(lr.getParam("predictionCol"))) +self.assertTrue(lr.hasDefault(lr.getParam("predictionCol"))) + +# manually create metadata without `defaultParamMap` section. +del metadata['defaultParamMap'] +metadataStr = json.dumps(metadata, separators=[',', ':']) +loadedMetadata = reader._parseMetaData(metadataStr, ) +with self.assertRaisesRegexp(AssertionError, "`defaultParamMap` section not found"): +reader.getAndSetParams(lr, loadedMetadata) + +# Prior to 2.4.0, metadata doesn't have `defaultParamMap`. 
+metadata['sparkVersion'] = '2.3.0' +metadataStr = json.dumps(metadata, separators=[',', ':']) +loadedMetadata = reader._parseMetaData(metadataStr, ) +reader.getAndSetParams(lr, loadedMetadata) + class LDATest(SparkSessionTestCase): http://git-wip-us.apache.org/repos/asf/spark/blob/8a13c509/python/pyspark/ml/util.py -- diff --git a/python/pyspark/ml/util.py b/python/pyspark/ml/util.py index a486c6a..9fa8566 100644 --- a/python/pyspark/ml/util.py +++ b/python/pyspark/ml/util.py @@ -30,6 +30,7 @@ if sys.version > '3': from pyspark import SparkContext, since from pyspark.ml.common import inherit_doc from pyspark.sql import SparkSession +from pyspark.util import VersionUtils def _jvm(): @@ -396,6 +397,7 @@ class DefaultParamsWriter(MLWriter): - sparkVersion - uid - paramMap +- defaultParamMap (since 2.4.0) - (optionally, extra metadata) :param extraMetadata: Extra metadata to be saved at same level as uid, paramMap, etc. :param paramMap: If given, this is saved in the "paramMap" field. @@ -417,15 +419,24 @@ class DefaultParamsWriter(MLWriter): """ uid = instance.uid cls = instance.__module__ + '.' + instance.__class__.__name__ -params = instance.extractParamMap() + +# User-supplied param values +params = instance._paramMap jsonParams = {} if paramMap
[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21153 OK thanks @viirya ! Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r187129517 --- Diff: python/pyspark/ml/util.py --- @@ -396,6 +397,7 @@ def saveMetadata(instance, path, sc, extraMetadata=None, paramMap=None): - sparkVersion - uid - paramMap +- defalutParamMap (since 2.4.0) --- End diff -- typo: default --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r188081089 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging { None } -val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex --- End diff -- fix scala style: ``` val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex { (index, docs) => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) val logphatPartOption = logphatPartOptionBase() var nonEmptyDocCount: Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k, seed + index) stat(::, ids) := stat(::, ids) + sstats logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } Iterator((stat, logphatPartOption, nonEmptyDocCount)) } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21153 Would you mind rebasing this off of the upstream master branch? I'm having trouble running the tests for this PR locally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21274: [SPARK-24213][ML] Fix for Int id type for PowerIt...
Github user jkbradley closed the pull request at: https://github.com/apache/spark/pull/21274 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21274: [SPARK-24213][ML] Fix for Int id type for PowerIteration...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21274 This issue actually brings up a problem with the Transformer approach for PIC. Just commented more here: https://issues.apache.org/jira/browse/SPARK-15784 Thank you for pushing back! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21119: [SPARK-19826][ML][PYTHON]add spark.ml Python API for PIC
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21119 I think we messed up with the original PIC API. Could you please check out my comment here https://issues.apache.org/jira/browse/SPARK-15784 ? If others agree, I'll revert the Scala API and we can work on adding a modified version. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r187217859 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -622,11 +623,11 @@ object LocalLDAModel extends MLReadable[LocalLDAModel] { val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration") val matrixConverted = MLUtils.convertMatrixColumnsToML(vectorConverted, "topicsMatrix") val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector, - topicConcentration: Double, gammaShape: Double) = + topicConcentration: Double, gammaShape: Double, seed: Long) = --- End diff -- This will break backwards compatibility of ML persistence (when users try to load LDAModels saved using past versions of Spark). Could you please test this manually by saving a LocalLDAModel using Spark 2.3 and loading it with a build of your PR? You can fix this by checking for the Spark version (in the `metadata`) and loading the seed for Spark >= 2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
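A sketch of the version gate being suggested here, written as a hypothetical standalone helper; the "seed" field name, the argument names, and the Option return are illustrative, and the final patch may wire this differently inside the reader:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.util.VersionUtils

// Return the persisted seed only for models written by Spark 2.4+, where the field exists;
// older model files simply keep the default (random) seed.
def loadedSeed(writerSparkVersion: String, dataHead: Row): Option[Long] = {
  val (major, minor) = VersionUtils.majorMinorVersion(writerSparkVersion)
  if (major > 2 || (major == 2 && minor >= 4)) Some(dataHead.getAs[Long]("seed")) else None
}
```

The reader would then call `setSeed` on the rebuilt mllib model only when a seed was found, so files saved by Spark 2.3 keep loading unchanged.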
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r187216371 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala --- @@ -252,6 +252,15 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead val lda = new LDA() testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings, LDASuite.allParamSettings, checkModelData) + +def checkModelDataWithDataset(model: LDAModel, model2: LDAModel, --- End diff -- style: Please fix this to match other multi-line method headers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21265: [SPARK-24146][PySpark][ML] spark.ml parity for sequentia...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21265 Let's wait on this until we make the decision in the last thread in https://github.com/apache/spark/pull/20973 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r187216080 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset: Dataset[_], + sequenceCol: String, --- End diff -- I agree in general, but I don’t think it’s a big deal for PrefixSpan. I think of our current static method as a temporary workaround until we do the work to build a Model which can make meaningful predictions. This will mean that further PrefixSpan improvements may be blocked on this Model work, but I think that’s OK since predictions should be the next priority for PrefixSpan. 
Once we have a Model, I recommend we deprecate the current static method. I'm also OK with changing this to use setters, but then we should name it something else so that we can replace it with an Estimator + Model pair later on. I'd suggest "PrefixSpanBuilder." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
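Purely for illustration, a setter-based wrapper along the lines floated above might look like the following sketch; the `PrefixSpanBuilder` name, its setters, and the defaults are hypothetical (nothing like this exists in Spark at this point) and simply delegate to the existing static helper:

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical builder-style alternative to the static method; names and defaults are illustrative.
class PrefixSpanBuilder {
  private var sequenceCol: String = "sequence"
  private var minSupport: Double = 0.1
  private var maxPatternLength: Int = 10
  private var maxLocalProjDBSize: Long = 32000000L

  def setSequenceCol(value: String): this.type = { sequenceCol = value; this }
  def setMinSupport(value: Double): this.type = { minSupport = value; this }
  def setMaxPatternLength(value: Int): this.type = { maxPatternLength = value; this }
  def setMaxLocalProjDBSize(value: Long): this.type = { maxLocalProjDBSize = value; this }

  // Delegating keeps the door open to swapping in an Estimator/Model pair later.
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame =
    PrefixSpan.findFrequentSequentialPatterns(
      dataset, sequenceCol, minSupport, maxPatternLength, maxLocalProjDBSize)
}
```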
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r187187330 --- Diff: python/pyspark/ml/fpm.py --- @@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items", def _create_model(self, java_model): return FPGrowthModel(java_model) + + +class PrefixSpan(object): +""" +.. note:: Experimental + +A parallel PrefixSpan algorithm to mine frequent sequential patterns. +The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns +Efficiently by Prefix-Projected Pattern Growth +(see http://doi.org/10.1109/ICDE.2001.914830;>here). + +.. versionadded:: 2.4.0 + +""" +@staticmethod +@since("2.4.0") +def findFrequentSequentialPatterns(dataset, + sequenceCol, + minSupport, + maxPatternLength, + maxLocalProjDBSize): +""" +.. note:: Experimental +Finds the complete set of frequent sequential patterns in the input sequences of itemsets. + +:param dataset: A dataset or a dataframe containing a sequence column which is +`Seq[Seq[_]]` type. +:param sequenceCol: The name of the sequence column in dataset, rows with nulls in this +column are ignored. +:param minSupport: The minimal support level of the sequential pattern, any pattern that + appears more than (minSupport * size-of-the-dataset) times will be + output (recommended value: `0.1`). +:param maxPatternLength: The maximal length of the sequential pattern + (recommended value: `10`). +:param maxLocalProjDBSize: The maximum number of items (including delimiters used in the + internal storage format) allowed in a projected database before + local processing. If a projected database exceeds this size, + another iteration of distributed prefix growth is run + (recommended value: `3200`). +:return: A `DataFrame` that contains columns of sequence and corresponding frequency. + The schema of it will be: + - `sequence: Seq[Seq[T]]` (T is the item type) + - `freq: Long` + +>>> from pyspark.ml.fpm import PrefixSpan +>>> from pyspark.sql import Row +>>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]), --- End diff -- My 2 cents: That sounds like a judgement call: If it's to explain the behavior more clearly, then that sounds reasonable. I feel like it's pretty clear how nulls are treated from the doc. maxPatternLength might benefit from an example. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21203: [SPARK-24131][PySpark] Add majorMinorVersion API to PySp...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21203 It's OK but would you mind fixing it @viirya before we use it in https://github.com/apache/spark/pull/21153 ? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs
Repository: spark Updated Branches: refs/heads/master 628c7b517 -> 7aaa148f5 [SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs ## What changes were proposed in this pull request? Provide evaluateEachIteration method or equivalent for spark.ml GBTs. ## How was this patch tested? UT. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: WeichenXuCloses #21097 from WeichenXu123/GBTeval. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7aaa148f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7aaa148f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7aaa148f Branch: refs/heads/master Commit: 7aaa148f593470b2c32221b69097b8b54524eb74 Parents: 628c7b5 Author: WeichenXu Authored: Wed May 9 11:09:19 2018 -0700 Committer: Joseph K. Bradley Committed: Wed May 9 11:09:19 2018 -0700 -- .../spark/ml/classification/GBTClassifier.scala | 15 + .../spark/ml/regression/GBTRegressor.scala | 17 ++- .../org/apache/spark/ml/tree/treeParams.scala | 6 +++- .../ml/classification/GBTClassifierSuite.scala | 29 +- .../spark/ml/regression/GBTRegressorSuite.scala | 32 ++-- 5 files changed, 94 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala index 0aa24f0..3fb6d1e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala @@ -334,6 +334,21 @@ class GBTClassificationModel private[ml]( // hard coded loss, which is not meant to be changed in the model private val loss = getOldLossType + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. 
+ */ + @Since("2.4.0") + def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = { +val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { + case Row(label: Double, features: Vector) => LabeledPoint(label, features) +} +GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, loss, + OldAlgo.Classification +) + } + @Since("2.0.0") override def write: MLWriter = new GBTClassificationModel.GBTClassificationModelWriter(this) } http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala index 8598e80..d7e054b 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala @@ -34,7 +34,7 @@ import org.apache.spark.ml.util.DefaultParamsReader.Metadata import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo} import org.apache.spark.mllib.tree.model.{GradientBoostedTreesModel => OldGBTModel} import org.apache.spark.rdd.RDD -import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.{DataFrame, Dataset, Row} import org.apache.spark.sql.functions._ /** @@ -269,6 +269,21 @@ class GBTRegressionModel private[ml]( new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights) } + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. + * @param loss The loss function used to compute error. Supported options: squared, absolute + */ + @Since("2.4.0") + def evaluateEachIteration(dataset: Dataset[_], loss: String): Array[Double] = { +val data = dataset.select(col($(labelCol)), col($(featuresCol))).rdd.map { + case Row(label: Double, features: Vector) => LabeledPoint(label, features) +} +GradientBoostedTrees.evaluateEachIteration(data, trees, treeWeights, + convertToOldLossType(loss), OldAlgo.Regression) + } + @Since("2.0.0") override def write: MLWriter = new GBTRegressionModel.GBTRegressionModelWriter(this) } http://git-wip-us.apache.org/repos/asf/spark/blob/7aaa148f/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
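A small usage sketch of the new method, assuming an active SparkSession `spark` and toy data; in practice a held-out validation DataFrame would be passed rather than the training set:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors

val train = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(0.1, 0.9)),
  (1.0, Vectors.dense(0.9, 0.2))
)).toDF("label", "features")

val model = new GBTClassifier().setMaxIter(10).setMaxDepth(2).fit(train)
// One loss value per boosting iteration; useful for picking a smaller ensemble after training.
val lossPerIteration: Array[Double] = model.evaluateEachIteration(train)
val bestNumTrees = lossPerIteration.zipWithIndex.minBy(_._1)._2 + 1
```

For `GBTRegressionModel`, per the diff above, the loss ("squared" or "absolute") is passed explicitly as a second argument.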
[GitHub] spark issue #21097: [SPARK-14682][ML] Provide evaluateEachIteration method o...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21097 LGTM. Merging with master. Thanks @WeichenXu123! Would you mind creating & linking a JIRA for the Python API update? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21274: [SPARK-24213][ML] Fix for Int id type for PowerIteration...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21274 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [MINOR][ML][DOC] Improved Naive Bayes user guide explanation
Repository: spark Updated Branches: refs/heads/master 6ea582e36 -> 94155d039 [MINOR][ML][DOC] Improved Naive Bayes user guide explanation ## What changes were proposed in this pull request? This copies the material from the spark.mllib user guide page for Naive Bayes to the spark.ml user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. Author: Joseph K. Bradley <jos...@databricks.com> Closes #21272 from jkbradley/nb-doc-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/94155d03 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/94155d03 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/94155d03 Branch: refs/heads/master Commit: 94155d0395324a012db2fc8a57edb3cd90b61e96 Parents: 6ea582e Author: Joseph K. Bradley <jos...@databricks.com> Authored: Wed May 9 10:34:57 2018 -0700 Committer: Joseph K. Bradley <jos...@databricks.com> Committed: Wed May 9 10:34:57 2018 -0700 -- docs/ml-classification-regression.md | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/94155d03/docs/ml-classification-regression.md -- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index d660655..b3d1090 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -455,11 +455,29 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat ## Naive Bayes [Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple -probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence -assumptions between the features. The `spark.ml` implementation currently supports both [multinomial -naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html) +probabilistic, multiclass classifiers based on applying Bayes' theorem with strong (naive) independence +assumptions between every pair of features. + +Naive Bayes can be trained very efficiently. With a single pass over the training data, +it computes the conditional probability distribution of each feature given each label. +For prediction, it applies Bayes' theorem to compute the conditional probability distribution +of each label given an observation. + +MLlib supports both [multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes) and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html). -More information can be found in the section on [Naive Bayes in MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib). + +*Input data*: +These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html). +Within that context, each observation is a document and each feature represents a term. +A feature's value is the frequency of the term (in multinomial Naive Bayes) or +a zero or one indicating whether the term was found in the document (in Bernoulli Naive Bayes). +Feature values must be *non-negative*. The model type is selected with an optional parameter +"multinomial" or "bernoulli" with "multinomial" as the default. +For document classification, the input feature vectors should usually be sparse vectors. +Since the training data is only used once, it is not necessary to cache it. 
+ +[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by +setting the parameter $\lambda$ (default to $1.0$). **Examples** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
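To make the guide text above concrete, here is a minimal spark.ml sketch (toy term-count features, assuming an active SparkSession `spark`; this is not the example snippet the guide itself links to):

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.linalg.Vectors

// Term-frequency features; values must be non-negative, as the guide notes.
val docs = spark.createDataFrame(Seq(
  (0.0, Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0))),
  (1.0, Vectors.sparse(4, Array(1, 2), Array(3.0, 1.0))),
  (0.0, Vectors.sparse(4, Array(0, 2), Array(2.0, 1.0)))
)).toDF("label", "features")

val nb = new NaiveBayes()
  .setModelType("multinomial") // "bernoulli" expects 0/1 term-presence features instead
  .setSmoothing(1.0)           // additive (Lidstone) smoothing parameter lambda
val model = nb.fit(docs)
model.transform(docs).select("label", "prediction", "probability").show()
```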
[GitHub] spark issue #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide explana...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21272 Thanks for the LGTM! Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187114172 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, --- End diff -- Let's add the more precise description from the old BoostingStrategy for this Param. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187113953 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, - * then learning will stop early (before [[numIterations]]). - * This parameter is ignored when run is used. + * then learning will stop early (before [[maxIter]]). + * This parameter is ignored when fit without validation is used. * (default = 1e-5) * @group param --- End diff -- Let's add a `@see` validationIndicatorCol --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187112582 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams * * Note: Marked as private and DeveloperApi since this may be made public in the future. */ -private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize { +private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize + with HasValidationIndicatorCol { - /* TODO: Add this doc when we add this param. SPARK-7132 - * Threshold for stopping early when runWithValidation is used. + /** + * Threshold for stopping early when fit with validation is used. * If the error rate on the validation input changes by less than the validationTol, - * then learning will stop early (before [[numIterations]]). - * This parameter is ignored when run is used. + * then learning will stop early (before [[maxIter]]). + * This parameter is ignored when fit without validation is used. * (default = 1e-5) --- End diff -- I forget why we chose 1e-5 (which is different from spark.mllib). What do you think about using 0.01 to match the sklearn docs here? http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html (I also checked xgboost, but they use a different approach based on x number of steps without improvement. We may want to add that at some point since it sounds more robust.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21218#discussion_r187115704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -278,6 +279,7 @@ class BisectingKMeans @Since("2.0.0") ( val summary = new BisectingKMeansSummary( model.transform(dataset), $(predictionCol), $(featuresCol), $(k)) model.setSummary(Some(summary)) +instr.logNamedValue("clusterSizes", summary.clusterSizes.mkString(", ")) --- End diff -- We log Params in JSON format, but we don't have a precedent for logging Array values. I'd be OK with ducking the issue by not logging this value, or with logging it as JSON. If the latter, then we could add a method Instrumentation.logNamedValue taking Array types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
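If we go the JSON route, a hedged sketch of what such an Array overload could look like; this is not the existing Instrumentation API, and logInfo is replaced with println so the snippet stands alone:

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Hypothetical Array overload: encode the values as a JSON array, matching how Params are logged.
def logNamedValue(name: String, value: Array[Long]): Unit = {
  val json = compact(render(name -> value.toSeq)) // e.g. {"clusterSizes":[3,5,2]}
  println(json)
}

logNamedValue("clusterSizes", Array(3L, 5L, 2L))
```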
[GitHub] spark issue #21270: [SPARK-24213][ML]Power Iteration Clustering in SparkML t...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21270 Thanks for the patch! I just commented on https://issues.apache.org/jira/browse/SPARK-24213 though and would like to replace this with https://github.com/apache/spark/pull/21274. Could you please close this issue and help with reviewing the other PR? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21274: [SPARK-24213][ML] Fix for Int id type for PowerIt...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21274 [SPARK-24213][ML] Fix for Int id type for PowerIterationClustering in spark.ml ## What changes were proposed in this pull request? PIC in spark.ml has tests for "id" type IntegerType, but those tests did not fully check the result. This patch: * fixes the unit test to break for the existing implementation * fixes the implementation ## How was this patch tested? Existing unit tests, with fixes You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark SPARK-24213 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21274.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21274 commit cd02df5a995ac7daa2e76dcc13c54b987265b6e4 Author: Joseph K. Bradley <joseph@...> Date: 2018-05-08T22:02:33Z fix for Int id type for PowerIterationClustering in spark.ml --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186865863 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -367,11 +367,31 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { test("model evaluateEachIteration") { val gbt = new GBTClassifier() + .setSeed(1L) .setMaxDepth(2) - .setMaxIter(2) -val model = gbt.fit(trainData.toDF) -val eval = model.evaluateEachIteration(validationData.toDF) -assert(Vectors.dense(eval) ~== Vectors.dense(1.7641, 1.8209) relTol 1E-3) + .setMaxIter(3) + .setLossType("logistic") +val model3 = gbt.fit(trainData.toDF) +val model1 = new GBTClassificationModel("gbt-cls-model-test1", + model3.trees.take(1), model3.treeWeights.take(1), model3.numFeatures, model3.numClasses) +val model2 = new GBTClassificationModel("gbt-cls-model-test2", + model3.trees.take(2), model3.treeWeights.take(2), model3.numFeatures, model3.numClasses) + +for (evalLossType <- GBTClassifier.supportedLossTypes) { --- End diff -- evalLossType is not used, so I'd remove this loop. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide explana...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21272 ![screen shot 2018-05-08 at 2 03 13 pm](https://user-images.githubusercontent.com/5084283/39783013-a1650846-52c8-11e8-8f15-42b93dd51168.png) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21272: [MINOR][ML][DOC] Improved Naive Bayes user guide ...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/21272 [MINOR][ML][DOC] Improved Naive Bayes user guide explanation ## What changes were proposed in this pull request? This copies the material from the spark.mllib user guide page for Naive Bayes to the spark.ml user guide page. I also improved the wording and organization slightly. ## How was this patch tested? Built docs locally. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark nb-doc-update Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21272.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21272 commit 987448d58e58748d8f52b9b6b2036964e8111a15 Author: Joseph K. Bradley <joseph@...> Date: 2018-05-08T20:38:37Z improved Naive Bayes user guide --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572477 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -497,6 +498,9 @@ private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasS @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0") def setStepSize(value: Double): this.type = set(stepSize, value) + /** @group setParam */ + def setValidationIndicatorCol(value: String): this.type = set(validationIndicatorCol, value) --- End diff -- Since we need this setter to be in the concrete classes for Java compatibility, don't bother putting it here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572374 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { assert(mostImportantFeature !== mostIF) } + test("runWithValidation stops early and performs better on a validation dataset") { +val validationIndicatorCol = "validationIndicator" +val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false)) +val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true)) + +val numIter = 20 +for (lossType <- GBTClassifier.supportedLossTypes) { + val gbt = new GBTClassifier() +.setSeed(123) +.setMaxDepth(2) +.setLossType(lossType) +.setMaxIter(numIter) + val modelWithoutValidation = gbt.fit(trainDF) + + gbt.setValidationIndicatorCol(validationIndicatorCol) + val modelWithValidation = gbt.fit(trainDF.union(validationDF)) + + // early stop + assert(modelWithValidation.numTrees < numIter) + + val (errorWithoutValidation, errorWithValidation) = { +val remappedRdd = validationData.map(x => new LabeledPoint(2 * x.label - 1, x.features)) +(GradientBoostedTrees.computeError(remappedRdd, modelWithoutValidation.trees, + modelWithoutValidation.treeWeights, modelWithoutValidation.getOldLossType), + GradientBoostedTrees.computeError(remappedRdd, modelWithValidation.trees, +modelWithValidation.treeWeights, modelWithValidation.getOldLossType)) + } + assert(errorWithValidation <= errorWithoutValidation) --- End diff -- It'd be nice to have this be strictly true. Is it not? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186573136 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") ( @Since("1.4.0") def setLossType(value: String): this.type = set(lossType, value) + /** @group setParam */ + @Since("2.4.0") + override def setValidationIndicatorCol(value: String): this.type = { +set(validationIndicatorCol, value) + } + override protected def train(dataset: Dataset[_]): GBTClassificationModel = { val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) + +val withValidation = isDefined(validationIndicatorCol) + // We copy and modify this from Classifier.extractLabeledPoints since GBT only supports // 2 classes now. This lets us provide a more precise error message. -val oldDataset: RDD[LabeledPoint] = +val convert2LabelPoint = (dataset: Dataset[_]) => { --- End diff -- nit: "LabelPoint" -> "LabeledPoint" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186569928 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest { assert(mostImportantFeature !== mostIF) } + test("runWithValidation stops early and performs better on a validation dataset") { +val validationIndicatorCol = "validationIndicator" +val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false)) +val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true)) + +val numIter = 20 +for (lossType <- GBTClassifier.supportedLossTypes) { + val gbt = new GBTClassifier() +.setSeed(123) +.setMaxDepth(2) +.setLossType(lossType) +.setMaxIter(numIter) + val modelWithoutValidation = gbt.fit(trainDF) + + gbt.setValidationIndicatorCol(validationIndicatorCol) + val modelWithValidation = gbt.fit(trainDF.union(validationDF)) + + // early stop + assert(modelWithValidation.numTrees < numIter) --- End diff -- Let's also assert `modelWithoutValidation.numTrees == numIter`. That's true now, but I could imagine it changing later on if we add a convergence tolerance to the algorithm. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572853 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") ( @Since("1.4.0") def setLossType(value: String): this.type = set(lossType, value) + /** @group setParam */ + @Since("2.4.0") + override def setValidationIndicatorCol(value: String): this.type = { +set(validationIndicatorCol, value) + } + override protected def train(dataset: Dataset[_]): GBTClassificationModel = { val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) + +val withValidation = isDefined(validationIndicatorCol) --- End diff -- Let's do: `isDefined(validationIndicatorCol) and $(validationIndicatorCol).nonEmpty` (so we count "" as not defined). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
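For reference, end-user usage of the new Param looks roughly like this sketch; the `train`/`valid` DataFrames with "label"/"features" columns and the "isVal" column name are assumptions, not part of the PR:

```scala
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def fitWithEarlyStopping(train: DataFrame, valid: DataFrame): GBTClassificationModel = {
  // Rows flagged true in the indicator column form the validation set used for early stopping;
  // with early stopping the returned model's numTrees can be smaller than maxIter.
  val tagged = train.withColumn("isVal", lit(false)).union(valid.withColumn("isVal", lit(true)))
  new GBTClassifier()
    .setMaxIter(100)
    .setValidationIndicatorCol("isVal")
    .fit(tagged)
}
```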
[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186565756 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala --- @@ -269,6 +269,20 @@ class GBTRegressionModel private[ml]( new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights) } + /** + * Method to compute error or loss for every iteration of gradient boosting. + * + * @param dataset Dataset for validation. --- End diff -- Add doc for "loss" arg, including what the options are --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan
Repository: spark Updated Branches: refs/heads/master f48bd6bdc -> 76ecd0950 [SPARK-20114][ML] spark.ml parity for sequential pattern mining - PrefixSpan ## What changes were proposed in this pull request? PrefixSpan API for spark.ml. New implementation instead of #20810 ## How was this patch tested? TestSuite added. Author: WeichenXuCloses #20973 from WeichenXu123/prefixSpan2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76ecd095 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76ecd095 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76ecd095 Branch: refs/heads/master Commit: 76ecd095024a658bf68e5db658e4416565b30c17 Parents: f48bd6b Author: WeichenXu Authored: Mon May 7 14:57:14 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 7 14:57:14 2018 -0700 -- .../org/apache/spark/ml/fpm/PrefixSpan.scala| 96 + .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 3 +- .../apache/spark/ml/fpm/PrefixSpanSuite.scala | 136 +++ 3 files changed, 233 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76ecd095/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala new file mode 100644 index 000..02168fe --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{ArrayType, LongType, StructField, StructType} + +/** + * :: Experimental :: + * A parallel PrefixSpan algorithm to mine frequent sequential patterns. + * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns + * Efficiently by Prefix-Projected Pattern Growth + * (see http://doi.org/10.1109/ICDE.2001.914830;>here). + * + * @see https://en.wikipedia.org/wiki/Sequential_Pattern_Mining;>Sequential Pattern Mining + * (Wikipedia) + */ +@Since("2.4.0") +@Experimental +object PrefixSpan { + + /** + * :: Experimental :: + * Finds the complete set of frequent sequential patterns in the input sequences of itemsets. 
+ * + * @param dataset A dataset or a dataframe containing a sequence column which is + *{{{Seq[Seq[_]]}}} type + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored + * @param minSupport the minimal support level of the sequential pattern, any pattern that + * appears more than (minSupport * size-of-the-dataset) times will be output + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). + * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the + * internal storage format) allowed in a projected database before + * local processing. If a projected database exceeds this size, another + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `freq: Long` + */ + @Since("2.4.0") + def findFrequentSequentialPatterns( + dataset:
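A minimal usage sketch of the static API added by this commit, assuming an active SparkSession `spark`; the toy sequences mirror the Python doctest elsewhere in this thread, and the maxLocalProjDBSize value is just an illustrative choice:

```scala
import org.apache.spark.ml.fpm.PrefixSpan

val sequences = spark.createDataFrame(Seq(
  Tuple1(Seq(Seq(1, 2), Seq(3))),
  Tuple1(Seq(Seq(1), Seq(3, 2), Seq(1, 2))),
  Tuple1(Seq(Seq(1, 2), Seq(5))),
  Tuple1(Seq(Seq(6)))
)).toDF("sequence")

val patterns = PrefixSpan.findFrequentSequentialPatterns(
  sequences, "sequence", minSupport = 0.5, maxPatternLength = 5, maxLocalProjDBSize = 32000000L)
patterns.show() // columns: sequence, freq
```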
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20973 Merging with master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20261: [SPARK-22885][ML][TEST] ML test for StructuredStreaming:...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20261 Merging with master. Thanks @WeichenXu123! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning
Repository: spark Updated Branches: refs/heads/master 1c9c5de95 -> f48bd6bdc [SPARK-22885][ML][TEST] ML test for StructuredStreaming: spark.ml.tuning ## What changes were proposed in this pull request? ML test for StructuredStreaming: spark.ml.tuning ## How was this patch tested? N/A Author: WeichenXuCloses #20261 from WeichenXu123/ml_stream_tuning_test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f48bd6bd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f48bd6bd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f48bd6bd Branch: refs/heads/master Commit: f48bd6bdc5aefd9ec43e2d0ee648d17add7ef554 Parents: 1c9c5de Author: WeichenXu Authored: Mon May 7 14:55:41 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 7 14:55:41 2018 -0700 -- .../apache/spark/ml/tuning/CrossValidatorSuite.scala | 15 +++ .../spark/ml/tuning/TrainValidationSplitSuite.scala | 15 +++ 2 files changed, 22 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f48bd6bd/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala index 15dade2..e6ee722 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala @@ -25,17 +25,17 @@ import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressio import org.apache.spark.ml.classification.LogisticRegressionSuite.generateLogisticInput import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, Evaluator, MulticlassClassificationEvaluator, RegressionEvaluator} import org.apache.spark.ml.feature.HashingTF -import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap import org.apache.spark.ml.param.shared.HasInputCol import org.apache.spark.ml.regression.LinearRegression -import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils} -import org.apache.spark.mllib.util.{LinearDataGenerator, MLlibTestSparkContext} +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest, MLTestingUtils} +import org.apache.spark.mllib.util.LinearDataGenerator import org.apache.spark.sql.Dataset import org.apache.spark.sql.types.StructType class CrossValidatorSuite - extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + extends MLTest with DefaultReadWriteTest { import testImplicits._ @@ -66,6 +66,13 @@ class CrossValidatorSuite assert(parent.getRegParam === 0.001) assert(parent.getMaxIter === 10) assert(cvModel.avgMetrics.length === lrParamMaps.length) + +val result = cvModel.transform(dataset).select("prediction").as[Double].collect() +testTransformerByGlobalCheckFunc[(Double, Vector)](dataset.toDF(), cvModel, "prediction") { + rows => +val result2 = rows.map(_.getDouble(0)) +assert(result === result2) +} } test("cross validation with linear regression") { http://git-wip-us.apache.org/repos/asf/spark/blob/f48bd6bd/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala index 9024342..cd76acf 100644 --- 
a/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala @@ -24,17 +24,17 @@ import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, OneVsRest} import org.apache.spark.ml.classification.LogisticRegressionSuite.generateLogisticInput import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, Evaluator, RegressionEvaluator} -import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap import org.apache.spark.ml.param.shared.HasInputCol import org.apache.spark.ml.regression.LinearRegression -import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, MLTestingUtils} -import org.apache.spark.mllib.util.{LinearDataGenerator, MLlibTestSparkContext} +import org.apache.spark.ml.util.{DefaultReadWriteTest,
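The pattern this commit adds is a batch-versus-streaming equivalence check: score the fitted CrossValidatorModel on the same rows once as a static DataFrame and once through a streaming source, then assert the predictions agree. Below is a self-contained sketch of that check written against public-ish APIs instead of the internal MLTest/testTransformerByGlobalCheckFunc helper the suite uses; the object name and the toy dataset are made up, and MemoryStream lives in Spark's internal org.apache.spark.sql.execution.streaming package, so treat this purely as a test-style sketch.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

object CrossValidatorStreamingCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cv-streaming-check").getOrCreate()
    import spark.implicits._

    // Tiny regression dataset, used both as a batch DataFrame and as streaming input.
    val rows: Seq[(Vector, Double)] = (0 until 20).map { i =>
      (Vectors.dense(i.toDouble, (i % 3).toDouble), 2.0 * i + 1.0)
    }
    val batchDF = rows.toDF("features", "label")

    // Fit a CrossValidatorModel, roughly as in the suite's linear regression test.
    val lr = new LinearRegression().setMaxIter(10)
    val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.001, 0.1)).build()
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new RegressionEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(2)
    val cvModel = cv.fit(batchDF)

    // Predictions from the batch path.
    val batchPreds = cvModel.transform(batchDF).select("prediction").as[Double].collect()

    // Same rows pushed through a streaming source and scored by the same fitted model.
    implicit val sqlContext = spark.sqlContext
    val stream = MemoryStream[(Vector, Double)]
    stream.addData(rows)
    val query = cvModel.transform(stream.toDF().toDF("features", "label"))
      .select("prediction")
      .writeStream.format("memory").queryName("cv_preds").outputMode("append").start()
    query.processAllAvailable()
    val streamPreds = spark.table("cv_preds").as[Double].collect()

    // The streaming path must reproduce the batch predictions.
    assert(batchPreds.sorted.sameElements(streamPreds.sorted))
    query.stop()
    spark.stop()
  }
}
```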
spark git commit: [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark
Repository: spark Updated Branches: refs/heads/master d83e96372 -> 56a52e0a5 [SPARK-15750][MLLIB][PYSPARK] Constructing FPGrowth fails when no numPartitions specified in pyspark ## What changes were proposed in this pull request? Change FPGrowth from private to private[spark]. If no numPartitions is specified, then the default value -1 is used. But -1 is only valid in the constructor of FPGrowth, not in setNumPartitions. So I make this change and use the constructor directly rather than using the set method. ## How was this patch tested? Unit test is added. Author: Jeff Zhang Closes #13493 from zjffdu/SPARK-15750. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56a52e0a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56a52e0a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56a52e0a Branch: refs/heads/master Commit: 56a52e0a58fc82ea69e47d0d8c4f905565be7c8b Parents: d83e963 Author: Jeff Zhang Authored: Mon May 7 14:47:58 2018 -0700 Committer: Joseph K. Bradley Committed: Mon May 7 14:47:58 2018 -0700 -- .../apache/spark/mllib/api/python/PythonMLLibAPI.scala | 5 + .../scala/org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- python/pyspark/mllib/tests.py | 12 3 files changed, 14 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56a52e0a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index b32d3f2..db3f074 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -572,10 +572,7 @@ private[python] class PythonMLLibAPI extends Serializable { data: JavaRDD[java.lang.Iterable[Any]], minSupport: Double, numPartitions: Int): FPGrowthModel[Any] = { -val fpg = new FPGrowth() - .setMinSupport(minSupport) - .setNumPartitions(numPartitions) - +val fpg = new FPGrowth(minSupport, numPartitions) val model = fpg.run(data.rdd.map(_.asScala.toArray)) new FPGrowthModelWrapper(model) } http://git-wip-us.apache.org/repos/asf/spark/blob/56a52e0a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala index f6b1143..4f2b7e6 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala @@ -162,7 +162,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] { * */ @Since("1.3.0") -class FPGrowth private ( +class FPGrowth private[spark] ( private var minSupport: Double, private var numPartitions: Int) extends Logging with Serializable { http://git-wip-us.apache.org/repos/asf/spark/blob/56a52e0a/python/pyspark/mllib/tests.py -- diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py index 14d788b..4c2ce13 100644 --- a/python/pyspark/mllib/tests.py +++ b/python/pyspark/mllib/tests.py @@ -57,6 +57,7 @@ from pyspark.mllib.linalg import Vector, SparseVector, DenseVector, VectorUDT, _ DenseMatrix, SparseMatrix, Vectors, Matrices, MatrixUDT from pyspark.mllib.linalg.distributed import RowMatrix from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD +from pyspark.mllib.fpm import FPGrowth from pyspark.mllib.recommendation 
import Rating from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD from pyspark.mllib.random import RandomRDDs @@ -1762,6 +1763,17 @@ class DimensionalityReductionTests(MLlibTestCase): self.assertEqualUpToSign(pcs.toArray()[:, k - 1], expected_pcs[:, k - 1]) +class FPGrowthTest(MLlibTestCase): + +def test_fpgrowth(self): +data = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]] +rdd = self.sc.parallelize(data, 2) +model1 = FPGrowth.train(rdd, 0.6, 2) +# use default data partition number when numPartitions is not specified +model2 = FPGrowth.train(rdd, 0.6) +self.assertEqual(sorted(model1.freqItemsets().collect()), + sorted(model2.freqItemsets().collect())) + if __name__ == "__main__": from pyspark.mllib.tests
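The root cause is an asymmetry in the mllib FPGrowth API: the default numPartitions is -1 (meaning "reuse the input RDD's partitioning"), which the constructor accepts, while setNumPartitions rejects non-positive values, so the Python wrapper failed whenever it forwarded that default through the setter. Below is a minimal illustrative sketch of the asymmetry and of the before/after call pattern; it uses a simplified stand-in class, not the actual Spark source, and the validation messages are made up.

```scala
// Simplified stand-in for mllib's FPGrowth, only to illustrate the asymmetry described above.
class FPGrowthLike(
    private var minSupport: Double,
    private var numPartitions: Int) {

  def this() = this(0.3, -1) // -1 means "use the input data's partition count"

  def setMinSupport(value: Double): this.type = {
    require(value >= 0.0 && value <= 1.0, s"minSupport must be in [0, 1] but got $value")
    minSupport = value
    this
  }

  def setNumPartitions(value: Int): this.type = {
    // The setter insists on a positive value, so the wrapper's default -1 cannot pass through it.
    require(value > 0, s"numPartitions must be positive but got $value")
    numPartitions = value
    this
  }

  override def toString: String = s"FPGrowthLike(minSupport=$minSupport, numPartitions=$numPartitions)"
}

object FPGrowthLike {
  def main(args: Array[String]): Unit = {
    // Post-fix pattern: call the constructor directly, which tolerates -1.
    println(new FPGrowthLike(0.6, -1))

    // Pre-fix pattern: forwarding the default -1 through the setter throws.
    try {
      new FPGrowthLike().setMinSupport(0.6).setNumPartitions(-1)
    } catch {
      case e: IllegalArgumentException => println(s"setter path failed as expected: ${e.getMessage}")
    }
  }
}
```

In Spark itself the real constructor was private, which is why the patch also widens it to private[spark] so that PythonMLLibAPI can call it directly.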
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/13493 Merging with master. Thanks all! @zjffdu Did you want to backport this to branch-2.3 too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21163: [SPARK-24097][ML] Instruments improvements - Rand...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21163#discussion_r185628460 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -22,12 +22,11 @@ import java.io.IOException import scala.collection.mutable import scala.util.Random -import org.apache.spark.internal.Logging import org.apache.spark.ml.classification.DecisionTreeClassificationModel import org.apache.spark.ml.feature.LabeledPoint import org.apache.spark.ml.regression.DecisionTreeRegressionModel import org.apache.spark.ml.tree._ -import org.apache.spark.ml.util.Instrumentation +import org.apache.spark.ml.util.{Instrumentation, OptionalInstrumentation} --- End diff -- Instrumentation is no longer needed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21163: [SPARK-24097][ML] Instruments improvements - RandomFores...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21163 Regarding logging on executors, are you OK with the proposed plan here? https://issues.apache.org/jira/browse/SPARK-23686 Basically, we'll keep using regular Logging on executors, rather than using Instrumentation on them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
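For context, the split being agreed on is: Instrumentation stays a driver-side, per-fit() construct, while code running inside executor tasks keeps using ordinary logging. The following rough sketch shows that shape; it uses slf4j directly because both Spark's Logging trait and Instrumentation are private[spark], and the object name and toy aggregation are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

// Rough sketch of the driver/executor logging split discussed above. In Spark ML the driver
// side would use Instrumentation (tied to one fit() call); plain slf4j stands in here.
object LoggingSplitSketch {
  private val log = LoggerFactory.getLogger("LoggingSplitSketch")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("logging-split").getOrCreate()
    val data = spark.sparkContext.parallelize((1 to 100).map(_.toDouble), numSlices = 4)

    // Driver side: this is where Instrumentation would record params, numExamples, etc.
    log.info(s"training on ${data.getNumPartitions} partitions")

    val (sum, count) = data.mapPartitionsWithIndex { (index, iter) =>
      // Executor side: regular logging only; Instrumentation objects are never shipped here.
      val taskLog = LoggerFactory.getLogger("LoggingSplitSketch.task")
      val values = iter.toArray
      taskLog.info(s"partition $index processed ${values.length} rows")
      Iterator.single((values.sum, values.length.toLong))
    }.reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

    log.info(s"mean over $count examples = ${sum / count}")
    spark.stop()
  }
}
```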
[GitHub] spark issue #20235: [Spark-22887][ML][TESTS][WIP] ML test for StructuredStre...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20235 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/13493 LGTM pending fresh tests. Sorry for the delay! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21185: [SPARK-23894][CORE][SQL] Defensively clear ActiveSession...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21185 There have been several of these R test failures. They may be from flakiness with CRAN; testing locally now (since I didn't see any recent bad commits in R). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21204#discussion_r185325228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -97,9 +97,10 @@ class DecisionTreeClassifier @Since("1.4.0") ( override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { +val instr = Instrumentation.create(this, dataset) val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) -val numClasses: Int = getNumClasses(dataset) +val numClasses: Int = getNumClasses(dataset, instr = OptionalInstrumentation.create(instr)) --- End diff -- Also see what else you can log easily (e.g., numFeatures) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21204#discussion_r185325179 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -97,9 +97,10 @@ class DecisionTreeClassifier @Since("1.4.0") ( override def setSeed(value: Long): this.type = set(seed, value) override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = { +val instr = Instrumentation.create(this, dataset) val categoricalFeatures: Map[Int, Int] = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) -val numClasses: Int = getNumClasses(dataset) +val numClasses: Int = getNumClasses(dataset, instr = OptionalInstrumentation.create(instr)) --- End diff -- In cases like this, you can use instr.logNumClasses() instead of relying on logging within the getNumClasses() method. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
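Concretely, the suggestion is to leave getNumClasses() with its plain logging and record the value from the driver through the Instrumentation instance instead. A fragment of how train() would look under that suggestion (not compilable standalone: Instrumentation is a private[spark] helper and the surrounding DecisionTreeClassifier members are assumed from the diff above):

```scala
// Fragment only: assumes the rest of DecisionTreeClassifier.train() stays as in the diff above.
override protected def train(dataset: Dataset[_]): DecisionTreeClassificationModel = {
  val instr = Instrumentation.create(this, dataset)
  val categoricalFeatures: Map[Int, Int] =
    MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))

  // Keep the original getNumClasses(dataset) (plain logging inside it is fine) and log the
  // value through the per-fit Instrumentation here on the driver, per the review comment.
  val numClasses: Int = getNumClasses(dataset)
  instr.logNumClasses(numClasses)
  // Other easily available values (e.g. numFeatures) can be logged the same way.

  // ... rest of train() unchanged ...
}
```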
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21204#discussion_r185324974 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala --- @@ -103,7 +103,10 @@ abstract class Classifier[ * @throws IllegalArgumentException if metadata does not specify numClasses, and the * actual numClasses exceeds maxNumClasses */ - protected def getNumClasses(dataset: Dataset[_], maxNumClasses: Int = 100): Int = { + protected def getNumClasses(dataset: Dataset[_], --- End diff -- Since we don't have Instrumentation readily available, I recommend using the old logging here. (That will also avoid this API-breaking change which is causing the MiMA failure.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20973 Rerunning tests in case the R CRAN failure was from flakiness --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21195 Rerunning tests in case the R CRAN failure was from flakiness --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20261: [SPARK-22885][ML][TEST] ML test for StructuredStreaming:...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20261 LGTM. Will merge after fresh tests. Thanks @WeichenXu123 and @smurakozi! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20319: [SPARK-22884][ML][TESTS] ML test for StructuredStreaming...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20319 @smurakozi Do you have time to update this? I did a full review, though it now has a small merge conflict. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20973: [SPARK-20114][ML] spark.ml parity for sequential pattern...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/20973 LGTM pending Jenkins tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r185262870 --- Diff: python/pyspark/util.py --- @@ -61,6 +62,26 @@ def _get_argspec(f): return argspec +def majorMinorVersion(version): --- End diff -- Yes please, a few things:
* throw exception when unable to parse
* make it a static method of a class pyspark.util.VersionUtils
* add unit test
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
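For reference, the Scala-side behaviour the Python helper is being asked to mirror (org.apache.spark.util.VersionUtils.majorMinorVersion) boils down to parsing the "major.minor" prefix of a version string and throwing IllegalArgumentException on anything unparseable. A small self-contained sketch of that behaviour follows; the regex and error message are illustrative, not copied from the Spark source.

```scala
// Sketch of the parsing contract the Python VersionUtils port should satisfy.
object VersionUtilsSketch {
  private val versionRegex = """^(\d+)\.(\d+)(\..*)?$""".r

  def majorMinorVersion(version: String): (Int, Int) = version match {
    case versionRegex(major, minor, _) => (major.toInt, minor.toInt)
    case _ =>
      throw new IllegalArgumentException(
        s"Could not find the major and minor version numbers in version string '$version'.")
  }

  def main(args: Array[String]): Unit = {
    assert(majorMinorVersion("2.4.0") == (2, 4))
    assert(majorMinorVersion("2.3.1-SNAPSHOT") == (2, 3))
    // Unparseable input throws, which is what the review asks the Python port to do as well.
    try { majorMinorVersion("not-a-version"); assert(false, "expected an exception") }
    catch { case _: IllegalArgumentException => () }
  }
}
```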
[GitHub] spark issue #21195: [Spark-23975][ML] Add support of array input for all clu...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/21195 add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21153#discussion_r185131834 --- Diff: python/pyspark/util.py --- @@ -61,6 +62,26 @@ def _get_argspec(f): return argspec +def majorMinorVersion(version): --- End diff -- Since this affects Spark core, it might be nice to put this in a separate PR. Also, shall we make this match the Scala API? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/20973#discussion_r185058005 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala --- @@ -44,26 +43,37 @@ object PrefixSpan { * * @param dataset A dataset or a dataframe containing a sequence column which is *{{{Seq[Seq[_]]}}} type - * @param sequenceCol the name of the sequence column in dataset + * @param sequenceCol the name of the sequence column in dataset, rows with nulls in this column + *are ignored * @param minSupport the minimal support level of the sequential pattern, any pattern that * appears more than (minSupport * size-of-the-dataset) times will be output - * (default: `0.1`). - * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears - * less than maxPatternLength will be output (default: `10`). + * (recommended value: `0.1`). + * @param maxPatternLength the maximal length of the sequential pattern + * (recommended value: `10`). * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the * internal storage format) allowed in a projected database before * local processing. If a projected database exceeds this size, another - * iteration of distributed prefix growth is run (default: `3200`). - * @return A dataframe that contains columns of sequence and corresponding frequency. + * iteration of distributed prefix growth is run + * (recommended value: `3200`). + * @return A `DataFrame` that contains columns of sequence and corresponding frequency. + * The schema of it will be: + * - `sequence: Seq[Seq[T]]` (T is the item type) + * - `frequency: Long` --- End diff -- I had asked for this change to "frequency" from "freq," but I belatedly realized that this conflicts with the existing FPGrowth API, which uses "freq." It would be best to maintain consistency. Would you mind reverting to "freq?" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
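For context on the consistency argument: the existing spark.ml FPGrowth already exposes its frequent itemsets with an "items" column and a "freq" count column, and that is the naming the new PrefixSpan output ("sequence" plus "freq") is being asked to follow. A small runnable illustration of the existing schema (the transactions and app name are made up):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

// Shows the existing ml FPGrowth output columns ("items", "freq") that the review asks
// the new PrefixSpan output ("sequence", "freq") to stay consistent with.
object FreqColumnNamingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("freq-naming").getOrCreate()
    import spark.implicits._

    val transactions = Seq(
      Seq("a", "b", "c"),
      Seq("a", "b"),
      Seq("a", "c")
    ).toDF("items")

    val model = new FPGrowth().setItemsCol("items").setMinSupport(0.5).fit(transactions)
    // Prints a schema with an `items` array column and a `freq` long column.
    model.freqItemsets.printSchema()
    model.freqItemsets.show()

    spark.stop()
  }
}
```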
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r184753197 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging { None } -val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndexInternal --- End diff -- Let's not use mapPartitionsWithIndexInternal; I don't think closure cleaning is expensive enough for us to worry about here. Use mapPartitionsWithIndex instead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
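The underlying pattern in this patch is deterministic per-partition seeding: each task builds its own RNG from the optimizer's seed combined with the partition index, so repeated runs over the same batch reproduce the same Gamma draws. Here is a small sketch of that pattern with mapPartitionsWithIndex, using scala.util.Random as a stand-in for the breeze RandBasis/MersenneTwister machinery in the actual change (object name and toy data are made up):

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object PartitionSeedingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("partition-seeding").getOrCreate()
    val data = spark.sparkContext.parallelize(1 to 8, numSlices = 4)

    // Each partition derives its own deterministic RNG from (base seed, partition index),
    // mirroring how the LDA patch seeds its per-partition Gamma sampler.
    def draw(baseSeed: Long): Array[Double] =
      data.mapPartitionsWithIndex { (index, docs) =>
        val rng = new Random(baseSeed + index)
        docs.map(_ => rng.nextGaussian())
      }.collect()

    // Same seed and same partitioning => identical draws across runs.
    assert(draw(42L).sameElements(draw(42L)))
    spark.stop()
  }
}
```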
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r185049467 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -605,14 +609,16 @@ private[clustering] object OnlineLDAOptimizer { expElogbeta: BDM[Double], alpha: breeze.linalg.Vector[Double], gammaShape: Double, - k: Int): (BDV[Double], BDM[Double], List[Int]) = { + k: Int, + seed: Long): (BDV[Double], BDM[Double], List[Int]) = { val (ids: List[Int], cts: Array[Double]) = termCounts match { case v: DenseVector => ((0 until v.size).toList, v.values) case v: SparseVector => (v.indices.toList, v.values) } // Initialize the variational distribution q(theta|gamma) for the mini-batch +val randBasis = new RandBasis(new org.apache.commons.math3.random.MersenneTwister(seed)) val gammad: BDV[Double] = - new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(k) // K + new Gamma(gammaShape, 1.0 / gammaShape)(randBasis).samplesVector(k) // K --- End diff -- nit: Note that the original spacing with the comment was intentional to match lines below. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r184768987 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala --- @@ -95,7 +95,9 @@ private[shared] object SharedParamsCodeGen { ParamDesc[String]("distanceMeasure", "The distance measure. Supported options: 'euclidean'" + " and 'cosine'", Some("org.apache.spark.mllib.clustering.DistanceMeasure.EUCLIDEAN"), isValid = "(value: String) => " + - "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)") + "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)"), + ParamDesc[String]("validationIndicatorCol", "the indicator column name for indicating " + --- End diff -- How about rephrasing the description: "name of the column that indicates whether each row is for training or for validation. False indicates training; true indicates validation." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
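The param under discussion is a boolean column on the training DataFrame telling fit() which rows to train on (false) and which to hold out for validation-based early stopping (true). A sketch of the intended caller-side usage, assuming the setValidationIndicatorCol setter this PR adds to the GBT estimators; the column name "isVal" and the synthetic data are made up for illustration:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object ValidationIndicatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("gbt-validation").getOrCreate()
    import spark.implicits._

    // Synthetic binary-classification data; "isVal" marks roughly 20% of rows as validation rows.
    val data = (0 until 200).map { i =>
      val label = if (i % 2 == 0) 0.0 else 1.0
      (Vectors.dense(i.toDouble, (i % 7).toDouble), label)
    }.toDF("features", "label")
      .withColumn("isVal", rand(0) > 0.8)

    val gbt = new GBTClassifier()
      .setMaxIter(20)
      .setValidationIndicatorCol("isVal") // false = train on the row, true = validate on it

    // fit() trains on the false rows and uses the true rows to decide when to stop adding trees.
    val model = gbt.fit(data)
    println(s"trees actually built: ${model.trees.length}")
    spark.stop()
  }
}
```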