[GitHub] spark pull request #20366: [SPARK-23166] [ML] Add maxDF Parameter to CountVe...
GitHub user ymazari opened a pull request:

    https://github.com/apache/spark/pull/20366

[SPARK-23166] [ML] Add maxDF Parameter to CountVectorizer

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ymazari/spark SPARK-23166

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20366.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20366

commit 568ea65ff32dce179097098b5d2934df20cac17c
Author: Yacine Mazari
Date: 2018-01-22T13:22:58Z

    [SPARK-23166] [ML] Add maxDF Parameter to CountVectorizer

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
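[Editor's note] For readers following the thread: the proposed `maxDF` parameter drops terms that appear in too many documents, mirroring the existing `minDF` lower bound. A minimal plain-Python sketch of that document-frequency filtering follows; the function name and the "values >= 1.0 are counts, values below 1.0 are fractions" convention are assumptions for illustration, not the PR's actual Scala code.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=1.0, max_df=float("inf")):
    """Build a vocabulary keeping terms whose document frequency lies in
    [min_df, max_df]. Values >= 1.0 are absolute document counts; values
    in [0, 1) are fractions of the corpus (assumed convention)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            df[term] += 1
    lo = min_df if min_df >= 1.0 else min_df * n_docs
    hi = max_df if max_df >= 1.0 else max_df * n_docs
    return sorted(t for t, c in df.items() if lo <= c <= hi)

# Same toy corpus as the suite below: df(a)=4, df(b)=3, df(c)=2, df(d)=1
docs = [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "b"], ["a"]]
vocab = fit_vocabulary(docs, max_df=3)  # "a" appears in all 4 docs and is dropped
```

With `max_df=3` (or equivalently the fraction `0.75` of 4 documents), only `"a"` is filtered out, matching the vocabulary `b, c, d` asserted in the test diff later in the thread.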
Github user ymazari closed the pull request at:

    https://github.com/apache/spark/pull/20366
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20366#discussion_r163353845

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---

    @@ -113,7 +132,7 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
       /** @group getParam */
       def getBinary: Boolean = $(binary)

    -  setDefault(vocabSize -> (1 << 18), minDF -> 1.0, minTF -> 1.0, binary -> false)
    +  setDefault(vocabSize -> (1 << 18), minDF -> 1.0, minTF -> 1.0, binary -> false, maxDF -> Long.MaxValue)

--- End diff --

What about not setting a default value and applying the filter only if it was set?
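[Editor's note] The reviewer's alternative — no default at all, filtering only when the parameter was explicitly set — can be sketched in plain Python with an optional parameter standing in for Spark's `isSet` check. Names here are illustrative, not the PR's API.

```python
def filter_vocab(df_counts, min_df=1, max_df=None):
    """Keep terms with document frequency >= min_df, and apply the upper
    bound only when it was explicitly set (max_df is not None), mirroring
    the suggestion of not giving maxDF a default value."""
    return {t: c for t, c in df_counts.items()
            if c >= min_df and (max_df is None or c <= max_df)}

counts = {"a": 4, "b": 3, "c": 2, "d": 1}
unbounded = filter_vocab(counts)           # no upper bound applied
bounded = filter_vocab(counts, max_df=3)   # "a" is dropped
```

By contrast, the PR's `maxDF -> Long.MaxValue` default makes the upper bound a no-op unless changed, which keeps the filter expression unconditional; both approaches yield the same results.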
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20366#discussion_r163353940

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---

    @@ -169,7 +197,7 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
         }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
           (wc1 + wc2, df1 + df2)
         }.filter { case (word, (wc, df)) =>
    -      df >= minDf
    +      (df >= minDf) && (df <= maxDf)

--- End diff --

nit: the parentheses are not needed
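[Editor's note] The `reduceByKey` step being discussed aggregates a `(wordCount, docFreq)` pair per term before the combined `minDf`/`maxDf` filter. A plain-dict analogue of that aggregation (a sketch, not the PR's distributed code):

```python
from collections import defaultdict

def term_stats(docs):
    """Aggregate (total word count, document frequency) per term --
    a single-machine analogue of the PR's reduceByKey over
    (wc, df) pairs."""
    stats = defaultdict(lambda: [0, 0])  # term -> [wc, df]
    for doc in docs:
        seen = set()
        for term in doc:
            stats[term][0] += 1          # every occurrence counts toward wc
            if term not in seen:
                stats[term][1] += 1      # each document counts once toward df
                seen.add(term)
    return stats

docs = [["a", "b", "b"], ["a"], ["c"]]
stats = term_stats(docs)                 # a -> [2, 2], b -> [2, 1], c -> [1, 1]
min_df, max_df = 1, 1
kept = sorted(t for t, (wc, df) in stats.items() if min_df <= df <= max_df)
# "a" (df=2) is excluded by the upper bound
```

The filter condition is exactly the patched line: keep a term only when `minDf <= df <= maxDf`.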
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20366#discussion_r163354161

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---

    @@ -119,6 +119,41 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
         }
       }

    +  test("CountVectorizer maxDF") {
    +    val df = Seq(
    +      (0, split("a b c d"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
    +      (1, split("a b c"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0)))),
    +      (2, split("a b"), Vectors.sparse(3, Seq((0, 1.0)))),
    +      (3, split("a"), Vectors.sparse(3, Seq()))
    +    ).toDF("id", "words", "expected")
    +
    +    // maxDF: ignore terms with count more than 3
    +    val cvModel = new CountVectorizer()
    +      .setInputCol("words")
    +      .setOutputCol("features")
    +      .setMaxDF(3)
    +      .fit(df)
    +    assert(cvModel.vocabulary === Array("b", "c", "d"))
    +
    +    cvModel.transform(df).select("features", "expected").collect().foreach {
    +      case Row(features: Vector, expected: Vector) =>
    +        assert(features ~== expected absTol 1e-14)
    +    }
    +
    +    // maxDF: ignore terms with freq > 0.75
    +    val cvModel2 = new CountVectorizer()
    +      .setInputCol("words")
    +      .setOutputCol("features")
    +      .setMaxDF(0.75)
    +      .fit(df)
    +    assert(cvModel2.vocabulary === Array("b", "c", "d"))
    +
    +    cvModel2.transform(df).select("features", "expected").collect().foreach {
    +      case Row(features: Vector, expected: Vector) =>
    +        assert(features ~== expected absTol 1e-14)
    +    }

--- End diff --

Could you please also add a UT to check that setting both `maxDF` and `minDF` works as expected?
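[Editor's note] The combined `minDF`/`maxDF` check the reviewer asks for can be expressed compactly on the suite's own toy corpus (document frequencies a=4, b=3, c=2, d=1). A hypothetical plain-Python version of that assertion, not the actual Scala unit test:

```python
from collections import Counter

docs = [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "b"], ["a"]]
df = Counter(t for doc in docs for t in set(doc))  # a=4, b=3, c=2, d=1

# Setting both bounds: minDF=2 drops the rare "d", maxDF=3 drops the
# ubiquitous "a", leaving only the mid-frequency terms.
min_df, max_df = 2, 3
vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df)
assert vocab == ["b", "c"]
```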
Github user ymazari commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20366#discussion_r163355088

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---

    @@ -169,7 +197,7 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") override val uid: String)
         }.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
           (wc1 + wc2, df1 + df2)
         }.filter { case (word, (wc, df)) =>
    -      df >= minDf
    +      (df >= minDf) && (df <= maxDf)

--- End diff --

Right. I just added them for clarity.
Github user ymazari commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20366#discussion_r163355218

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---

    (same hunk as above: the new "CountVectorizer maxDF" test in CountVectorizerSuite)

--- End diff --

I will.