[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18998 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r179903481 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) -// TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion. -val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML } +val hashUDF = udf { (terms: Seq[_]) => + val ids = terms.map { term => --- End diff -- @sethah Hi, thank all for your review and comments. However, since it has been a quite long time with no activity, is it a good idea to close the PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r172015693 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) -// TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion. -val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML } +val hashUDF = udf { (terms: Seq[_]) => + val ids = terms.map { term => --- End diff -- By the way, as the PR doesn't change the function's behavior, I think it should be correct if all tests pass, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r172015644 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) -// TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion. -val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML } +val hashUDF = udf { (terms: Seq[_]) => + val ids = terms.map { term => --- End diff -- Sorry for the delay. Yes, original implement use mutable map in mllib package, which is different with the PR. However, I think the PR is more clear and their logic are the same. Which one is faster in your test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r171425701 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) -// TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion. -val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML } +val hashUDF = udf { (terms: Seq[_]) => + val ids = terms.map { term => --- End diff -- Ok, I'm just wondering why you changed the code? The old one uses a mutable hashmap, which is different than the approach here (and is slower, in my tests). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r171412547 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) -// TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion. -val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML } +val hashUDF = udf { (terms: Seq[_]) => + val ids = terms.map { term => --- End diff -- Sorry, I can't remember all details exactly since the pr is too old. if my memory is correct, the ML implementation keep consistent with MLLIB (old one). As the "TODO" above said > Make the hashingTF.transform natively in ml framework to avoid extra conversion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/18998#discussion_r171025256 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema) -val hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) -// TODO: Make the hashingTF.transform natively in ml framework to avoid extra conversion. -val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML } +val hashUDF = udf { (terms: Seq[_]) => + val ids = terms.map { term => --- End diff -- Why did you implement this differently than the old one? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18998 [SPARK-21748][ML] Migrate the implementation of HashingTF from MLlib to ML ## What changes were proposed in this pull request? Migrate the implementation of HashingTF from MLlib to ML. ## How was this patch tested? + [ ] Pass all unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/migrate_hash_tf_to_ml Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18998.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18998 commit 6a25da2cc96a744aaf047280ac414e5ff4515434 Author: Yan Facai (é¢åæ) Date: 2017-08-19T02:24:14Z ENH: implement HashingTF in ml --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org