[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-04-25 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18998


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-04-06 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r179903481
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
-// TODO: Make the hashingTF.transform natively in ml framework to 
avoid extra conversion.
-val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+val hashUDF = udf { (terms: Seq[_]) =>
+  val ids = terms.map { term =>
--- End diff --

@sethah Hi, thank all for your review and comments. However, since it has 
been a quite long time with no activity, is it a good idea to close the PR?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-03-03 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r172015693
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
-// TODO: Make the hashingTF.transform natively in ml framework to 
avoid extra conversion.
-val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+val hashUDF = udf { (terms: Seq[_]) =>
+  val ids = terms.map { term =>
--- End diff --

By the way, as the PR doesn't change the function's behavior, I think it 
should be correct if all tests pass, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-03-03 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r172015644
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
-// TODO: Make the hashingTF.transform natively in ml framework to 
avoid extra conversion.
-val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+val hashUDF = udf { (terms: Seq[_]) =>
+  val ids = terms.map { term =>
--- End diff --

Sorry for the delay. Yes, original implement use mutable map in mllib 
package, which is different with the PR. However, I think the PR is more clear 
and their logic are the same. Which one is faster in your test?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-02-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r171425701
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
-// TODO: Make the hashingTF.transform natively in ml framework to 
avoid extra conversion.
-val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+val hashUDF = udf { (terms: Seq[_]) =>
+  val ids = terms.map { term =>
--- End diff --

Ok, I'm just wondering why you changed the code? The old one uses a mutable 
hashmap, which is different than the approach here  (and is slower, in my 
tests).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-02-28 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r171412547
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
-// TODO: Make the hashingTF.transform natively in ml framework to 
avoid extra conversion.
-val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+val hashUDF = udf { (terms: Seq[_]) =>
+  val ids = terms.map { term =>
--- End diff --

Sorry, I can't remember all details exactly since the pr is too old. if my 
memory is correct, the ML implementation keep consistent with MLLIB (old one). 
As the "TODO" above said
> Make the hashingTF.transform natively in ml framework to avoid extra 
conversion.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2018-02-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/18998#discussion_r171025256
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -93,11 +97,21 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema)
-val hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
-// TODO: Make the hashingTF.transform natively in ml framework to 
avoid extra conversion.
-val t = udf { terms: Seq[_] => hashingTF.transform(terms).asML }
+val hashUDF = udf { (terms: Seq[_]) =>
+  val ids = terms.map { term =>
--- End diff --

Why did you implement this differently than the old one?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18998: [SPARK-21748][ML] Migrate the implementation of H...

2017-08-18 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18998

[SPARK-21748][ML] Migrate the implementation of HashingTF from MLlib to ML

## What changes were proposed in this pull request?

Migrate the implementation of HashingTF from MLlib to ML.

## How was this patch tested?

+ [ ] Pass all unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/migrate_hash_tf_to_ml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18998


commit 6a25da2cc96a744aaf047280ac414e5ff4515434
Author: Yan Facai (颜发才) 
Date:   2017-08-19T02:24:14Z

ENH: implement HashingTF in ml




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org