Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-120480212
I think it'd be nice to have. Feel free to take code from that example.
The CountVectorizer PR or a later PR could modify the LDA example to use
CountVectorizer.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/7084
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-120078686
LGTM. Merging with master.
Thanks!
Github user hhbyyh commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-120196272
Thanks @jkbradley, I just want to know whether you are interested in
CountVectorizer. I assume it will be similar to the preprocessing in the LDA example.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-119906668
[Test build #36922 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36922/console)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-119906729
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-119897303
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-119897282
Merged build triggered.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-119897659
[Test build #36922 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36922/consoleFull)
for PR 7084 at commit
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-120135443
@hhbyyh Could you please make follow-up JIRAs?
* CountVectorizer (which does estimation)
* Python API
* documentation
Thanks!
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r34212260
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r34212256
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-119763057
@hhbyyh Thank you for the updates! Other than those 2 nits, it looks good.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118735791
[Test build #36560 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36560/console)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118735867
Merged build finished. Test PASSed.
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118662011
I agree we should add an Estimator version of CountVectorizer which first
fits on the data to build a dictionary. Because of that, maybe we should call
this PR's
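
For readers following the thread, the "fit on the data to build a dictionary" step can be sketched roughly as follows. This is a plain-Python illustration, not Spark code and not the implementation from this PR; the function name `fit_vocabulary`, the `vocab_size` parameter, and the frequency-then-alphabetical ordering are all assumptions made for the example.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=None):
    """Build a term -> index dictionary from tokenized documents.

    Hypothetical helper: ranks terms by total corpus count and keeps
    the top `vocab_size` terms (all terms when `vocab_size` is None).
    """
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    # Most frequent terms first; ties are broken alphabetically so the
    # resulting indices are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    if vocab_size is not None:
        ranked = ranked[:vocab_size]
    return {term: index for index, (term, _) in enumerate(ranked)}


corpus = [["a", "b", "c"], ["a", "b", "b"], ["a"]]
vocab = fit_vocabulary(corpus)
small_vocab = fit_vocabulary(corpus, vocab_size=2)
```

The fitted dictionary is what a model-style transformer would then take as fixed input, which is the split between estimator and model that the comment describes.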
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900057
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900062
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizorSuite.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900056
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900058
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900055
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900061
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizorSuite.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33900063
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizorSuite.scala ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118675030
That's all for a first pass!
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118725374
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118725388
Merged build started.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118726940
[Test build #36560 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36560/consoleFull)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118729997
Merged build finished. Test FAILed.
Github user hhbyyh commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118725060
Thanks a lot, @jkbradley. I sent an update with:
1. change the class name to CountVectorizerModel.
2. make vocab a val.
3. change minTermCount to minTermFreq and
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118724201
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-118724246
Merged build started.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33902864
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33902675
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33722239
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-117889791
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-117889782
Merged build triggered.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-117890043
[Test build #36329 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36329/consoleFull)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-117898009
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-117897863
[Test build #36329 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36329/console)
for PR 7084 at commit
Github user hhbyyh commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-117860374
Yes, that's the plan (an estimator). I know jkbradley has a similar
implementation in the LDA example. If Joseph is interested in migrating it here (
perhaps another
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116639479
[Test build #35982 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35982/consoleFull)
for PR 7084 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116652920
[Test build #35981 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35981/console)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116663360
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116638087
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116638157
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116652988
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116663308
[Test build #35982 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35982/console)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116628270
Merged build started.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116628338
[Test build #35981 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35981/consoleFull)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116628258
Merged build triggered.
GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/7084
[Spark-8703] [ML] Add CountVectorizer as a ml transformer to convert
document to words count vector
jira: https://issues.apache.org/jira/browse/SPARK-8703
Converts a text document to a
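
As a rough illustration of what a count-vector transformer with a fixed vocabulary does, here is a minimal plain-Python sketch. It is not the Scala code under review; `count_vector` and its `vocab` argument are hypothetical names, and silently ignoring out-of-vocabulary tokens is an assumption of the example.

```python
def count_vector(tokens, vocab):
    """Map a tokenized document to a dense list of term counts.

    `vocab` is a term -> index dictionary fixed ahead of time; tokens
    that are not in it are skipped (an assumption of this sketch).
    """
    counts = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:
            counts[index] += 1
    return counts


vocab = {"a": 0, "b": 1, "c": 2}
vector = count_vector(["a", "b", "b", "d"], vocab)
```

A real implementation would more likely emit a sparse vector, since most documents use only a small part of the vocabulary.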
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33527772
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33528713
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33536660
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116926783
Merged build triggered.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116926927
[Test build #36075 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36075/consoleFull)
for PR 7084 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116926792
Merged build started.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33536778
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116943301
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/7084#issuecomment-116942989
[Test build #36075 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36075/console)
for PR 7084 at commit
Github user feynmanliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/7084#discussion_r33542097
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software