[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-10 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-120480212 I think it'd be nice to have. Feel free to take code from that example. The CountVectorizer PR or a later PR could modify the LDA example to use CountVectorizer.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7084 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-120078686 LGTM. Merging with master. Thanks!

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread hhbyyh
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-120196272 Thanks @jkbradley, just want to know if you are interested in CountVectorizer. I assume it will be similar to the preprocessing in the LDA example.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-119906668 [Test build #36922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36922/console) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-119906729 Merged build finished. Test PASSed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-119897303 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-119897282 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-119897659 [Test build #36922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36922/consoleFull) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-120135443 @hhbyyh Could you please make follow-up JIRAs? * CountVectorizer (which does estimation) * Python API * documentation Thanks!

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r34212260 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala --- @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r34212256 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala --- @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-08 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-119763057 @hhbyyh Thank you for the updates! Other than those 2 nits, it looks good.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118735791 [Test build #36560 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36560/console) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118735867 Merged build finished. Test PASSed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118662011 I agree we should add an Estimator version of CountVectorizer which first fits on the data to build a dictionary. Because of that, maybe we should call this PR's
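
The "fits on the data to build a dictionary" step mentioned above could look roughly like the following. This is a plain-Python sketch for illustration only, not the Spark Estimator; `fit_vocabulary` and `vocab_size` are hypothetical names, and ranking terms by total corpus count is an assumption.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=None):
    """Build a vocabulary (list of terms) from tokenized documents.

    Hypothetical helper: terms are ranked by total count across the
    corpus, with ties broken alphabetically so the result is stable.
    """
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    terms = [term for term, _ in ranked]
    # Keep every term unless a size cap was requested.
    return terms if vocab_size is None else terms[:vocab_size]


corpus = [["a", "b", "a"], ["b", "c", "a"]]
vocab = fit_vocabulary(corpus)
small_vocab = fit_vocabulary(corpus, vocab_size=2)
```

The resulting list plays the role of the fixed vocabulary that a transformer-only model would otherwise need to be given up front.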

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900057 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900062 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizorSuite.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900056 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900058 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900055 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900061 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizorSuite.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33900063 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizorSuite.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118675030 That's all for a first pass!

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118725374 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118725388 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118726940 [Test build #36560 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36560/consoleFull) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118729997 Merged build finished. Test FAILed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread hhbyyh
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118725060 Thanks a lot @jkbradley. I sent an update with: 1. change the class name to CountVectorizerModel. 2. make vocab a val. 3. change minTermCount to minTermFreq and

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118724201 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-118724246 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33902864 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33902675 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33722239 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-117889791 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-117889782 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-117890043 [Test build #36329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36329/consoleFull) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-117898009 Merged build finished. Test PASSed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-117897863 [Test build #36329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36329/console) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-07-01 Thread hhbyyh
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-117860374 Yes, that's the plan (an estimator). And I know jkbradley has a similar implementation in the LDA example. If Joseph is interested in migrating it here (perhaps another

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116639479 [Test build #35982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35982/consoleFull) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116652920 [Test build #35981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35981/console) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116663360 Merged build finished. Test PASSed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116638087 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116638157 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116652988 Merged build finished. Test PASSed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116663308 [Test build #35982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35982/console) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116628270 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116628338 [Test build #35981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35981/consoleFull) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116628258 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread hhbyyh
GitHub user hhbyyh opened a pull request: https://github.com/apache/spark/pull/7084 [Spark-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector jira: https://issues.apache.org/jira/browse/SPARK-8703 Converts a text document to a
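
As a rough illustration of what converting a document to a word count vector means, here is a minimal plain-Python sketch. It is not the PR's Scala code; `count_vector` is a hypothetical helper, and both the sparse `{index: count}` output and the choice to skip tokens outside the vocabulary are assumptions made for the example.

```python
from collections import Counter


def count_vector(tokens, vocab):
    """Map a tokenized document to sparse term counts over a fixed vocabulary.

    Hypothetical helper: returns {vocabulary index: count}; tokens that
    are not in the vocabulary are ignored.
    """
    index = {term: i for i, term in enumerate(vocab)}
    counts = Counter(t for t in tokens if t in index)
    return {index[term]: n for term, n in counts.items()}


doc = ["a", "b", "a", "z"]
vec = count_vector(doc, ["a", "b", "c"])
```

Here "a" (index 0) is counted twice, "b" (index 1) once, and "z" is dropped because it is not in the vocabulary.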

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33527772 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33528713 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33536660 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116926783 Merged build triggered.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116926927 [Test build #36075 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36075/consoleFull) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116926792 Merged build started.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33536778 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116943301 Merged build finished. Test PASSed.

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7084#issuecomment-116942989 [Test build #36075 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36075/console) for PR 7084 at commit

[GitHub] spark pull request: [Spark-8703] [ML] Add CountVectorizer as a ml ...

2015-06-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7084#discussion_r33542097 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software