[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17673 @ngopal this one can't be merged as-is and looks like it was abandoned. Would you like to take this PR, update per reviews? I'd review that. I think CBOW could be useful in MLlib. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user ngopal commented on the issue: https://github.com/apache/spark/pull/17673 When can we anticipate this branch being merged?
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Can one of the admins verify this patch?
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17673 @shubhamchopra are you still working on this?
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17673 Jenkins OK to test.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 @hhbyyh Thanks for your suggestions. Will try to incorporate these in a day or so.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Merged build finished. Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82569/ Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #82569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82569/testReport)** for PR 17673 at commit [`9090b96`](https://github.com/apache/spark/commit/9090b967e03e43e3a709d9c2c94fe75de5b9a8e6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #82569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82569/testReport)** for PR 17673 at commit [`9090b96`](https://github.com/apache/spark/commit/9090b967e03e43e3a709d9c2c94fe75de5b9a8e6).
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Merged build finished. Test FAILed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82568/ Test FAILed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #82568 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82568/testReport)** for PR 17673 at commit [`236e4c1`](https://github.com/apache/spark/commit/236e4c1db2051f2f8a0435e753df3579afdfeb5e). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #82568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82568/testReport)** for PR 17673 at commit [`236e4c1`](https://github.com/apache/spark/commit/236e4c1db2051f2f8a0435e753df3579afdfeb5e).
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 Thanks for your comments/suggestions, @MLnick and @sethah. Working on incorporating these.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Merged build finished. Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82005/ Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #82005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82005/testReport)** for PR 17673 at commit [`64a5a6b`](https://github.com/apache/spark/commit/64a5a6b2b3cacedc82b24bde9347fee272b78849). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #82005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82005/testReport)** for PR 17673 at commit [`64a5a6b`](https://github.com/apache/spark/commit/64a5a6b2b3cacedc82b24bde9347fee272b78849).
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81320/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Merged build finished. Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #81320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81320/testReport)** for PR 17673 at commit [`361d79d`](https://github.com/apache/spark/commit/361d79ddeab78889cd5a0a63f21d1e446a7a34fd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #81320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81320/testReport)** for PR 17673 at commit [`361d79d`](https://github.com/apache/spark/commit/361d79ddeab78889cd5a0a63f21d1e446a7a34fd).
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81231/ Test FAILed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Merged build finished. Test FAILed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #81231 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81231/testReport)** for PR 17673 at commit [`948cc15`](https://github.com/apache/spark/commit/948cc15b67113b8ad74b67eeef13a39f55b7313a). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #81231 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81231/testReport)** for PR 17673 at commit [`948cc15`](https://github.com/apache/spark/commit/948cc15b67113b8ad74b67eeef13a39f55b7313a).
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Merged build finished. Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80243/ Test PASSed.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #80243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80243/testReport)** for PR 17673 at commit [`feda8dc`](https://github.com/apache/spark/commit/feda8dce8c2832bd1a3c61a84bfac9a23629866a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17673 **[Test build #80243 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80243/testReport)** for PR 17673 at commit [`feda8dc`](https://github.com/apache/spark/commit/feda8dce8c2832bd1a3c61a84bfac9a23629866a).
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17673 ok to test
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 Code-review comments/suggestions so far have been incorporated. Thanks for looking into the code. Happy to incorporate more suggestions and feedback.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 @MLnick I half expected that. No worries. In the meantime I have incorporated some of your feedback and added subsampling as well. Thanks for looking into the code.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17673 FYI, realistically there won't be bandwidth to really focus on this until after Spark 2.2 QA is done at the earliest.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user Krimit commented on the issue: https://github.com/apache/spark/pull/17673

Thanks for the detailed response @shubhamchopra. I'd like to clarify my point about whether this should be implemented in Spark: Spark MLlib is first and foremost a framework for doing ML on large datasets where other existing implementations (such as ``scikit-learn``) are impractical. A reality of ML is that increasing the size (and quality) of the training data is often much more important than tweaking model hyper-parameters. Therefore, as a community, I think our focus should be more on robustness than on "completeness". While having additional algorithms available for tuning can be helpful, I would personally be more interested in additions that offer significant and clear benefits (such as ``GloVe``, which should be much faster to train and a really good fit for Spark due to the natural parallelization of the problem).

With that said, I'm not opposed to adding CBOW, so long as we vet it. As part of having this merged in, I think ideally we should run an experiment on a large-ish dataset (Wikipedia?) comparing the two implementations.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673

@Krimit

_Can you provide some information about the practical differences between CBOW and skip-grams?_

![Model Architectures](https://cloud.githubusercontent.com/assets/6588487/25546610/d0f95aa8-2c31-11e7-8b47-4f9d31254f0f.png)

As mentioned in [this paper](https://arxiv.org/pdf/1301.3781.pdf), the CBOW model looks at the words around a target word and tries to predict the target word. SkipGram does just the opposite: given a target word, it tries to predict the context words around it. The prediction is done using a very simple neural network with a single hidden layer.

_Wikipedia quotes the author (I assume they mean Tomas) as saying that CBOW is faster while skip-gram is slower but does a better job for infrequent words. Has this been your experience as well? How pronounced is the difference?_

I found the current CBOW + Negative Sampling implementation to take almost the same time as the existing SkipGram + Hierarchical Softmax. The number of negative samples is tunable, and training will be slower for a higher number of negative samples.

_in what cases would a user choose one over the other? I'm basically seconding @hhbyyh's comment on a more in-depth comparison experiment._

There is a good amount of research around this with comparison experiments. It appears to depend largely on the application the embeddings would be used for. [Levy et al](http://www.aclweb.org/anthology/Q15-1016) show how different methods perform with extensive experiments. They used the embeddings to perform similarity, relatedness and other tests on some open datasets. [Mikolov et al](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) found SkipGram with Negative Sampling to outperform CBOW. [Baroni et al](http://anthology.aclweb.org/P/P14/P14-1023.pdf) found that CBOW had a slight advantage.

[Levy et al](http://www.aclweb.org/anthology/Q15-1016) explain that while CBOW did not perform as well in their experiments, others have shown that capturing joint contexts (CBOW does this) can improve performance on word similarity tasks. They also saw CBOW perform well in analogy tasks. So again, it depends on the task being performed. [Mikolov et al](https://arxiv.org/pdf/1309.4168.pdf) recommend using Skip-Gram when mono-lingual data is small and CBOW for larger datasets.

_The fact that the original paper has both implementations is not in itself enough of a reason for Spark to do the same, IMO_

This is an active area of research, and both methods generate embeddings that perform well on different tasks. As a library providing these implementations, I think the choice is best left to the user and the application the embeddings are being used for.
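To make the difference between the two architectures described above concrete, here is a minimal Python sketch of how training examples could be formed for each. It is an illustration only, not the code in this PR: the function names `cbow_examples` and `skipgram_examples` are hypothetical, and a fixed symmetric window is assumed (the original C implementation randomly shrinks the window at each position).

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) examples.

    CBOW predicts the target word from all of its surrounding context words,
    so each position in the sentence yields a single training example.
    """
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip one-word sentences, which have no context
            examples.append((context, target))
    return examples


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Build (target word, context word) examples.

    Skip-gram predicts each context word from the target word,
    so each position yields one example per context word.
    """
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                examples.append((target, tokens[j]))
    return examples


sentence = ["the", "quick", "brown", "fox"]
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)
print(cbow)
print(skip)
```

With the same sentence and window, skip-gram produces more (and smaller) training examples than CBOW, which is one intuition for why the two can differ in training cost and in how they treat infrequent words.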
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 @Krimit @MLnick @hhbyyh I am working on getting your earlier queries answered. @Krimit Thanks for looking into the code. I will try to get the code-review feedback incorporated in a couple of days or so.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user Krimit commented on the issue: https://github.com/apache/spark/pull/17673 @shubhamchopra have you run this code on a distributed Spark cluster yet?
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17673 I can maybe help out a bit in a week and a bit (I've also done some poking inside of Word2Vec) but I need to wrap up some travel and Python stuff first.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user Krimit commented on the issue: https://github.com/apache/spark/pull/17673

I'm happy to take a look! I'll have some time to dig in deeper tomorrow. Some of my initial impressions:

* There's a lot going on here. I agree with @hhbyyh that it would be cleaner to put the CBOW code in a new class.
* Can you provide some information about the practical differences between CBOW and skip-grams? Wikipedia quotes the author (I assume they mean Tomas) as saying that ``CBOW is faster while skip-gram is slower but does a better job for infrequent words``. Has this been your experience as well? How pronounced is the difference? In what cases would a user choose one over the other? I'm basically seconding @hhbyyh's comment on a more in-depth comparison experiment. The fact that the original paper has both implementations is not in itself enough of a reason for Spark to do the same, IMO.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17673 It would be ideal to have both methods, but I'm worried about reviewer bandwidth vs priority on this. @Krimit you were working on Word2Vec recently - thoughts? Perhaps you have time to help on review also?
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/17673 Thanks for working on this. I'm traveling right now, but maybe @MLnick has some bandwidth to look at this.
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user shubhamchopra commented on the issue: https://github.com/apache/spark/pull/17673 The [original paper](https://arxiv.org/abs/1301.3781) proposed two model architectures for generating word embeddings: the continuous Skip-gram model and the Continuous Bag-of-Words (CBOW) model. Spark ML currently implements only the Skip-gram model. This PR adds the CBOW model. The two models are alternatives to each other, so this implementation would give users the option to settle on whichever suits their data best. The implementation is based largely on the [original C implementation](https://code.google.com/archive/p/word2vec/). I implemented this using negative sampling, as that was shown to have good performance [here](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). I tried to vectorize operations using BLAS where possible. I don't understand what you mean by "MLP" implementation. Can you please clarify?
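For readers unfamiliar with the technique named in the comment above, here is a minimal sketch of a single CBOW training step with negative sampling. It is not the code from this PR and it is not vectorized with BLAS; it is plain Python written for illustration. The array names `syn0` and `syn1neg` are borrowed from the original C implementation; the function name, its signature, and the loss it returns are assumptions made for this example. As in the C code, the context vectors are averaged to form the hidden layer, and the accumulated hidden-layer gradient is added to every context word.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cbow_negative_sampling_step(syn0, syn1neg, context_ids, target_id, negative_ids, lr):
    """Apply one CBOW update in place and return the loss measured before it.

    syn0 holds the input (context) vectors, syn1neg the output vectors.
    This helper is hypothetical and only illustrates the update rule.
    """
    dim = len(syn0[0])
    # Hidden layer: average of the context word vectors.
    hidden = [sum(syn0[c][k] for c in context_ids) / len(context_ids) for k in range(dim)]
    grad_hidden = [0.0] * dim
    loss = 0.0
    # The true target gets label 1, every sampled negative gets label 0.
    samples = [(target_id, 1.0)] + [(n, 0.0) for n in negative_ids]
    for word_id, label in samples:
        out = syn1neg[word_id]
        score = sigmoid(sum(h * o for h, o in zip(hidden, out)))
        prob = score if label == 1.0 else 1.0 - score
        loss -= math.log(max(prob, 1e-12))
        g = lr * (label - score)
        for k in range(dim):
            # Accumulate the hidden-layer gradient before touching the output vector.
            grad_hidden[k] += g * out[k]
            out[k] += g * hidden[k]
    # Propagate the accumulated gradient back to every context word.
    for c in context_ids:
        for k in range(dim):
            syn0[c][k] += grad_hidden[k]
    return loss
```

Calling the function repeatedly on the same (context, target, negatives) triple should lower the returned loss, which is a quick sanity check for the update rule. Skip-gram differs in that each context word is predicted from the center word individually instead of from the averaged context.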
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17673 Thanks for sharing the work. To help make the review easier, I would recommend: 1. Provide some background info. Is the new algorithm better than the existing one, and in which cases? Compare it with other libraries or implementations of the algorithm. 2. Provide some description of your implementation: algorithm accuracy and scalability compared with the existing Word2Vec. Are there any known issues or limitations?
[GitHub] spark issue #17673: [SPARK-20372] [ML] Word2Vec Continuous Bag of Words mode...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17673 Can one of the admins verify this patch?