[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-02 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-65223123 @jkbradley, thank you for your comments! It seems like we should discuss the API for this set of models first. As far as I can understand, you are not about to

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66109601 (1) Users implementing their own regularizers: OK. I'd prefer to make all the regularizer methods `private[mllib]`. (2) Regular and Robust in the

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-08 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66127106 @jkbradley, could you please have a look at the logs? I have no idea why the PySpark tests failed.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66292305 @jkbradley Tests fail again... Stab in the dark: it looks like something has changed in the testing environment. (2) Regular and Robust in the same

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66299321 @chazchandler, thank you very much for your quick reply! It did the trick. Now I'm a bit confused about the ml/ folder. What's it for?

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66306112 It seems like something went wrong. I've got multiple compilation errors like ``` [error] /home/valerij/contribute/spark/core/src/main/scala/org/a

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66321327 @karlhigley, yes, I've heard something about abstract classes. However, I see no way to employ this concept here.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66325635 @chazchandler, thank you very much for your help. I shouldn't have rebased on master. Rebasing on 1.2 was successful.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66478011 Succeeded on the third attempt. (5) Enumerator @jkbradley, as you can see, I moved `Enumerator` to the `mllib/features` folder and renamed it to `TokenIndexer

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66498633 (5) Enumerator BTW, the names `TokenIndexer` and `TokenIndex` look confusing (though these classes rely on `breeze.util.Index`). So I renamed it to

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-11 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66613030 @jkbradley I moved Dirichlet to mllib/stats and added setters to `TokenEnumerator`. BTW, why was it decided to use setters instead of constructors? We

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/1269#discussion_r22003692 --- Diff: mllib/pom.xml --- @@ -112,6 +112,11 @@ test-jar test + +colt --- End diff -- In

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67399691 @jkbradley Thank you for the explanation about setters. The tm implementation was tested (it was successfully used in one of my projects), but it was tested with Scala 2.11

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67410274 ``` - filter pushdown - boolean *** FAILED *** (249 milliseconds)``` I have no idea why this could happen. Should I rebase again?

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67415235 What do you mean by scaling tests? Tests measuring the dependence of computation time on the number of machines? Are there scaling tests for the GraphX LDA implementations? Or

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
GitHub user akopich reopened a pull request: https://github.com/apache/spark/pull/1269 [SPARK-2199] [mllib] topic modeling I have implemented Probabilistic Latent Semantic Analysis (PLSA) and Robust PLSA with support for additive regularization (that actually means that I

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich closed the pull request at: https://github.com/apache/spark/pull/1269

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-18 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67493934 How do you compare accuracy? Perplexity means nothing but perplexity; topic models may be reliably compared only via an application task (e.g. classification

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67643630 I've performed a sanity check on the dataset I've described above. PLSA: the tm project obtains a perplexity of `2358` and this implementation ends up
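
For context, the perplexity of a PLSA-style model is commonly defined as below. This is a sketch of the standard textbook definition, not necessarily the exact normalization used in this PR or in the tm project. The symbols are notation assumed here: n_{dw} is the count of word w in document d, \phi_{wt} = p(w|t), and \theta_{td} = p(t|d).

```latex
\mathcal{P}
  = \exp\!\left(
      -\frac{1}{N} \sum_{d \in D} \sum_{w \in d}
        n_{dw} \,\ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
    \right),
\qquad
N = \sum_{d \in D} \sum_{w \in d} n_{dw}
```

Lower values mean the model assigns a higher likelihood to the corpus, which is why two implementations trained on the same dataset can be sanity-checked against each other this way.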

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67656496 And the tests fail again in an obscure manner...

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67661902 I've fixed the perplexity for Robust PLSA and updated the perplexity value in the comment above. Now they are almost the same.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67664969 By the way, maybe it's off-topic, but this is related to initial approximation generation. Suppose one has `indxs : RDD[Int]` and is about to create an R

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-78050367 @renchengchang 1. Hi. 2. Don't use code from this PR. Use either LDA (which is merged into mllib) or https://github.com/akopich/dplsa which is a fu

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-78184948 @renchengchang What do you mean by "topic vector"? A vector of p(t|d) \forall t? If so, you can find these vectors in `RDD[DocumentParameters]` which is r

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich closed the pull request at: https://github.com/apache/spark/pull/1269
