[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2018-02-03 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh who shall we ping? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2018-02-01 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh So, I guess, I should just roll the refactoring back, right?

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-13 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh, is there a cluster I can use for this?

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-12 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123 @jkbradley said, pings on Git don't work for him...

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-07 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 ping @WeichenXu123 , @srowen , @hhbyyh Further comments?

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-11-02 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r148517729 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-11-02 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r148507781 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-11-02 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r148506477 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Okay... any idea why tests failed? It says ```ERROR: Step ‘Publish JUnit test result report’ failed: No test report files were found. Configuration error

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, in the case of a large dataset this "adjustment" would have an infinitesimal effect. (IMO, no adjustment is needed -- the expected number of non-empty docs is the same and does

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh OK, but it returns almost the same number of elements. Anyway, the variance is going to be much smaller than in the case with sample before filter

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh Yes, in this way we don't change the semantics of `miniBatchFraction`. But is the way it is defined now actually correct? As I mentioned above, in `upstream/master` the number

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Ping @hhbyyh, @WeichenXu123, @srowen. Seems like the discussion is stuck. Does anybody think that the general approach implemented in this PR should be changed? Currently it is filtering

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-26 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r147230366 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-26 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r147229232 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh, in case of "filter before sample" in a local test the overhead is negligible. Regarding "sample before filter", you are right. There (strictly speaking)

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-26 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r147208156 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-26 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r147208062 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 And the empty docs were not explicitly filtered out. They've just been ignored in `submitMiniBatch`.

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 I'm saying they are not the same, but for larger datasets this should not matter. There is a change in logic. The hack with `val batchSize = (miniBatchFraction * corpusSize).ceil.toInt

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Consider the following scenario. Let `docs` be an RDD containing 1000 empty documents and 1000 non-empty documents and let `miniBatchFraction = 0.05`. Assume, we use `filter(...).sample
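
The scenario above, together with the `batchSize` expression quoted in the thread, can be checked with a little arithmetic. The following is only a sketch under stated assumptions: it assumes Bernoulli sampling in which every document is kept independently with probability `miniBatchFraction`, and the object and method names (`MiniBatchExpectation`, `filterThenSample`, `sampleThenFilter`, `nominalBatchSize`) are hypothetical and not taken from the PR.

```scala
object MiniBatchExpectation {
  // Expected number of non-empty documents in a mini-batch when empty
  // documents are removed first and the remainder is then sampled.
  def filterThenSample(nonEmpty: Long, fraction: Double): Double =
    fraction * nonEmpty

  // Expected number of non-empty documents when the whole corpus is
  // sampled first and empty documents are dropped afterwards.
  def sampleThenFilter(nonEmpty: Long, empty: Long, fraction: Double): Double = {
    val total = nonEmpty + empty
    val expectedSampled = fraction * total
    expectedSampled * nonEmpty.toDouble / total
  }

  // The expression quoted in the thread:
  // (miniBatchFraction * corpusSize).ceil.toInt
  def nominalBatchSize(corpusSize: Long, fraction: Double): Int =
    (fraction * corpusSize).ceil.toInt
}
```

With 1000 empty and 1000 non-empty documents and a fraction of 0.05, both orders give the same expected count of non-empty documents, while the nominal batch size depends on whether `corpusSize` includes the empty ones. The sketch says nothing about variance, which is the other point raised in the discussion.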

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, yes there indeed is a difference in logic. Eventually it boils down to semantics of `miniBatchFraction`. If it is a fraction of non-empty documents being sampled, the version

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 I am sure that caching may be avoided here. Hence, it should not be used. @srowen, maybe I don't get something, but I'm afraid that currently the lineage for a single mini-batch submission

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-25 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r147021726 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-25 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r146882424 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Or (and I think it would be the most efficient approach) we can just put the check for emptiness of the document into the `seqOp` of `treeAggregate`. However, it doesn't look like "filterin
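
The idea of checking for emptiness inside the `seqOp` can be illustrated without Spark. This is a minimal sketch, not the code from the PR: plain Scala collections stand in for RDD partitions, a fold per partition followed by a combine stands in for `treeAggregate`, and the names `SeqOpFilterSketch`, `Doc`, `Acc`, and `aggregate` are hypothetical.

```scala
object SeqOpFilterSketch {
  // A document as sparse (termIndex, count) pairs; an empty array is an empty document.
  type Doc = Array[(Int, Double)]

  // Accumulator: (sum of all term counts, number of non-empty documents seen).
  type Acc = (Double, Long)

  // seqOp: skip empty documents here instead of filtering them out up front.
  def seqOp(acc: Acc, doc: Doc): Acc =
    if (doc.isEmpty) acc
    else (acc._1 + doc.map(_._2).sum, acc._2 + 1L)

  // combOp: merge the accumulators produced for two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Local stand-in for treeAggregate: fold each "partition", then combine the results.
  def aggregate(partitions: Seq[Seq[Doc]]): Acc =
    partitions.map(_.foldLeft((0.0, 0L))(seqOp)).foldLeft((0.0, 0L))(combOp)
}
```

Because the accumulator also counts the non-empty documents, a separate `filter` or `count` pass over the mini-batch is not needed in this sketch.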

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Now I feel that filtering empty docs out in `initialize` is not a good idea, because it will be performed as many times as `sample` in `next` gets called. Right

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-25 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r146820166 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-25 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r146812501 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-25 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r146804206 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-24 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r146572407 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-24 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/19565#discussion_r146571987 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-24 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, @hhbyyh, looking forward to your opinion.

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

2017-10-24 Thread akopich
GitHub user akopich opened a pull request: https://github.com/apache/spark/pull/19565 [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter out empty documents beforehand ## What changes were proposed in this pull request? The empty documents are filtered out

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-18 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, no problem. @jkbradley, @WeichenXu123, @hhbyyh, thank you all guys!

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-18 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 ping @jkbradley. Anyway, the tests pass now.

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-18 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 retest this please

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-18 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, no problem. The test build seems to have been aborted. What's wrong?

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-12 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, no problem! Thank you.

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-11 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, yes sure. But can this wait until this PR is merged?

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-09 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, could you please notify @jkbradley once again?

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-06 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 So we shall ping @jkbradley, shan't we?

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-06 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143159334 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143084875 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,21 +533,22 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143084656 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143069049 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143066229 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143064794 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143060674 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143060537 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-05 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 I have conducted some performance testing with random data. The new implementation turns out to be notably faster. ``` OLD with hyper-parameter optimization : 237 sec OLD

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-05 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r143003890 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,54 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-05 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 Thank you, @hhbyyh. I have augmented the example a bit: explicitly set the random seed and chosen the online optimizer: `val lda = new LDA().setK(10).setMaxIter(10).setOptimizer("o

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-04 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, thank you! - Correctness: in order to test the equivalence of two versions of `submitMiniBatch` I have to bring both of them into the scope... One solution would be to derive

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-04 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 BTW, it seems that the `updateLambda` method relies (in the older version as well) on `batchSize` only because this is `an optimization to avoid batch.count`. Shouldn't we rather use `nonEmptyDocsN` instead
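
For context on the question above, the mini-batch update of online variational Bayes for LDA is usually written as below. This is a sketch of the textbook form, not a quotation of `updateLambda`; the symbols ($D$ for the corpus size, $|B|$ for the number of documents in the mini-batch, $\rho_t$ for the step size, $\eta$ for the topic prior, $\hat{s}_B$ for the summed sufficient statistics of the batch) are notation assumed here and do not appear in the thread.

```latex
\tilde{\lambda} = \eta + \frac{D}{|B|}\,\hat{s}_B,
\qquad
\lambda \leftarrow (1-\rho_t)\,\lambda + \rho_t\,\tilde{\lambda}
```

Empty documents contribute nothing to $\hat{s}_B$, so whether $|B|$ counts them changes the scale factor $D/|B|$. That appears to be the concern behind preferring `nonEmptyDocsN` to `batchSize`.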

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-04 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @hhbyyh, this change does not target performance but scalability, and I am afraid, the change is beneficial only for huge datasets and the tests would require massive computational resources

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-04 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, thank you.

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142632240 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142625490 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142624984 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142624246 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142624340 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142624093 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142622117 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-10-04 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r142620788 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-03 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, the PR seems to receive no attention for 10 days now... What should I do?

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-27 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, @jkbradley, talking of merging. Is there anything else I should improve in this PR in order for it to be mergeable

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-23 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, thanks for creating Jira. Yes, sure I will work on it.

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-23 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, thanks for the comments. Who is supposed to create the followup jira?

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-23 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140630215 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140199136 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140193380 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140183412 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-21 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140180799 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-20 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, thank you for your comments! Please, check out the commit adding the necessary docs. Regarding tests: I believe, `OnlineLDAOptimizer alpha hyperparameter optimization` from

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140032198 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-20 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r140031900 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,46 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-18 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @WeichenXu123, thank you for your prompt reply!

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-18 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r139514402 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-09-18 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/18924#discussion_r139514301 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-18 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 Ping @jkbradley. Thank you @WeichenXu123 once again for the comment! Please, have a look.

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-09-13 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 Yes, sure. Thank you for the valuable comment. Hopefully, I'll update the code this week.

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-08-22 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @feynmanliang , @hhbyyh, @WeichenXu123, could you please review the PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should n...

2017-08-11 Thread akopich
GitHub user akopich opened a pull request: https://github.com/apache/spark/pull/18924 [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver Hi, as it was proposed by Joseph K. Bradley, gammat are not collected

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-78050367 @renchengchang 1. Hi. 2. Don't use code from this PR. Use either LDA (which is merged with mllib) or https://github.com/akopich/dplsa which is a further

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-78184948 @renchengchang What do you mean by topic vector? A vector of p(t|d) \forall t? If so, you can find these vectors in `RDD[DocumentParameters]` which is returned

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-03-10 Thread akopich
Github user akopich closed the pull request at: https://github.com/apache/spark/pull/1269

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501440 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501548 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2015-01-12 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-69560296 @jkbradley, @mengxr, please, include @IlyaKozlov as author too. He's helped a lot with the implementation.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67643630 I've performed a sanity check on the dataset I've described above. PLSA: the tm project obtains a perplexity of `2358` and this implementation ends up with `2311

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67656496 And the tests fail again in an obscure manner...

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67661902 I've fixed the perplexity for Robust PLSA and updated the perplexity value in the comment above. Now they are almost the same.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-19 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67664969 By the way, maybe it's off-topic, but this is related to initial approximation generation. Suppose one has `indxs : RDD[Int]` and is about to create an RDD

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-18 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67493934 How do you compare accuracy? Perplexity means nothing but perplexity -- topic models may be reliably compared only via an application task (e.g. classification
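
For reference, the perplexity figures quoted in this thread are presumably computed with the usual held-out perplexity of a topic model. The formula below is the standard textbook definition, not something taken from the PR itself; the symbols `n_{dw}` (count of word `w` in document `d`) and `n_d` (length of document `d`) are assumed notation, while `p(t|d)` follows the notation used in the comments.

```latex
% Standard perplexity of a topic model on a collection D (assumed notation):
%   n_{dw} = number of occurrences of word w in document d
%   n_d    = \sum_{w} n_{dw}, the length of document d
\[
  \mathrm{Perplexity}(D)
    = \exp\!\left(
        - \frac{\sum_{d \in D} \sum_{w \in d} n_{dw} \,\ln p(w \mid d)}
               {\sum_{d \in D} n_d}
      \right),
  \qquad
  p(w \mid d) = \sum_{t} p(w \mid t)\, p(t \mid d).
\]
```

Lower values mean the model assigns higher probability to the observed words, which is why two implementations of the same model on the same data are expected to land close to each other.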

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on a diff in the pull request: https://github.com/apache/spark/pull/1269#discussion_r22003692 --- Diff: mllib/pom.xml --- @@ -112,6 +112,11 @@ <type>test-jar</type> <scope>test</scope> </dependency> +<dependency>

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67399691 @jkbradley Thank you for the explanation about setters. The tm implementation was tested (it was successfully used in one of my projects), but it was tested with Scala 2.11

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67410274 ``` - filter pushdown - boolean *** FAILED *** (249 milliseconds)``` I have no idea why this could happen. Should I rebase again?

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-67415235 What do you mean by scaling tests? Tests measuring the dependence of computation time on the number of machines? Are there scaling tests for GraphX LDA implementations

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-17 Thread akopich
GitHub user akopich reopened a pull request: https://github.com/apache/spark/pull/1269 [SPARK-2199] [mllib] topic modeling I have implemented Probabilistic Latent Semantic Analysis (PLSA) and Robust PLSA with support of additive regularization (that actually means that I've
