Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh who shall we ping?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh So, I guess, I should just roll the refactoring back, right?
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh, is there a cluster I can use for this?
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123
@jkbradley said that pings on GitHub don't work for him...
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
ping @WeichenXu123 , @srowen , @hhbyyh
Further comments?
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r148517729
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r148507781
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r148506477
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Okay... any idea why tests failed? It says
```
ERROR: Step "Publish JUnit test result report" failed: No test report
files were found. Configuration error
```
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123, in the case of a large dataset this "adjustment" would have
an infinitesimal effect. (IMO, no adjustment is needed -- the expected number of
non-empty docs is the same and does
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh OK, but it returns almost the same number of elements. Anyway, the
variance is going to be much smaller than in the case with sample before filter
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh
Yes, this way we don't change the semantics of `miniBatchFraction`. But is
the way it is defined now actually correct? As I mentioned above, in
`upstream/master` the number
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Ping @hhbyyh, @WeichenXu123, @srowen.
Seems like the discussion is stuck. Does anybody think that the general
approach implemented in this PR should be changed? Currently it is filtering
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147230366
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147229232
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh, in the case of "filter before sample", the overhead in a local test
is negligible.
Regarding "sample before filter", you are right. There (strictly speaking)
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147208156
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147208062
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
And the empty docs were not explicitly filtered out. They've just been
ignored in `submitMiniBatch`.
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
I'm saying they are not the same, but for larger datasets this should not
matter.
There is a change in logic. The hack with
`val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
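The `batchSize` computation quoted above can be illustrated standalone (a minimal sketch with made-up numbers; the object name and the `corpusSize`/`miniBatchFraction` values are hypothetical, not taken from the PR):

```scala
object BatchSizeDemo extends App {
  // With the ceil-based formula, the batch size is the mini-batch
  // fraction of the corpus, rounded up to a whole document count.
  val corpusSize = 2000L
  val miniBatchFraction = 0.05
  val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
  println(batchSize) // 100
}
```

Note that with "sample before filter" this count includes empty documents, which is exactly the semantic question debated in this thread.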
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Consider the following scenario. Let `docs` be an RDD containing 1000 empty
documents and 1000 non-empty documents, and let `miniBatchFraction = 0.05`.
Assume we use `filter(...).sample
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123, yes, there indeed is a difference in logic. Eventually it
boils down to the semantics of `miniBatchFraction`. If it is the fraction of
non-empty documents being sampled, the version
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
I am sure that caching can be avoided here. Hence, it should not be used.
@srowen, maybe I'm missing something, but I'm afraid that currently the
lineage for a single mini-batch submission
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147021726
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146882424
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Or (and I think it would be the most efficient approach) we can just stick
the check for emptiness of the document into the `seqOp` of `treeAggregate`.
However, it doesn't look like "filterin
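That `seqOp` idea can be sketched locally with a fold standing in for `treeAggregate`'s per-partition step (plain Scala; the document encoding and the stats are toy stand-ins, not the optimizer's real sufficient statistics):

```scala
object SeqOpSkipDemo extends App {
  // Documents as term-count arrays; an all-zero array is an empty document.
  val docs = Seq(
    Array(1, 0, 2),
    Array(0, 0, 0), // empty: should contribute nothing
    Array(0, 3, 1)
  )

  // seqOp-style accumulator: (per-term totals, non-empty document count).
  // The emptiness check lives inside the aggregation, so no separate
  // filter pass over the data is needed.
  val (termTotals, nonEmptyDocs) =
    docs.foldLeft((Array.fill(3)(0), 0)) { case ((totals, n), doc) =>
      if (doc.forall(_ == 0)) (totals, n)
      else (totals.zip(doc).map { case (a, b) => a + b }, n + 1)
    }

  println(termTotals.mkString(",")) // 1,3,3
  println(nonEmptyDocs)             // 2
}
```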
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Now I feel that filtering empty docs out in `initialize` is not a good
idea, because it will be performed as many times as `sample` in `next` gets
called. Right
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146820166
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146812501
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146804206
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146572407
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146571987
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123, @hhbyyh, looking forward to your opinion.
GitHub user akopich opened a pull request:
https://github.com/apache/spark/pull/19565
[SPARK-22111][MLLIB] OnlineLDAOptimizer should filter out empty documents
beforehand
## What changes were proposed in this pull request?
The empty documents are filtered out
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@jkbradley, no problem.
@jkbradley, @WeichenXu123, @hhbyyh, thank you all guys!
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
ping @jkbradley. Anyway, tests pass now.
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
retest this please
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@jkbradley, no problem. The test build seems to be aborted. What's wrong?
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, no problem! Thank you.
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, yes sure. But can this wait until this PR is merged?
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, could you please notify @jkbradley once again?
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
So we should ping @jkbradley, shouldn't we?
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143159334
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143084875
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,21 +533,22 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143084656
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143069049
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143066229
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143064794
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143060674
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143060537
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
I have conducted some performance testing with random data.
The new implementation turns out to be notably faster.
```
OLD with hyper-parameter optimization : 237 sec
OLD
```
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143003890
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,54 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
Thank you, @hhbyyh.
I have augmented the example a bit: explicitly set the random seed and chosen
the online optimizer:
`val lda = new
LDA().setK(10).setMaxIter(10).setOptimizer("o
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@jkbradley, thank you!
- Correctness: in order to test the equivalence of the two versions of
`submitMiniBatch` I have to bring both of them into scope... One solution
would be to derive
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
BTW, it seems like the `updateLambda` method relies (in the older version as
well) on `batchSize` only because this is `an optimization to avoid batch.count`.
Shouldn't we rather use `nonEmptyDocsN` instead
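In Hoffman-style online LDA the aggregated sufficient statistics are rescaled by corpus size over batch size before the lambda update, so the choice of denominator matters. A toy sketch of the two choices discussed here (all names and numbers are hypothetical, not taken from the PR):

```scala
object LambdaScaleDemo extends App {
  val corpusSize   = 1000.0
  val sampledDocs  = 100.0 // batch size including empty documents
  val nonEmptyDocs = 50.0  // only these actually contribute statistics
  val statSum      = 5.0   // toy aggregated sufficient statistic

  // Rescaling by the full batch vs. only its non-empty part gives
  // different effective magnitudes for the lambda update.
  val scaledByBatch    = statSum * corpusSize / sampledDocs  // 50.0
  val scaledByNonEmpty = statSum * corpusSize / nonEmptyDocs // 100.0
  println(s"$scaledByBatch vs $scaledByNonEmpty")
}
```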
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@hhbyyh, this change does not target performance but scalability, and I am
afraid the change is beneficial only for huge datasets; the tests would
require massive computational resources
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, thank you.
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142632240
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142625490
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142624984
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142624246
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142624340
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142624093
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142622117
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142620788
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, the PR seems to have received no attention for 10 days now...
What should I do?
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, @jkbradley, talking of merging. Is there anything else I
should improve in this PR in order for it to be mergeable
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, thanks for creating Jira. Yes, sure I will work on it.
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@jkbradley, thanks for the comments. Who is supposed to create the followup
jira?
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140630215
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140199136
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140193380
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140183412
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140180799
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@jkbradley, thank you for your comments! Please, check out the commit
adding the necessary docs.
Regarding tests: I believe, `OnlineLDAOptimizer alpha hyperparameter
optimization` from
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140032198
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140031900
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,46 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@WeichenXu123, thank you for your prompt reply!
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r139514402
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r139514301
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
Ping @jkbradley.
Thank you @WeichenXu123 once again for the comment! Please have a look.
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
Yes, sure. Thank you for the valuable comment. Hopefully, I'll update the
code this week.
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/18924
@feynmanliang , @hhbyyh, @WeichenXu123, could you please review the PR?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
GitHub user akopich opened a pull request:
https://github.com/apache/spark/pull/18924
[SPARK-14371] [MLLIB] OnlineLDAOptimizer should not collect stats for each
doc in mini-batch to driver
Hi,
as it was proposed by Joseph K. Bradley, gammat are not collected
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-78050367
@renchengchang
1. Hi.
2. Don't use the code from this PR. Use either LDA (which has been merged
into MLlib) or https://github.com/akopich/dplsa which is a further
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-78184948
@renchengchang
What do you mean by topic vector? A vector of p(t|d) \forall t? If so,
you can find these vectors in `RDD[DocumentParameters]` which is returned
Github user akopich closed the pull request at:
https://github.com/apache/spark/pull/1269
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501440
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501548
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-69560296
@jkbradley, @mengxr, please, include @IlyaKozlov as author too. He's helped
a lot with the implementation.
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67643630
I've performed a sanity check on the dataset I've described above.
PLSA: the tm project obtains a perplexity of `2358` and this implementation ends
up with `2311
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67656496
And the tests fail again in an obscure manner...
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67661902
I've fixed perplexity for robust PLSA and updated the perplexity value in the
comment above. Now they are almost the same.
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67664969
By the way, maybe it's off-topic, but this is related to initial
approximation generation.
Suppose one has `indxs : RDD[Int]` and is about to create an RDD
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67493934
How do you compare accuracy? Perplexity means nothing but perplexity --
topic models may be reliably compared only via an application task (e.g.
classification
Github user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/1269#discussion_r22003692
--- Diff: mllib/pom.xml ---
@@ -112,6 +112,11 @@
<type>test-jar</type>
<scope>test</scope>
</dependency>
+<dependency>
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67399691
@jkbradley Thank you for the explanation about setters.
The tm implementation was tested (it was successfully used in one of my
projects), but it was tested with Scala 2.11
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67410274
``` - filter pushdown - boolean *** FAILED *** (249 milliseconds)```
I have no idea why this could happen. Should I rebase again?
Github user akopich commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-67415235
What do you mean by scaling tests? Tests measuring the dependence of
computation time on the number of machines? Are there scaling tests for the
GraphX LDA implementations
GitHub user akopich reopened a pull request:
https://github.com/apache/spark/pull/1269
[SPARK-2199] [mllib] topic modeling
I have implemented Probabilistic Latent Semantic Analysis (PLSA) and Robust
PLSA with support for additive regularization (which actually means that I've