[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2018-02-03 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh who shall we ping? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2018-02-03 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 It's probably better to wait for the opinion from a committer. --- - To unsubscribe, e-mail:

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2018-02-01 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh So, I guess, I should just roll the refactoring back, right? --- - To unsubscribe, e-mail:

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-13 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh, is there a cluster I can use for this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-12 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123 @jkbradley said, pings on Git don't work for him... --- - To unsubscribe, e-mail:

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 ok I agree this change. @jkbradley Can you take a look ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-07 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 ping @WeichenXu123 , @srowen , @hhbyyh Further comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83330/ Test PASSed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Okay... any idea why tests failed? It says ```ERROR: Step ?Publish JUnit test result report? failed: No test report files were found. Configuration error?``` ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83301/ Test FAILed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, in a case of large dataset this "adjustment" would have infinitesimal effect. (IMO, no adjustment is needed -- the expected number of non-empty docs in the same and does not depend

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-11-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 Yes I think when dataset is large enough, using the same `miniBatchFraction`, the result RDD size of "filter before sample" and "filter after sample" will be asymptotically equal, no matter

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh OK, but it returns almost the same number of elements. Anyway, the variance is going to be much smaller that in the case with sample before filter. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 Let me know if I missed anything, but I don't quite catch the part > all the batches have the same length IMO `docs.sample(withReplacement = sampleWithReplacement,

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh Yes, in this way we don't change semantics of `miniBatchFraction`. But is the way it is defined now actually correct? As I mentioned above, in the `upstram/master` the number of

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 @akopich I'm actually leaning towards "filter after sample". 1. so we don't need to change `miniBatchFraction` in ` docs.sample(withReplacement = sampleWithReplacement,

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-31 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Ping @hhbyyh, @WeichenXu123, @srowen. Seems like the discussion is stuck. Does anybody think that the general approach implemented in this PR should be changed? Currently it is filtering

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83088/ Test PASSed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83088 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh, in case of "filter before sample" in a local test the overhead is negligible. Regarding "sample before filter", you are right. There (strictly speaking) should be adjustment of

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 I'm curious about the performance comparison, if "filter before sample" triggers a filter over the whole dataset for each `submitMiniBatch`, then there'll be some performance impact. And if

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 Yes, it changed the probability of samples indeed compared with current code. But according to the comments coming from @jkbradley in #18924 , "in order to make **corpusSize**, batchSize,

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 And the empty docs were not explicitly filtered out. They've just been ignored in `submitMiniBatch`. --- - To unsubscribe,

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 I'm saying they are not the same, but for larger datasets this should not matter. There is a change in logic. The hack with `val batchSize = (miniBatchFraction *

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 I agree, they're the same. You said at https://github.com/apache/spark/pull/19565#issuecomment-339638791 that they weren't. But if you're saying the code already filters out empty docs

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Consider the following scenario. Let `docs` be an RDD containing 1000 empty documents and 1000 non-empty documents and let `miniBatchFraction = 0.05`. Assume, we use

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 I assume not-selecting a record in a sample is cheaper than just about any other operation, including filtering on a predicate. All else equal, I'd rather sample, then evaluate a predicate on only

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, yes there indeed is a difference in logic. Eventually it boils down to semantics of `miniBatchFraction`. If it is a fraction of non-empty documents being sampled, the version with

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 Filtering after sampling makes more sense. Though sampling isn't deterministic, it doesn't change the probability that any particular sample is produced. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 @akopich IMO the filter won't cost too much, don't worry about the performance. (Or you can make a test to make sure) ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 I am sure that caching may by avoided here. Hence, it should not be used. @srowen, maybe I don't get something, but I'm afraid, that currently lineage for a single mini-batch submission

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-26 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 Regarding caching: I think that can be ignored for purposes of this change. All this does is add a filter, and it doesn't cause an RDD to computed more than it was before. The only question

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 @akopich If you want to cache the input dataset, create JIAR to discuss it first. It's another issue I think. This JIAR also related to input caching issues:

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread hhbyyh
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 I wonder if we should add cache() for lda training data, even not for this feature. @srowen Not sure where we're on caching the training data or not for different algorithms. Appreciate

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83058/ Test PASSed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83045/ Test PASSed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Or (and I think it would be the most efficient approach) we can just stick in the check for emptiness of the document to the `seqOp` of `treeAggregate`. However, it doesn't look like "filtering out

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Now I feel that filtering empty docs out in the `initialize` is not a good idea, because it will be performed as many times, as the number of times `sample` in `next` gets called. Right?

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83040/ Test PASSed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83015/ Test PASSed. ---

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)** for PR 19565 at commit

[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2017-10-24 Thread akopich
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, @hhbyyh, looking forward to your opinion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For