Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh who shall we ping?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
It's probably better to wait for the opinion from a committer.
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh So I guess I should just roll back the refactoring, right?
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh, is there a cluster I can use for this?
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123
@jkbradley said pings on GitHub don't work for him...
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
OK, I agree with this change. @jkbradley Can you take a look?
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
ping @WeichenXu123 , @srowen , @hhbyyh
Further comments?
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83330/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83330 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)**
for PR 19565 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83330 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)**
for PR 19565 at commit
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Okay... any idea why tests failed? It says
```ERROR: Step 'Publish JUnit test result report' failed: No test report
files were found. Configuration error?```
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83301/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test FAILed.
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123, in the case of a large dataset this "adjustment" would have
an infinitesimal effect. (IMO, no adjustment is needed -- the expected number of
non-empty docs is the same and does not depend
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
Yes, I think that when the dataset is large enough, using the same
`miniBatchFraction`, the resulting RDD sizes of "filter before sample" and "filter
after sample" will be asymptotically equal, no matter
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh OK, but it returns almost the same number of elements. Anyway, the
variance is going to be much smaller than in the case with sample before filter.
---
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
Let me know if I missed anything, but I don't quite catch the part
> all the batches have the same length
IMO
`docs.sample(withReplacement = sampleWithReplacement,
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh
Yes, this way we don't change the semantics of `miniBatchFraction`. But is
the way it is defined now actually correct? As I mentioned above, in
`upstream/master` the number of
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
@akopich I'm actually leaning towards "filter after sample".
1. so we don't need to change `miniBatchFraction` in
` docs.sample(withReplacement = sampleWithReplacement,
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Ping @hhbyyh, @WeichenXu123, @srowen.
Seems like the discussion is stuck. Does anybody think that the general
approach implemented in this PR should be changed? Currently it is filtering
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83088/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83088 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)**
for PR 19565 at commit
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@hhbyyh, in the case of "filter before sample", in a local test the overhead
is negligible.
Regarding "sample before filter", you are right. There (strictly speaking)
should be an adjustment of
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83088 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)**
for PR 19565 at commit
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
I'm curious about the performance comparison, if "filter before sample"
triggers a filter over the whole dataset for each `submitMiniBatch`, then
there'll be some performance impact.
And if
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
Yes, it does change the sampling probability compared with the current
code.
But according to the comments from @jkbradley in #18924, "in order
to make **corpusSize**, batchSize,
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
And the empty docs were not explicitly filtered out. They were just
ignored in `submitMiniBatch`.
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
I'm saying they are not the same, but for larger datasets this should not
matter.
There is a change in logic. The hack with
`val batchSize = (miniBatchFraction *
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19565
I agree, they're the same. You said at
https://github.com/apache/spark/pull/19565#issuecomment-339638791 that they
weren't.
But if you're saying the code already filters out empty docs
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Consider the following scenario. Let `docs` be an RDD containing 1000 empty
documents and 1000 non-empty documents, and let `miniBatchFraction = 0.05`.
Assume we use
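The scenario above can be checked with a small simulation using plain Scala
collections in place of RDDs (the names and sizes here are illustrative, not
the PR's code):

```scala
import scala.util.Random

// Illustrative corpus: 1000 empty and 1000 non-empty documents,
// each represented only by its token count.
val docs: Seq[Int] = Seq.fill(1000)(0) ++ Seq.fill(1000)(5)
val miniBatchFraction = 0.05
val rng = new Random(42)

// Per-element Bernoulli sampling, the without-replacement behaviour of
// RDD.sample(withReplacement = false, fraction).
def sample(xs: Seq[Int], fraction: Double): Seq[Int] =
  xs.filter(_ => rng.nextDouble() < fraction)

val trials = 2000

// "sample before filter": sample the full corpus, then drop empty docs.
val sampleThenFilter =
  Seq.fill(trials)(sample(docs, miniBatchFraction).count(_ > 0))

// "filter before sample": drop empty docs first, then sample.
val filterThenSample =
  Seq.fill(trials)(sample(docs.filter(_ > 0), miniBatchFraction).size)

// Under either ordering each of the 1000 non-empty docs survives
// independently with probability 0.05, i.e. Binomial(1000, 0.05),
// so both means land near 50.
val meanA = sampleThenFilter.sum.toDouble / trials
val meanB = filterThenSample.sum.toDouble / trials
println(f"sample-then-filter: $meanA%.2f, filter-then-sample: $meanB%.2f")
```

Under Bernoulli sampling the non-empty count has the same distribution either
way; the orderings differ in work done and in what `miniBatchFraction` is a
fraction of, not in this distribution.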
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19565
I assume not-selecting a record in a sample is cheaper than just about any
other operation, including filtering on a predicate. All else equal, I'd rather
sample, then evaluate a predicate on only
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123, yes, there is indeed a difference in logic. Eventually it
boils down to the semantics of `miniBatchFraction`. If it is the fraction of
non-empty documents being sampled, the version with
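To make the semantics question concrete, here is a small numeric sketch of the
two possible readings of `miniBatchFraction` (the sizes and variable names are
hypothetical, not taken from the PR):

```scala
// Hypothetical sizes: half the corpus is empty.
val corpusSize = 2000L
val nonEmptyCorpusSize = 1000L
val miniBatchFraction = 0.05

// Reading 1: miniBatchFraction is a fraction of the full corpus,
// empty docs included.
val batchSizeFull = (miniBatchFraction * corpusSize).ceil.toInt

// Reading 2: miniBatchFraction is a fraction of the non-empty documents,
// which is what "filter before sample" effectively implements.
val batchSizeNonEmpty = (miniBatchFraction * nonEmptyCorpusSize).ceil.toInt

println(s"full-corpus reading: $batchSizeFull, non-empty reading: $batchSizeNonEmpty")
```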
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19565
Filtering after sampling makes more sense. Though sampling isn't
deterministic, it doesn't change the probability that any particular sample is
produced.
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
@akopich IMO the filter won't cost too much; don't worry about the
performance. (Or you can run a test to make sure.)
---
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
I am sure that caching can be avoided here. Hence, it should not be used.
@srowen, maybe I'm missing something, but I'm afraid that currently the
lineage for a single mini-batch submission
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19565
Regarding caching: I think that can be ignored for the purposes of this change.
All this does is add a filter, and it doesn't cause an RDD to be computed more
than it was before.
The only question
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
@akopich If you want to cache the input dataset, create a JIRA to discuss it
first. It's another issue, I think. This JIRA is also related to input caching
issues:
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
I wonder if we should add cache() for the LDA training data, even if not for
this feature.
@srowen Not sure where we stand on caching the training data for
different algorithms. Appreciate
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83058/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83058 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)**
for PR 19565 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83058 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)**
for PR 19565 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83045/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83045 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)**
for PR 19565 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83045 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)**
for PR 19565 at commit
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Or (and I think it would be the most efficient approach) we can just stick
the check for document emptiness into the `seqOp` of `treeAggregate`.
However, it doesn't look like "filtering out
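A minimal local sketch of that idea, with `foldLeft` standing in for the
`seqOp` side of `treeAggregate` (the accumulated statistic is a placeholder
per-term sum, not the optimizer's real sufficient statistics):

```scala
// Documents as term-count vectors; a doc is empty when all counts are zero.
val batch: Seq[Array[Double]] = Seq(
  Array(1.0, 0.0, 2.0),
  Array(0.0, 0.0, 0.0), // empty doc: should contribute nothing
  Array(0.0, 3.0, 1.0)
)

val (stats, nonEmptyCount) =
  batch.foldLeft((Array.fill(3)(0.0), 0)) { case ((acc, n), doc) =>
    if (doc.forall(_ == 0.0)) (acc, n) // the emptiness check inside seqOp
    else (acc.zip(doc).map { case (a, d) => a + d }, n + 1)
  }

println(s"stats = ${stats.mkString(",")}, nonEmptyCount = $nonEmptyCount")
```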
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
Now I feel that filtering empty docs out in `initialize` is not a good
idea, because it will be performed as many times as `sample` in `next` gets
called. Right?
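The recomputation concern can be illustrated with a lazy Scala view, whose
`filter` predicate re-runs on every traversal, much as an uncached RDD's
lineage re-runs on every action (a local sketch, not Spark code):

```scala
var filterCalls = 0

// Ten docs by token count; every second one is empty.
val docs = (1 to 10).map(i => if (i % 2 == 0) 0 else i)

// A lazy view mimics RDD laziness: the filter is not evaluated here.
val nonEmpty = docs.view.filter { d => filterCalls += 1; d > 0 }

val firstBatch = nonEmpty.toList  // filter runs over all 10 docs
val secondBatch = nonEmpty.toList // ...and runs over all 10 again

println(s"filterCalls = $filterCalls") // 20, not 10
```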
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83040/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83040 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)**
for PR 19565 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83040 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)**
for PR 19565 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83015/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19565
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83015 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)**
for PR 19565 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19565
**[Test build #83015 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)**
for PR 19565 at commit
Github user akopich commented on the issue:
https://github.com/apache/spark/pull/19565
@WeichenXu123, @hhbyyh, looking forward to your opinion.
---