Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67619776
[Test build #24638 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24638/consoleFull)
for PR 3702 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67627757
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67627749
[Test build #24638 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24638/consoleFull)
for PR 3702 at commit
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67467841
Yes, just talking about oversampling now. In (1), if you mean `ceil(rdd.count / numBins)`, then yes, that's basically what I've got now. You won't quite get numBins back, yes.
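The grouping scheme being discussed here can be sketched in a few lines. This is a hypothetical illustration of the arithmetic (not Spark's actual implementation): chunk the data into groups of size `ceil(n / numBins)`, and note that, as the comment says, the number of groups you get back is not always exactly `numBins`.

```python
import math

def downsample(points, num_bins):
    """Group a sequence into chunks of size ceil(n / num_bins).

    Hypothetical sketch of the grouping discussed above, not the
    Spark code. The number of chunks returned can be slightly less
    than num_bins.
    """
    n = len(points)
    group_size = math.ceil(n / num_bins)
    return [points[i:i + group_size] for i in range(0, n, group_size)]

# 100 elements, numBins = 12: group size ceil(100/12) = 9, giving
# 12 groups (eleven of 9 plus one of 1) -- exactly 12 here.
twelve = downsample(list(range(100)), 12)

# 100 elements, numBins = 11: group size ceil(100/11) = 10, giving
# only 10 groups -- one fewer than requested.
eleven = downsample(list(range(100)), 11)
```

The second case shows why "you won't quite get numBins back": rounding the group size up can shave a bin off the total.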
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67534470
Right, that's what I meant to write for 1. Option 1 sounds good to me,
with a little documentation. Thanks a lot!
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67566900
[Test build #24603 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24603/consoleFull)
for PR 3702 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67574804
[Test build #24603 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24603/consoleFull)
for PR 3702 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67574815
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67584027
[Test build #24613 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24613/consoleFull)
for PR 3702 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-6750
[Test build #24613 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24613/consoleFull)
for PR 3702 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-6758
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67318108
@jkbradley Hm, I'm wondering whether it's even worth the time to count the
partition sizes. The number of bins is intended to be large relative to the
number of
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67376219
+1 for not bothering to count partition sizes. I'd say it's your call
about whether to oversample or not (to allow a little more precision), as long
as we document
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67417721
Hm, I might be missing your point, but if we just take every nth point, then the number of points taken from each partition will be correct to +/- 1 already. You get sample a
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67428833
We're in agreement. My earlier statement, "The simplistic approach should never be off by more than numPartitions," meant that the total count would never be off by more
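The every-nth-point scheme these two comments are agreeing on can be sketched as follows (a hypothetical illustration, not the PR's code). Each partition of size p contributes roughly p / n elements, off by at most 1, so the total sample size deviates from count / n by at most the number of partitions:

```python
def sample_every_nth(partitions, n):
    """Take every nth element independently within each partition.

    Each partition of size p contributes ceil(p / n) elements, which
    differs from p / n by less than 1, so the total sample size is
    within numPartitions of total_count / n.
    """
    return [part[::n] for part in partitions]

# 100 elements in 10 partitions, sampling every 8th point:
parts = [list(range(10)) for _ in range(10)]
sampled = sample_every_nth(parts, 8)
total = sum(len(p) for p in sampled)
# Each partition yields 2 points (indices 0 and 8), so total = 20,
# versus the exact 100/8 = 12.5 -- off by 7.5, which is indeed less
# than the 10 partitions.
```

This matches the claim above: the simplistic per-partition approach is never off by more than numPartitions in total.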
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67152810
@jkbradley Yes let's do `numBins`, I'm changing it now. Yeah, say you have
100 elements in 10 partitions, and want to sample down to 12. That means
sampling about every
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67223829
Yep, that's what I meant. I think it would be extra code, but I don't
think it would affect the runtime that much. (One pass to collect the number
of elements in
GitHub user srowen opened a pull request:
https://github.com/apache/spark/pull/3702
SPARK-4547 [MLLIB] [WIP] OOM when making bins in BinaryClassificationMetrics
Now that I've implemented the basics here, I'm less convinced there is a
need for this change, somehow. Callers can
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67019824
[Test build #24461 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24461/consoleFull)
for PR 3702 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67030795
[Test build #24461 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24461/consoleFull)
for PR 3702 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67030807
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67089687
@srowen +1 for this functionality. It sounds handy for experts and
necessary for beginner users.
What do you think of using ```numBins``` instead of
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67106963
@jkbradley Sure, well my thinking was that there is a nice straightforward
approach based on sampling every Nth point, so the natural thing is to add a
parameter for this
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67117735
@srowen Trying to guarantee exactly the requested number of points does
seem like more trouble than it is worth. It might require collecting the # of
points in each