[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67619776 [Test build #24638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24638/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67627757 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67627749 [Test build #24638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24638/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67467841 Yes, just talking about oversampling now. In 1, if you mean ceil(rdd.count / numBins) then yes that's basically what I've got now. You won't quite get numBins back, yes.
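
A rough illustration of the chunk-size idea mentioned above, in plain Python rather than Spark. It assumes `ceil(count / numBins)` is used as a per-partition chunk size and that one point is kept per chunk. The function name `downsample`, the list-of-lists stand-in for partitions, and the choice to keep the first item of each chunk are hypothetical and are not taken from the PR.

```python
import math

def downsample(partitions, num_bins):
    """Keep one item per chunk of `grouping` consecutive items in each partition.

    `partitions` is a list of lists standing in for RDD partitions (an assumption
    made for illustration; no Spark API is used here).
    """
    count = sum(len(p) for p in partitions)
    # Chunk size as described in the comment: ceil(count / numBins).
    grouping = math.ceil(count / num_bins)
    if grouping < 2:
        # Nothing to gain from chunks of size 1; return the data unchanged.
        return [list(p) for p in partitions]
    # Chunking happens independently inside each partition, so a short
    # trailing chunk in every partition still contributes one point.
    return [[p[i] for i in range(0, len(p), grouping)] for p in partitions]

# 100 elements in 10 equal partitions, asking for about 12 bins.
partitions = [list(range(i * 10, (i + 1) * 10)) for i in range(10)]
sampled = downsample(partitions, 12)
total = sum(len(p) for p in sampled)
```

With these inputs the chunk size is 9, so every 10-element partition yields two points and `total` is 20 rather than 12. This is one way to see why the result does not come back as exactly `numBins`.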

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67534470 Right, that's what I meant to write for 1. Option 1 sounds good to me, with a little documentation. Thanks a lot!

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67566900 [Test build #24603 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24603/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67574804 [Test build #24603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24603/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67574815 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67584027 [Test build #24613 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24613/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-6750 [Test build #24613 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24613/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-6758 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-17 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67318108 @jkbradley Hm, I'm wondering whether it's even worth the time to count the partition sizes. The number of bins is intended to be large relative to the number of

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-17 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67376219 +1 for not bothering to count partition sizes. I'd say it's your call about whether to oversample or not (to allow a little more precision), as long as we document

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-17 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67417721 Hm, I might be missing your point, but if just taking every nth point, then the number of points taken from each partition will be correct to +/- 1 already. You get sample a

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-17 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67428833 We're in agreement. My earlier statement "The simplistic approach should never be off by more than numPartitions." meant that the total count would never be off by more
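
The bound discussed here can be checked with a small plain-Python sketch. It assumes "every nth point" means keeping the first point of each run of `step` consecutive points inside a partition, which keeps `ceil(n / step)` points from a partition of size `n`. The helper name `kept_per_partition` and the partition sizes are made up for illustration.

```python
import math

def kept_per_partition(partition_sizes, step):
    """Points kept from each partition when taking every `step`-th point,
    starting with the first point of the partition."""
    return [math.ceil(n / step) for n in partition_sizes]

# Hypothetical uneven partition sizes summing to 100.
sizes = [7, 13, 1, 22, 9, 18, 5, 11, 3, 11]
step = 9

kept = kept_per_partition(sizes, step)
# What a single pass over all 100 points, ignoring partition boundaries, would keep.
ideal = math.ceil(sum(sizes) / step)
# Each partition can add at most one extra point from its short trailing run,
# so the excess stays below the number of partitions.
excess = sum(kept) - ideal
```

Here `sum(kept)` is 16 against an `ideal` of 12, an excess of 4 across 10 partitions. In general, a sum of ceilings exceeds the ceiling of the sum by at most the number of terms minus one, which is consistent with the total never being off by more than the number of partitions under these assumptions.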

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-16 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67152810 @jkbradley Yes let's do `numBins`, I'm changing it now. Yeah, say you have 100 elements in 10 partitions, and want to sample down to 12. That means sampling about every

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-16 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67223829 Yep, that's what I meant. I think it would be extra code, but I don't think it would affect the runtime that much. (One pass to collect the number of elements in

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread srowen
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/3702 SPARK-4547 [MLLIB] [WIP] OOM when making bins in BinaryClassificationMetrics Now that I've implemented the basics here, I'm less convinced there is a need for this change, somehow. Callers can

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67019824 [Test build #24461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24461/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67030795 [Test build #24461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24461/consoleFull) for PR 3702 at commit

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67030807 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67089687 @srowen +1 for this functionality. It sounds handy for experts and necessary for beginner users. What do you think of using `numBins` instead of

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67106963 @jkbradley Sure, well my thinking was that there is a nice straightforward approach based on sampling every Nth point, so the natural thing is to add a parameter for this

[GitHub] spark pull request: SPARK-4547 [MLLIB] [WIP] OOM when making bins ...

2014-12-15 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3702#issuecomment-67117735 @srowen Trying to guarantee exactly the requested number of points does seem like more trouble than it is worth. It might require collecting the # of points in each