[GitHub] spark pull request: [Spark] RDD take() method: overestimate too mu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2648 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58947875 I just realized we didn't have a JIRA for this. Let's make sure we create a JIRA ticket to track updates. Thanks.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58947762 Merging in master. Thanks!
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58925491 @yingjieMiao it looks good to me, waiting for other people.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58912546 @davies OK to merge?
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58600623 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21558/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58600620 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21558/consoleFull) for PR 2648 at commit [`d758218`](https://github.com/apache/spark/commit/d758218e286876877f31f9c00e0eb06ab8617e73).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58595846 Jenkins, test this please.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58581878 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/314/consoleFull) for PR 2648 at commit [`a8e74bb`](https://github.com/apache/spark/commit/a8e74bb0015cde80877aa7ce839f1bb4446bce80).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58581751 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/314/consoleFull) for PR 2648 at commit [`a8e74bb`](https://github.com/apache/spark/commit/a8e74bb0015cde80877aa7ce839f1bb4446bce80).
* This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58572877 Can one of the admins verify this patch?
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58566238 @davies fixed. thank you.
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58559393 It failed the Python style check:
```
PEP 8 checks failed.
./python/pyspark/rdd.py:1077:21: E265 block comment should start with '# '
./python/pyspark/rdd.py:1078:101: E501 line too long (101 > 100 characters)
```
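The two warnings above are mechanical checks, so they are easy to illustrate. As a sketch only (Spark's actual lint run uses the pep8 tool, not this code), a toy checker for just these two rules might look like:

```python
# Toy checker for the two PEP 8 rules reported above (E265 and E501).
# Illustration only; this is not Spark's lint tooling.

def check_line(line, max_len=100):
    """Return the PEP 8 error codes that `line` would trigger."""
    errors = []
    stripped = line.lstrip()
    # E265: a block comment must be '#' followed by a space (shebangs excluded).
    if stripped.startswith("#") and not stripped.startswith("#!"):
        if stripped != "#" and not stripped.startswith("# "):
            errors.append("E265")
    # E501: line longer than the configured maximum (Spark uses 100 chars).
    if len(line) > max_len:
        errors.append("E501")
    return errors

print(check_line("    #comment without a space"))  # ['E265']
print(check_line("x = 1"))                         # []
```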
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58556305 @davies style fixed. thanks!
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18664325
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = partsScanned * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (partsScanned * 4)
```
--- End diff --
thank you for the pointer. Will fix!
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18663950
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = partsScanned * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (partsScanned * 4)
```
--- End diff --
Infix Methods: Don't use infix notation for methods that aren't operators. For example, instead of `list map func`, use `list.map(func)`, or instead of `string contains "foo"`, use `string.contains("foo")`. This is to improve familiarity for developers coming from other languages. https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58551768 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/302/consoleFull) for PR 2648 at commit [`4391d3b`](https://github.com/apache/spark/commit/4391d3bc2f20fda20c0c14f14821ead8338c2435).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58551594 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/302/consoleFull) for PR 2648 at commit [`4391d3b`](https://github.com/apache/spark/commit/4391d3bc2f20fda20c0c14f14821ead8338c2435).
* This patch merges cleanly.
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58551193 Jenkins, retest this please.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58550555 retest? @davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58261048 LGTM, thanks! Jenkins, retest this please.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18544747
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = totalParts * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (totalParts * 4)
```
--- End diff --
fixed. good catch.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18544049
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = totalParts * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (totalParts * 4)
```
--- End diff --
same here.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18544040
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = totalParts * 4
```
--- End diff --
`totalParts` should be `partsScanned`.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58247998 @davies addressed your comments.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18540413
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
Sure. And `*4` isn't slow at all; will add that cap.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18539447
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
Having an upper bound is better for safety and stability; otherwise we will need to fix it some day, just as we had to do for the case where buf.size == 0.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18539262
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       if (results.size == 0) {
         numPartsToTry = totalParts - 1
```
--- End diff --
take() was fixed in https://github.com/yingjieMiao/spark/commit/ba5bcaddecd54811d45c5fc79a013b3857d4c633, but AsyncRDDActions was missed in that patch; thanks for bringing this up.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18539122
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
If we assume some uniformity in the distribution of items over partitions, then in your example we probably do have to try 1000 partitions (uniformity --> linear prediction). However, since I am quite new to Spark, I have little idea what such distributions look like in practice. That said, having an upper bound also makes sense. I am curious what other people think.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18538732
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       if (results.size == 0) {
         numPartsToTry = totalParts - 1
```
--- End diff --
Sure, I can. The comment says: "If we didn't find any rows after the first iteration, just try all partitions next". I had little context about these decisions, but I agree that we should keep the logic equivalent across these methods.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18538556
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
In a bad case where buf.size = 1, num = 1000, and partsScanned = 1, numPartsToTry will be 1000, so it's better to use partsScanned * 4 as the upper bound on numPartsToTry.
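To make the rule being converged on in this thread concrete, here is a small Python sketch (illustrative only, not Spark's code; the function name is made up): quadruple when the previous scan found nothing, otherwise interpolate with a 50% overestimate, keeping the result between 1 and 4x the partitions already scanned.

```python
# Illustrative sketch of the numPartsToTry rule discussed in this thread;
# not Spark's actual implementation.

def next_parts_to_try(num, parts_scanned, results_so_far):
    """How many extra partitions to scan on the next take() iteration."""
    if parts_scanned == 0:
        return 1  # first iteration: try a single partition
    if results_so_far == 0:
        # No rows found yet: quadruple instead of jumping to all partitions.
        return parts_scanned * 4
    # Linear interpolation overestimated by 50%, capped between 1 and
    # 4x the partitions already scanned.
    estimate = int(1.5 * num * parts_scanned / results_so_far) - parts_scanned
    return min(max(estimate, 1), parts_scanned * 4)

# The bad case above: buf.size = 1, num = 1000, partsScanned = 1.
# Without the cap the interpolation alone gives 1499; the cap limits it to 4.
print(next_parts_to_try(1000, 1, 1))  # 4
```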
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18538361
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       if (results.size == 0) {
         numPartsToTry = totalParts - 1
```
--- End diff --
Could you also change this to partsScanned * 4?
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58093248 @rxin previous build failed due to code style. Should be fixed. Thank you!
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58089783

Oops, looks like `Scalastyle checks failed at following occurrences: message=File line length exceeds 100 characters line=1087`. Will fix that.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58087562

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21339/consoleFull) for PR 2648 at commit [`1d2c410`](https://github.com/apache/spark/commit/1d2c4107c4240180358233d0c23614f663bcbf8e).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58087566

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21339/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58087347

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21339/consoleFull) for PR 2648 at commit [`1d2c410`](https://github.com/apache/spark/commit/1d2c4107c4240180358233d0c23614f663bcbf8e).
* This patch merges cleanly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58086404

Jenkins, retest this please.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58081521

@aarondav Updated `rdd.py` and `AsyncRDDActions.scala`.
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58076032

Could you make this change in rdd.py as well? The code should be kept equivalent.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58066531

@rxin Updated. Please see the comments.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58064383

Hmm... `(1.5 * num * partsScanned / buf.size).toInt >= partsScanned + 1` holds whenever `partsScanned >= 2`, but it fails when `partsScanned == 1`, i.e. exactly after the first iteration. Will fix that.
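The edge case can be checked numerically (plain Python mirroring the Scala expression; the concrete numbers are illustrative):

```python
# Inside the loop buf_size < num, so for parts_scanned >= 2 the estimate
# int(1.5 * num * parts_scanned / buf_size) exceeds 1.5 * parts_scanned,
# hence is at least parts_scanned + 1. At parts_scanned == 1 it can round
# down to 1, e.g. when the first partition already yielded almost enough rows:
num, parts_scanned, buf_size = 100, 1, 99

estimate = int(1.5 * num * parts_scanned / buf_size)  # int(1.51...) == 1
print(estimate - parts_scanned)  # 0: zero new partitions scheduled, no progress

# The `max 1` guard in the patch forces at least one more partition per round:
print(max(estimate - parts_scanned, 1))  # 1
```

This zero-increment case is the infinite loop the timed-out test runs pointed at.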
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58057490

It seems like this leads to an infinite loop, and tests are timing out because of it.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57997093

**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/269/consoleFull)** for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f) after a configured wait of `120m`.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57996893

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21320/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57996887

**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21320/consoleFull)** for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f) after a configured wait of `120m`.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985492

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/269/consoleFull) for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985301

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21320/consoleFull) for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f).
* This patch merges cleanly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985280

Changes LGTM.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985017

Jenkins, test this please.
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57955322

This seems right to me, yingjie. Let's see if the tests pass.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57872445

Can one of the admins verify this patch?
GitHub user yingjieMiao opened a pull request: https://github.com/apache/spark/pull/2648

[Spark] RDD take() method: overestimate too much

The comment on line 1083 says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."

`(1.5 * num * partsScanned / buf.size).toInt` is the estimate of the total number of partitions needed, so in each iteration we should add only the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`. The existing implementation instead grows `partsScanned` exponentially (roughly `x_{n+1} >= (1.5 + 1) x_n`), which could be a performance problem (unless this is the intended behavior).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yingjieMiao/spark rdd_take

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2648.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2648

commit a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f
Author: yingjieMiao
Date: 2014-10-03T22:26:01Z

    RDD take method: overestimate too much
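The growth-rate claim can be illustrated with one round of the recurrence (plain Python; the fraction `f` of requested rows already collected is a made-up illustrative value, not anything Spark computes):

```python
# Suppose buf already holds a fraction f of the num rows requested after
# scanning parts_scanned partitions; then the interpolated total is
# 1.5 * parts_scanned / f partitions (the 50% overestimate included).
parts_scanned, f = 100, 0.9                 # 90% of the rows already collected
estimate = int(1.5 * parts_scanned / f)     # interpolated total: 166

# Pre-patch code adds the whole estimate on top of what was already scanned:
old_next = parts_scanned + estimate                          # 266, ~2.7x growth
# Patched code adds only the shortfall versus the estimate:
new_next = parts_scanned + max(estimate - parts_scanned, 1)  # 166

print(old_next, new_next)  # 266 166
```

Even this close to completion, the pre-patch recurrence more than doubles `partsScanned` in one round (the `x_{n+1} >= (1.5 + 1) x_n` behavior described above), while the patched version lands on the interpolated total itself.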