[GitHub] spark pull request: [Spark] RDD take() method: overestimate too mu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2648 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58947875 I just realized we didn't have a JIRA for this. Let's make sure we create a JIRA ticket to track updates. Thanks.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58947762 Merging in master. Thanks!
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58925491 @yingjieMiao it looks good to me, waiting for other people.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58912546 @davies OK to merge?
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58600623 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21558/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58600620 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21558/consoleFull) for PR 2648 at commit [`d758218`](https://github.com/apache/spark/commit/d758218e286876877f31f9c00e0eb06ab8617e73).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58595846 Jenkins, test this please.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58581878 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/314/consoleFull) for PR 2648 at commit [`a8e74bb`](https://github.com/apache/spark/commit/a8e74bb0015cde80877aa7ce839f1bb4446bce80).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58581751 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/314/consoleFull) for PR 2648 at commit [`a8e74bb`](https://github.com/apache/spark/commit/a8e74bb0015cde80877aa7ce839f1bb4446bce80).
* This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58572877 Can one of the admins verify this patch?
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58566238 @davies fixed. thank you.
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58559393 It failed the Python style check:
```
PEP 8 checks failed.
./python/pyspark/rdd.py:1077:21: E265 block comment should start with '# '
./python/pyspark/rdd.py:1078:101: E501 line too long (101 > 100 characters)
```
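The two warnings above are mechanical checks, so they are easy to illustrate. As a sketch only (Spark's actual lint run uses the pep8 tool, not this code), a toy checker for just these two rules might look like:

```python
# Toy checker for the two PEP 8 rules reported above (E265 and E501).
# Illustration only; this is not Spark's lint tooling.

def check_line(line, max_len=100):
    """Return the PEP 8 error codes that `line` would trigger."""
    errors = []
    stripped = line.lstrip()
    # E265: a block comment must be '#' followed by a space (shebangs excluded).
    if stripped.startswith("#") and not stripped.startswith("#!"):
        if stripped != "#" and not stripped.startswith("# "):
            errors.append("E265")
    # E501: line longer than the configured maximum (Spark uses 100 chars).
    if len(line) > max_len:
        errors.append("E501")
    return errors

print(check_line("    #comment without a space"))  # ['E265']
print(check_line("x = 1"))                         # []
```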
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58556305 @davies style fixed. thanks!
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18664325
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = partsScanned * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (partsScanned * 4)
```
--- End diff --
thank you for the pointer. Will fix!
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18663950
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = partsScanned * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (partsScanned * 4)
```
--- End diff --
Infix Methods: Don't use infix notation for methods that aren't operators. For example, instead of `list map func`, use `list.map(func)`, or instead of `string contains "foo"`, use `string.contains("foo")`. This is to improve familiarity for developers coming from other languages. https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58551768 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/302/consoleFull) for PR 2648 at commit [`4391d3b`](https://github.com/apache/spark/commit/4391d3bc2f20fda20c0c14f14821ead8338c2435).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58551594 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/302/consoleFull) for PR 2648 at commit [`4391d3b`](https://github.com/apache/spark/commit/4391d3bc2f20fda20c0c14f14821ead8338c2435).
* This patch merges cleanly.
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58551193 Jenkins, retest this please.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58550555 retest? @davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58261048 LGTM, thanks! Jenkins, retest this please.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18544747
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = totalParts * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (totalParts * 4)
```
--- End diff --
fixed. good catch.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18544049
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = totalParts * 4
        } else {
-          numPartsToTry = (1.5 * num * partsScanned / results.size).toInt
+          // the left side of max is >=1 whenever partsScanned >= 2
+          numPartsToTry = ((1.5 * num * partsScanned / results.size).toInt - partsScanned) max 1
+          numPartsToTry = numPartsToTry min (totalParts * 4)
```
--- End diff --
same here.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18544040
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -78,16 +78,17 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       // greater than totalParts because we actually cap it at totalParts in runJob.
       var numPartsToTry = 1
       if (partsScanned > 0) {
-        // If we didn't find any rows after the first iteration, just try all partitions next.
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
         // Otherwise, interpolate the number of partitions we need to try, but overestimate it
-        // by 50%.
+        // by 50%. We also cap the estimation in the end.
         if (results.size == 0) {
-          numPartsToTry = totalParts - 1
+          numPartsToTry = totalParts * 4
```
--- End diff --
`totalParts` should be `partsScanned`.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58247998 @davies addressed your comments.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18540413
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
Sure. And `*4` isn't slow at all; will add that cap.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18539447
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
Having an upper bound is better for safety and stability; otherwise we will need to fix it some day, just as we had to do for the case where buf.size == 0.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18539262
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       if (results.size == 0) {
         numPartsToTry = totalParts - 1
```
--- End diff --
take() was fixed in https://github.com/yingjieMiao/spark/commit/ba5bcaddecd54811d45c5fc79a013b3857d4c633, but AsyncRDDActions was missed in that patch; thanks for bringing this up.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18539122
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
If we assume some uniformity in the distribution of items over partitions, then in your example we probably do have to try 1000 partitions (uniformity --> linear prediction). However, since I am quite new to Spark, I have little idea what such distributions look like in practice. That said, having an upper bound also makes sense. I am curious what other people think.
Github user yingjieMiao commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18538732
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       if (results.size == 0) {
         numPartsToTry = totalParts - 1
```
--- End diff --
Sure, I can. The comment says: "If we didn't find any rows after the first iteration, just try all partitions next". I had little context about these decisions, but I agree that we should keep the logic equivalent across these methods.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18538556
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
```
@@ -1084,10 +1084,10 @@ abstract class RDD[T: ClassTag](
       if (buf.size == 0) {
         numPartsToTry = partsScanned * 4
       } else {
-        numPartsToTry = (1.5 * num * partsScanned / buf.size).toInt
+        // the left side of max is >=1 whenever partsScanned >= 2
+        numPartsToTry = ((1.5 * num * partsScanned / buf.size).toInt - partsScanned) max 1
```
--- End diff --
In a bad case where buf.size = 1, num = 1000, and partsScanned = 1, numPartsToTry will be 1000, so it's better to use partsScanned * 4 as the upper bound on numPartsToTry.
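To make the rule being converged on in this thread concrete, here is a small Python sketch (illustrative only, not Spark's code; the function name is made up): quadruple when the previous scan found nothing, otherwise interpolate with a 50% overestimate, keeping the result between 1 and 4x the partitions already scanned.

```python
# Illustrative sketch of the numPartsToTry rule discussed in this thread;
# not Spark's actual implementation.

def next_parts_to_try(num, parts_scanned, results_so_far):
    """How many extra partitions to scan on the next take() iteration."""
    if parts_scanned == 0:
        return 1  # first iteration: try a single partition
    if results_so_far == 0:
        # No rows found yet: quadruple instead of jumping to all partitions.
        return parts_scanned * 4
    # Linear interpolation overestimated by 50%, capped between 1 and
    # 4x the partitions already scanned.
    estimate = int(1.5 * num * parts_scanned / results_so_far) - parts_scanned
    return min(max(estimate, 1), parts_scanned * 4)

# The bad case above: buf.size = 1, num = 1000, partsScanned = 1.
# Without the cap the interpolation alone gives 1499; the cap limits it to 4.
print(next_parts_to_try(1000, 1, 1))  # 4
```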
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2648#discussion_r18538361
--- Diff: core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala ---
```
@@ -84,10 +84,10 @@ class AsyncRDDActions[T: ClassTag](self: RDD[T]) extends Serializable with Loggi
       if (results.size == 0) {
         numPartsToTry = totalParts - 1
```
--- End diff --
Could you also change this to partsScanned * 4?
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58093248 @rxin previous build failed due to code style. Should be fixed. Thank you!
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58089783

Oops, looks like `Scalastyle checks failed at following occurrences: message=File line length exceeds 100 characters line=1087`. Will fix that.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58087562

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21339/consoleFull) for PR 2648 at commit [`1d2c410`](https://github.com/apache/spark/commit/1d2c4107c4240180358233d0c23614f663bcbf8e).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58087566

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21339/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58087347

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21339/consoleFull) for PR 2648 at commit [`1d2c410`](https://github.com/apache/spark/commit/1d2c4107c4240180358233d0c23614f663bcbf8e).
* This patch merges cleanly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58086404

Jenkins, retest this please.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58081521

@aarondav Updated `rdd.py` and `AsyncRDDActions.scala`.
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58076032

Could you make this change in rdd.py as well? The code should be kept equivalent.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58066531

@rxin Updated. Please see the comments.
Github user yingjieMiao commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58064383

Hmm... `(1.5 * num * partsScanned / buf.size).toInt >= partsScanned + 1` holds whenever `partsScanned >= 2`, but it fails when `partsScanned == 1`, i.e. exactly after the first iteration. Will fix that.
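The edge case can be checked numerically (plain Python mirroring the Scala expression; the concrete numbers are illustrative):

```python
# Inside the loop buf_size < num, so for parts_scanned >= 2 the estimate
# int(1.5 * num * parts_scanned / buf_size) exceeds 1.5 * parts_scanned,
# hence is at least parts_scanned + 1. At parts_scanned == 1 it can round
# down to 1, e.g. when the first partition already yielded almost enough rows:
num, parts_scanned, buf_size = 100, 1, 99

estimate = int(1.5 * num * parts_scanned / buf_size)  # int(1.51...) == 1
print(estimate - parts_scanned)  # 0: zero new partitions scheduled, no progress

# The `max 1` guard in the patch forces at least one more partition per round:
print(max(estimate - parts_scanned, 1))  # 1
```

This zero-increment case is the infinite loop the timed-out test runs pointed at.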
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58057490

It seems like this leads to an infinite loop, and tests are timing out because of it.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57997093

**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/269/consoleFull)** for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f) after a configured wait of `120m`.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57996893

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21320/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57996887

**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21320/consoleFull)** for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f) after a configured wait of `120m`.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985492

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/269/consoleFull) for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985301

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21320/consoleFull) for PR 2648 at commit [`a2aa36b`](https://github.com/apache/spark/commit/a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f).
* This patch merges cleanly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985280

Changes LGTM.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57985017

Jenkins, test this please.
Github user ash211 commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57955322

This seems right to me, yingjie. Let's see if the tests pass.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-57872445

Can one of the admins verify this patch?
GitHub user yingjieMiao opened a pull request: https://github.com/apache/spark/pull/2648

[Spark] RDD take() method: overestimate too much

The comment on line 1083 says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."

`(1.5 * num * partsScanned / buf.size).toInt` is the estimate of the total number of partitions needed, so in each iteration we should add only the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`. The existing implementation instead grows `partsScanned` exponentially (roughly `x_{n+1} >= (1.5 + 1) x_n`), which could be a performance problem (unless this is the intended behavior).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yingjieMiao/spark rdd_take

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2648.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2648

commit a2aa36b6838ff71941dab1d4af5c8e5f79fd4b4f
Author: yingjieMiao
Date: 2014-10-03T22:26:01Z

    RDD take method: overestimate too much
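The growth-rate claim can be illustrated with one round of the recurrence (plain Python; the fraction `f` of requested rows already collected is a made-up illustrative value, not anything Spark computes):

```python
# Suppose buf already holds a fraction f of the num rows requested after
# scanning parts_scanned partitions; then the interpolated total is
# 1.5 * parts_scanned / f partitions (the 50% overestimate included).
parts_scanned, f = 100, 0.9                 # 90% of the rows already collected
estimate = int(1.5 * parts_scanned / f)     # interpolated total: 166

# Pre-patch code adds the whole estimate on top of what was already scanned:
old_next = parts_scanned + estimate                          # 266, ~2.7x growth
# Patched code adds only the shortfall versus the estimate:
new_next = parts_scanned + max(estimate - parts_scanned, 1)  # 166

print(old_next, new_next)  # 266 166
```

Even this close to completion, the pre-patch recurrence more than doubles `partsScanned` in one round (the `x_{n+1} >= (1.5 + 1) x_n` behavior described above), while the patched version lands on the interpolated total itself.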