[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-12-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3079


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-12-17 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-67382250
  
Do please reopen, though, once you have something that is passing tests :)





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-12-17 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-67382177
  
Hi @erikerlandson, thanks for working on this.  It would be great to have a 
solution to this long-running problem.  Since it looks like there is still some 
work to be done, I propose we close this issue for now.  We are on a renewed 
mission to keep the PR queue small and limited to active PRs so that things 
don't fall through the cracks.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-20 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-63881800
  
For reference, this other issue has some overlap:
https://issues.apache.org/jira/browse/SPARK-4514






[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-09 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/3079#discussion_r20062337
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +117,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

`valRB` is a somewhat confusing name.  I think the convention would be to name 
it `_rangeBounds`.  E.g.


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala#L111





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61719969
  
  [Test build #22892 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22892/consoleFull)
 for   PR 3079 at commit 
[`2183325`](https://github.com/apache/spark/commit/2183325884c69878184fcf55d368339379269f35).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61719975
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22892/
Test FAILed.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61704937
  
  [Test build #22892 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22892/consoleFull)
 for   PR 3079 at commit 
[`2183325`](https://github.com/apache/spark/commit/2183325884c69878184fcf55d368339379269f35).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61675446
  
  [Test build #22880 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22880/consoleFull)
 for   PR 3079 at commit 
[`0fc30fe`](https://github.com/apache/spark/commit/0fc30fe411056090e4e18c96beeb3049ee750b24).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61675448
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22880/
Test FAILed.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61675397
  
  [Test build #22880 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22880/consoleFull)
 for   PR 3079 at commit 
[`0fc30fe`](https://github.com/apache/spark/commit/0fc30fe411056090e4e18c96beeb3049ee750b24).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61565392
  
  [Test build #22828 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22828/consoleFull)
 for   PR 3079 at commit 
[`019ac27`](https://github.com/apache/spark/commit/019ac27689132cdd0a7858259b168b9c91d3ba7a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61565401
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22828/
Test FAILed.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61556754
  
@erikerlandson I think you also need -Phive for the tests to run.  It is 
possible some other things changed (or even that that test case changed with 
the upgrade to hive 13).  Perhaps you can include a dummy change to the Hive 
code to make sure those tests are run in this PR so we can see what Jenkins 
thinks?





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61556278
  
  [Test build #22828 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22828/consoleFull)
 for   PR 3079 at commit 
[`019ac27`](https://github.com/apache/spark/commit/019ac27689132cdd0a7858259b168b9c91d3ba7a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61556289
  
@marmbrus, @scwf,  FWIW, the `correlationoptimizer14` test appears to be 
working for me. I ran it using: `env _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true 
./dev/run-tests > ~/run-tests-1021.txt 2>&1`

Not sure why, but running `sbt 
-Dspark.hive.whitelist=correlationoptimizer14 hive/test` was not causing the 
test to run in my environment.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/3079#issuecomment-61555496
  
Reboot of #1689





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
GitHub user erikerlandson opened a pull request:

https://github.com/apache/spark/pull/3079

[SPARK-1021] Defer the data-driven computation of partition bounds in so...

...rtByKey() until evaluation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/erikerlandson/spark spark-1021-pr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3079


commit 019ac27689132cdd0a7858259b168b9c91d3ba7a
Author: Erik Erlandson 
Date:   2014-07-30T22:59:27Z

[SPARK-1021] Defer the data-driven computation of partition bounds in 
sortByKey() until evaluation.







[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-61508261
  
@marmbrus, FWIW, the `correlationoptimizer14` test appears to be working 
for me.   I ran it using: `env _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true 
./dev/run-tests > ~/run-tests-1021.txt 2>&1`

Not sure why, but running `sbt -Dspark.hive.whitelist=correlationoptimizer14 
hive/test` was not causing the test to run in my environment.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-28 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57110142
  
@rxin @marmbrus I will check it out





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-28 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57108427
  
I reverted this commit. @erikerlandson mind taking a look at this problem?






[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-28 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57106886
  
Since this PR was merged the correlationoptimizer14 test has been hanging.  
We might want to consider rolling back.  You can reproduce the problem as 
follows: `sbt -Dspark.hive.whitelist=correlationoptimizer14 hive/test`





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57043930
  
Have either of you thought about how to coordinate this with Josh's work on 
SPARK-3626? https://github.com/apache/spark/pull/2482





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57043862
  
BTW one thing that would be great to add is a test that makes sure we don't 
block the main dag scheduler thread. The reason I think we don't block is that 
we call rdd.partitions.length in submitJob:

```scala
  /**
   * Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
   * can be used to block until the job finishes executing or can be used to cancel the job.
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties = null): JobWaiter[U] =
  {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
```
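
A minimal sketch of the kind of test suggested above, written without Spark: it checks that a lazily computed partition list is forced on the caller's thread before any work reaches a single "scheduler" thread. `PartitionsOnCallerThread`, `Dataset`, and this simplified `submitJob` are hypothetical stand-ins, not Spark's classes or signatures.

```scala
import java.util.concurrent.{Callable, Executors}

object PartitionsOnCallerThread {
  // Hypothetical stand-in for an RDD whose partition list is expensive to compute.
  class Dataset {
    @volatile var computedOn: String = ""
    lazy val partitions: Array[Int] = {
      // Record which thread paid for the computation.
      computedOn = Thread.currentThread().getName
      Array(0, 1, 2)
    }
  }

  // Forces `partitions` on the caller, then hands the job to a single "scheduler" thread.
  def submitJob(ds: Dataset, requested: Seq[Int]): Int = {
    val maxPartitions = ds.partitions.length // evaluated on the caller's thread
    require(requested.forall(p => p >= 0 && p < maxPartitions), "partition out of range")
    val scheduler = Executors.newSingleThreadExecutor()
    try scheduler.submit(new Callable[Int] { def call(): Int = requested.size }).get()
    finally scheduler.shutdown()
  }
}
```

After a call to `submitJob`, `computedOn` holds the name of the calling thread rather than the executor's thread, which is the property such a test would assert.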





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1689





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57043822
  
@erikerlandson i'm going to merge this first. Maybe we can do the cleanup 
later.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r18122214
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

Maybe we can rename valRB to _rangeBounds and use this directly in 
getPartitions?
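
For readers following the naming discussion, here is a minimal sketch of the pattern in question: a `@volatile` backing field named `_rangeBounds`, filled in on first access with double-checked locking. `LazyBounds` and its `compute` parameter are hypothetical and use `Int` keys for brevity; this is not the code from the PR.

```scala
// Hypothetical stand-in for a partitioner whose bounds are expensive to compute.
class LazyBounds(compute: () => Array[Int]) {
  // Backing field, named with a leading underscore as suggested in the review.
  @volatile private var _rangeBounds: Array[Int] = null

  // Double-checked locking: once initialized, access costs one volatile read,
  // which matters if the accessor is called once per record.
  def rangeBounds: Array[Int] = {
    var b = _rangeBounds
    if (b == null) {
      this.synchronized {
        b = _rangeBounds
        if (b == null) {
          b = compute()
          _rangeBounds = b
        }
      }
    }
    b
  }
}
```

A plain `lazy val` gives similar once-only semantics; the explicit field is shown only because the diff under review uses one.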





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r18122212
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

this is going to be called once for every record on workers actually. 





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r18122197
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   }
 
   @throws(classOf[IOException])
-  private def readObject(in: ObjectInputStream) {
+  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
+if (valRB != null) return
--- End diff --

that's not possible





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-57043705
  
Actually I looked at it again. I don't think it would block the scheduler 
because we compute partitions outside the scheduler thread. This approach looks 
good to me! 





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-16 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55805772
  
So far the best idea I have for (2) is to set some kind of timeout on the 
evaluation.   The bound computation uses subsampling that will (when all goes 
well) cap the computation at constant time.  If the timeout triggers, some 
sub-optimal fallback for partitioning might be used.  Or just fail the entire 
evaluation.
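
A rough sketch of the timeout-with-fallback idea, using only the Scala standard library. `BoundsWithTimeout` and `boundsOrFallback` are hypothetical names, and the bounds are plain `Int` arrays for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import java.util.concurrent.TimeoutException

object BoundsWithTimeout {
  // Hypothetical helper: try the data-driven computation, fall back if it takes too long.
  def boundsOrFallback(
      compute: () => Array[Int],
      fallback: Array[Int],
      timeout: FiniteDuration): Array[Int] = {
    val f = Future(compute())
    try Await.result(f, timeout)
    catch { case _: TimeoutException => fallback }
  }
}
```

Note that a timed-out `Future` is not cancelled, so the abandoned computation keeps running in the background; a real implementation would also need a way to stop it.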





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55797086
  
Yea I don't think we need to fully solve 3 here.

My main concern with this set of changes is 2, since a single badly 
behaved RDD can potentially block the (unfortunately single-threaded) scheduler 
forever. Let me think about this a little bit and get back to you.

If you have an idea about how to fix that, feel free to suggest it.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-15 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55628401
  
Or, maybe just look into playing the same game with the cogrouped RDDs that 
I did with sortByKey.   Don't get into invoking `defaultPartitioner` until 
somebody asks for the output RDD's partitioner, etc.
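
As an illustration of the deferral described here, a node can hold its partitioner choice in a `lazy val`, so nothing is evaluated until a caller asks for it. `DeferredNode` is a hypothetical miniature with a `String` standing in for the partitioner, not Spark's cogroup implementation.

```scala
// Hypothetical miniature of an RDD-like node; not Spark's actual classes.
class DeferredNode(choosePartitioner: () => String) {
  // Not evaluated at construction, only when a caller first asks for it.
  lazy val partitioner: String = choosePartitioner()
}
```

Constructing the node stays cheap, and the cost of choosing a partitioner moves to the first access of `partitioner`.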





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-15 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55627362
  
Hi @rxin,

1) SimpleFutureAction is still referred to in submitJob method, but that 
doesn't appear to be invoked anywhere.  I was reluctant to get rid of it, as 
it's all experimental, and I could envision use cases for it.

2) I see your point.  I don't currently have any clever ideas to avoid that 
scenario when it happens.

3) Very interesting -- so this scenario is triggered because 
`defaultPartitioner` starts examining input RDD partitioners, which sets off 
the job when it trips over the data driven partitioning computation from 
`sortByKey`.

My impression is that this whack-a-mole with non-laziness stems from a 
combination of (a) data-dependent partitioners and (b) methods that refer 
to input partitioners as part of the construction of new RDDs.   It *might* be 
possible to thread some design changes around so that references to 
partitioning are consistently encapsulated in a Future.  Functions such as 
`defaultPartitioner` would then also have to return a Future, etc.  Or, even 
more generally, somehow encapsulate *all* RDD initialization in a Future, with 
the idea that these futures would finally unwind when some Action was invoked.  

However, this seems (imo) outside the scope of this particular Jira/PR.  Maybe 
we could start another umbrella Jira to track possible solutions along these 
lines.
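
To make the Future idea concrete, here is a rough sketch in plain Scala. `Bounds`, `Node`, `defaultPartitioning`, and `runAction` are hypothetical names, not Spark APIs. Note that a Scala `Future` starts running as soon as it is created, so wrapping alone does not defer anything; the sketch models the not-yet-run sampling job with a `Promise` that is only completed later.

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object FuturePartitioningSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for a partitioner; it only carries range bounds.
  final case class Bounds(upper: Vector[Int])

  // Hypothetical RDD-like node: partitioning is exposed only as a Future.
  final class Node(val partitioning: Future[Bounds])

  // A defaultPartitioner-like function that combines the input futures
  // without waiting on any of them.
  def defaultPartitioning(inputs: Seq[Node]): Future[Bounds] =
    Future.sequence(inputs.map(_.partitioning)).map(_.maxBy(_.upper.length))

  // Only an "action" blocks, which is where the chain finally unwinds.
  def runAction(node: Node): Bounds =
    Await.result(node.partitioning, 10.seconds)
}
```

A node built from `defaultPartitioning` stays incomplete until the sampling `Promise` is fulfilled, so constructing the join does not force the sort's bounds.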

Another orthogonal thought -- you can short circuit all this by providing a 
partitioner instead of forcing it to be computed from data.  That's not as 
sexy, or widely applicable, as some deeper fix to the problem, but users can do 
it now as a workaround when it's feasible.
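
As an illustration of that workaround, here is a simplified range partitioner whose bounds come from the caller instead of from a sampling pass. `FixedRangePartitioner` is a hypothetical class for this sketch, written over `Int` keys, and is not Spark's `RangePartitioner`.

```scala
object ExplicitBoundsSketch {
  // Hypothetical partitioner with caller-supplied upper bounds, so that
  // constructing it never needs to look at the data.
  final class FixedRangePartitioner(upperBounds: Array[Int]) {
    require(upperBounds.sameElements(upperBounds.sorted), "bounds must be sorted")

    // One partition per bound, plus one for keys above the last bound.
    val numPartitions: Int = upperBounds.length + 1

    // Index of the first bound that is >= key, or the last partition if none.
    def getPartition(key: Int): Int = {
      val i = upperBounds.indexWhere(key <= _)
      if (i < 0) upperBounds.length else i
    }
  }
}
```

In Spark itself, the analogous step would presumably be a custom `Partitioner` subclass handed to something like `partitionBy`; that wiring is left out here to keep the sketch self-contained.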






[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55464403
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20236/consoleFull) for PR 1689 at commit [`50b6da6`](https://github.com/apache/spark/commit/50b6da6234188a147508654b08e6b67cbf01fbec).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55458226
  
@erikerlandson  thanks for looking at this.

A few questions:

1. After this pull request, does anything still use SimpleFutureAction?
2. If I understand this correctly, this could potentially block the 
single-threaded scheduler from doing anything else while waiting for the 
rangeBounds to be computed. Any comment on this?
3. This still isn't always lazy, right? See this test case:
```scala
sc.parallelize(1 to 1000).map(x => (x, x)).sortByKey().join(sc.parallelize(1 to 10).map(x => (x, x)))
```








[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55457077
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20236/consoleFull) for PR 1689 at commit [`50b6da6`](https://github.com/apache/spark/commit/50b6da6234188a147508654b08e6b67cbf01fbec).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-55456438
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-54694535
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52400243
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18675/consoleFull) for PR 1689 at commit [`f3448e4`](https://github.com/apache/spark/commit/f3448e47b671570cc95c99f809b68b9382c5cd1b).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52397817
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18675/consoleFull) for PR 1689 at commit [`f3448e4`](https://github.com/apache/spark/commit/f3448e47b671570cc95c99f809b68b9382c5cd1b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52342401
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for PR 1689 at commit [`09f0637`](https://github.com/apache/spark/commit/09f0637ac5ff986701d76c874b6567313022a0ab).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52339006
  
Excellent!  I'll try to find some time to review this soon.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52336202
  
The latest push makes the RangePartitioner sampling job async, and updates 
the async action functions so that they properly enclose the sampling job 
induced by calling `partitions`.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52336221
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for PR 1689 at commit [`09f0637`](https://github.com/apache/spark/commit/09f0637ac5ff986701d76c874b6567313022a0ab).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15931660
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   }
 
   @throws(classOf[IOException])
-  private def readObject(in: ObjectInputStream) {
+  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
+    if (valRB != null) return
--- End diff --

I was also assuming readObject might be called from multiple threads.  Can 
that happen?





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15931609
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-    if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

My understanding of getPartitions was that it executes once, and is 
therefore "allowed to be expensive".  Also, isn't rangeBounds generally only 
returning a reference to the array?  (except for the first time, where it's 
computed)
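
The compute-once behavior described above can be sketched with double-checked locking around a `@volatile` field. `LazyBounds` and its `compute` parameter are hypothetical; `compute` stands in for the sampling job, and this is not the code in the PR.

```scala
object LazyBoundsSketch {
  // Hypothetical holder for lazily computed range bounds.
  final class LazyBounds(compute: () => Array[Int]) {
    @volatile private var valRB: Array[Int] = null

    def rangeBounds: Array[Int] = {
      val cached = valRB // single volatile read on the fast path
      if (cached != null) cached
      else this.synchronized {
        // Re-check under the lock so that only the first caller computes.
        if (valRB == null) valRB = compute()
        valRB
      }
    }
  }
}
```

After the first call, `rangeBounds` only returns the cached array reference; the lock is taken only while the field is still null.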





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15920203
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-    if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

It wouldn't surprise me if this performance figure varied with different 
combinations of hardware and Java version; but for at least one such 
combination, volatile reads are roughly 2-3x as costly as non-volatile reads as 
long as they are uncontended -- much more expensive when there are concurrent 
writes to contend with. http://brooker.co.za/blog/2012/09/10/volatile.html 





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15919599
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-    if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

Any idea on volatile's impact on read performance? rangeBounds is read 
multiple times in getPartition.
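
One way to limit that cost, sketched here under assumptions, is to read the volatile field once per call into a local `val` and let the scan use the local. `Partitionerish` is a hypothetical class over `Int` keys, and the `volatileReads` counter is instrumentation for the sketch only (it is not itself thread-safe).

```scala
object SingleReadSketch {
  final class Partitionerish(initial: Array[Int]) {
    @volatile private var valRB: Array[Int] = initial

    // Instrumentation for this sketch: counts reads of the volatile field.
    var volatileReads = 0
    private def rangeBounds: Array[Int] = { volatileReads += 1; valRB }

    def getPartition(key: Int): Int = {
      val bounds = rangeBounds // one volatile read; the loop uses the local
      var partition = 0
      // Linear scan for the first bound that is >= key.
      while (partition < bounds.length && key > bounds(partition)) partition += 1
      partition
    }
  }
}
```

With bounds `Array(10, 20, 30)`, a call to `getPartition(25)` walks two bounds but reads the volatile field only once.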





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15919352
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   }
 
   @throws(classOf[IOException])
-  private def readObject(in: ObjectInputStream) {
+  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
+    if (valRB != null) return
--- End diff --

Do we not want to deserialize valRB if it is not null? Are you worried 
rangeBounds might be called while the deserialization is happening? 





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-51424177
  
QA results for PR 1689:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-51421389
  
QA tests have started for PR 1689. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15900503
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,13 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-    if (partitions <= 1) {
+  private var valRB: Array[K] = Array()
--- End diff --

Can we perhaps make this thread safe? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-50829158
  
QA results for PR 1689:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-50824621
  
QA tests have started for PR 1689. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-50824343
  
Jenkins, this is ok to test.




[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread erikerlandson
GitHub user erikerlandson opened a pull request:

https://github.com/apache/spark/pull/1689

[SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/erikerlandson/spark spark-1021-pr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1689.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1689


commit fa9bbca42d423b3cc9b10063073f7ab7507b1922
Author: Erik Erlandson 
Date:   2014-07-30T22:59:27Z

[SPARK-1021] Defer the data-driven computation of partition bounds in 
sortByKey() until evaluation.






[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-50765803
  
Can one of the admins verify this patch?

