[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-12-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3794


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-12-30 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-168112504
  

I'm going to close this pull request. If this is still relevant and you are 
interested in pushing it forward, please open a new pull request. Thanks!





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-12-15 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-164882462
  
@markhamstra @kayousterhout could you have a look?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-06-27 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-116156525
  
In #7002, I added message processing time metrics to DAGScheduler using 
Codahale metrics, so it should now be much easier to benchmark this.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-06-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-111743100
  
FYI, this is on my list of old PRs / issues to revisit in the medium-term.  
I'm also considering adding some instrumentation to DAGScheduler to make this 
type of blocking / slowdown easier to discover; see 
https://issues.apache.org/jira/browse/SPARK-8344





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-04-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-92250794
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30147/





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-04-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-92250773
  
  [Test build #30147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30147/consoleFull) for PR 3794 at commit [`d5c0e84`](https://github.com/apache/spark/commit/d5c0e846b8515cb52625814e785056449dbbb07d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-04-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-92227329
  
  [Test build #30147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30147/consoleFull) for PR 3794 at commit [`d5c0e84`](https://github.com/apache/spark/commit/d5c0e846b8515cb52625814e785056449dbbb07d).





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-27 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71761428
  
/cc @marmbrus, since you mentioned seeing this issue before.  Do you think 
the proposal of having our own DAG traversal outside of DAGScheduler, plus calling 
`partitions` there, would fix the case that you encountered?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-25 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71405599
  
@JoshRosen I don't think calling rdd.partitions on just the final RDD would 
achieve our goal. Moreover, rdd.partitions is already called earlier:

    470 // Check to make sure we are not launching a task on a partition that does not exist.
    471 val maxPartitions = rdd.partitions.length

However, that does not cover cases like the example I contrived.
To avoid the thread-safety issue, do you think we could use another method to 
compute the parent stages that does not mutate any global map, or simply use 
a method like the getParentPartitions I committed earlier to get the 
partitions directly?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-25 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r23507001
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -483,6 +483,20 @@ class DAGScheduler(
 assert(partitions.size > 0)
 val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
 val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
+
+// The reason for performing this call here is that computing the 
partitions
+// may be very expensive for certain types of RDDs (e.g. HadoopRDDs), 
so therefore
+// we'd like that computation to take place outside of the 
DAGScheduler to avoid
+// blocking its event processing loop. See SPARK-4961 for details.
+try {
+  getParentStages(rdd, jobId).foreach(_.rdd.partitions)
--- End diff --

I just realized that this could be a thread-safety issue: `getParentStages` 
could call `getShuffleMapStage`, which mutates the non-thread-safe 
`shuffleToMapStage` map.  Even if that map were synchronized, we could still 
have race conditions between calls from the event processing loop and external 
calls.

Do you think we could just call `rdd.partitions` on the final RDD (e.g. the 
`rdd` local variable here) instead of calling `getParentStages`?
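The eager-computation idea being debated can be sketched in miniature: force `partitions` on the submitting thread via a plain DAG walk that touches no scheduler-internal state. `SimpleRDD` and `eagerlyComputePartitions` below are hypothetical stand-ins for illustration, not Spark's actual classes:

```scala
import scala.collection.mutable

// Illustrative stand-in for an RDD: partitions are computed lazily and
// cached, so whichever thread calls `partitions` first pays the cost
// (expensive for a real HadoopRDD, which has to list files on HDFS).
class SimpleRDD(val deps: Seq[SimpleRDD], numPartitions: Int) {
  var computeCount = 0
  lazy val partitions: Array[Int] = {
    computeCount += 1
    (0 until numPartitions).toArray
  }
}

object EagerSubmit {
  // Walk the lineage and force getPartitions on the caller's thread before
  // any JobSubmitted message reaches the event loop. Unlike getParentStages,
  // this mutates no scheduler state (no shuffleToMapStage access), so it can
  // safely run outside DAGScheduler.
  def eagerlyComputePartitions(finalRdd: SimpleRDD): Unit = {
    val seen = mutable.Set[SimpleRDD]()
    def visit(r: SimpleRDD): Unit =
      if (seen.add(r)) { r.partitions; r.deps.foreach(visit) }
    visit(finalRdd)
  }
}
```

After one call to `eagerlyComputePartitions`, every later `partitions` access is a cached lookup, which is the effect the patch wants for the event loop.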





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71308409
  
@JoshRosen I've brought this up to date with master. Thanks.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71307242
  
  [Test build #26042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26042/consoleFull) for PR 3794 at commit [`d5c0e84`](https://github.com/apache/spark/commit/d5c0e846b8515cb52625814e785056449dbbb07d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71307244
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26042/





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71306578
  
  [Test build #26041 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26041/consoleFull) for PR 3794 at commit [`267e375`](https://github.com/apache/spark/commit/267e3751e485be733e1a24f62391f72e6078d5a0).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71306584
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26041/





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71305101
  
  [Test build #26042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26042/consoleFull) for PR 3794 at commit [`d5c0e84`](https://github.com/apache/spark/commit/d5c0e846b8515cb52625814e785056449dbbb07d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71304162
  
  [Test build #26041 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26041/consoleFull) for PR 3794 at commit [`267e375`](https://github.com/apache/spark/commit/267e3751e485be733e1a24f62391f72e6078d5a0).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-22 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-71155707
  
This approach looks good to me, so feel free to bring this up to date with 
master.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r23434730
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -495,6 +495,19 @@ class DAGScheduler(
 assert(partitions.size > 0)
 val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
 val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
+
+// Makes sure that getPartitions occurs before
+// the job submitter sends a message into the DAGScheduler actor.
--- End diff --

I'd expand this comment to explain that the reason for performing this call 
here is that computing the partitions may be very expensive for certain types 
of RDDs (e.g. HadoopRDDs), so therefore we'd like that computation to take 
place outside of the DAGScheduler to avoid blocking its event processing loop.  
I'd also mention SPARK-4961 so that it's easier to find more context on JIRA.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-20 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70628086
  
@JoshRosen Thanks. I've updated it per your comments. Please review again. 
However, there are merge conflicts. I will resolve them if this approach is 
accepted.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70597066
  
  [Test build #25784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25784/consoleFull) for PR 3794 at commit [`aed530b`](https://github.com/apache/spark/commit/aed530b31481e3f8ed007ee0abf99a9180d4342d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70597070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25784/





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70591879
  
  [Test build #25784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25784/consoleFull) for PR 3794 at commit [`aed530b`](https://github.com/apache/spark/commit/aed530b31481e3f8ed007ee0abf99a9180d4342d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70548146
  
Good catch on the error-handling logic.

> I directly use getParentStages which will call RDD's getPartitions before 
sending JobSubmitted event.

Does this really call `.partitions`?  It looks like `getParentStages` just 
looks at RDDs' dependencies.  I was suggesting something more like using 
`getParentStages` to get the list of RDDs, then explicitly doing 
`_.foreach(_.partitions)` on that list.
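The distinction drawn here can be demonstrated with a toy lineage: a dependency-only traversal (roughly what `getParentStages` does) never forces the lazy `partitions`, so an explicit `foreach(_.partitions)` pass is still needed. `LazyRDD` and `collectDag` below are hypothetical illustrations, not Spark code:

```scala
import scala.collection.mutable

// `partitions` is lazy, mirroring RDD.partitions: merely visiting an RDD
// through its `deps` list leaves the expensive computation untriggered.
class LazyRDD(val deps: Seq[LazyRDD]) {
  var partitionsComputed = false
  lazy val partitions: Array[Int] = { partitionsComputed = true; Array(0) }
}

object DagWalk {
  // Dependency-only traversal: collects the DAG's RDDs without ever
  // touching `partitions`, analogous to walking stage dependencies.
  def collectDag(root: LazyRDD): Seq[LazyRDD] = {
    val seen = mutable.LinkedHashSet[LazyRDD]()
    def visit(r: LazyRDD): Unit = if (seen.add(r)) r.deps.foreach(visit)
    visit(root)
    seen.toSeq
  }
}
```

After `collectDag` alone, every `partitionsComputed` flag is still false; only an explicit `collectDag(root).foreach(_.partitions)` forces them, which is the extra step being suggested.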





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70481411
  
@JoshRosen Thanks for your comments. I've updated it. I directly use 
getParentStages, which calls the RDDs' getPartitions before sending the 
JobSubmitted event. Is that OK?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70457855
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25745/





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70457851
  
  [Test build #25745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25745/consoleFull) for PR 3794 at commit [`b535a53`](https://github.com/apache/spark/commit/b535a531ee853c29d63cda0154be54512740bc78).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-70452982
  
  [Test build #25745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25745/consoleFull) for PR 3794 at commit [`b535a53`](https://github.com/apache/spark/commit/b535a531ee853c29d63cda0154be54512740bc78).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-14 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-69916653
  
@JoshRosen I've updated it. Please review again. Thanks.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-69899323
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25536/





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-69899316
  
  [Test build #25536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25536/consoleFull) for PR 3794 at commit [`09afdff`](https://github.com/apache/spark/commit/09afdff3f8e511c947fc14cd4673b1629c105f41).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-69891440
  
  [Test build #25536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25536/consoleFull) for PR 3794 at commit [`09afdff`](https://github.com/apache/spark/commit/09afdff3f8e511c947fc14cd4673b1629c105f41).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-06 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68967210
  
It looks like this might be legitimately causing a problem in BagelSuite?  
I'll see if I can reproduce this locally.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68821077
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25076/
Test FAILed.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68821076
  
**[Test build #25076 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25076/consoleFull)**
 for PR 3794 at commit 
[`74c1dec`](https://github.com/apache/spark/commit/74c1dec31ba9ded5a82640f7354aa6231169281c)
 after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68812400
  
  [Test build #25076 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25076/consoleFull)
 for   PR 3794 at commit 
[`74c1dec`](https://github.com/apache/spark/commit/74c1dec31ba9ded5a82640f7354aa6231169281c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2015-01-05 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68812196
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68443755
  
**[Test build #24958 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24958/consoleFull)**
 for PR 3794 at commit 
[`74c1dec`](https://github.com/apache/spark/commit/74c1dec31ba9ded5a82640f7354aa6231169281c)
 after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68443757
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24958/
Test FAILed.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68441201
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24957/
Test FAILed.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68441198
  
**[Test build #24957 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24957/consoleFull)**
 for PR 3794 at commit 
[`fd87518`](https://github.com/apache/spark/commit/fd87518d7f81de1a122cfad25a88956a596ccd4f)
 after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68438540
  
  [Test build #24958 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24958/consoleFull)
 for   PR 3794 at commit 
[`74c1dec`](https://github.com/apache/spark/commit/74c1dec31ba9ded5a82640f7354aa6231169281c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68438167
  
@JoshRosen Thanks for your comments. I've updated the patch according to your 
comments and put together a simple example:
```scala
val inputfile1 = "./testin/in_1.txt"
val inputfile2 = "./testin/in_2.txt"
val tempfile = "./testtmp"
val outputfile = "./testout"
val sc = new SparkContext(new SparkConf())
sc.textFile(inputfile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
  .map { kv => kv._1 + "," + kv._2.toString }
  .saveAsTextFile(tempfile)
val wordCounts1 = sc.textFile(tempfile)
val wordCounts2 = sc.textFile(inputfile2)
val wordCounts = wordCounts1.union(wordCounts2)
wordCounts.map { line =>
    val kv = line.split(",")
    (kv(0), Integer.parseInt(kv(1)))
  }
  .reduceByKey(_ + _, 1)
  .map { kv => kv._1 + "," + kv._2.toString }
  .saveAsTextFile(outputfile)
```
./testin/in_1.txt (23 bytes) and ./testin/in_2.txt (19 bytes) are both local 
files.
- Before optimization:
  - job1: new stage creation took 0.729638 s, of which 
HadoopRDD.getPartitions took 0.710247 s.
  - job2: new stage creation took 0.882241 s, of which 
HadoopRDD.getPartitions took 0.850668 + 0.023490 s.
- After optimization:
  - job1: HadoopRDD.getPartitions took 0.802133 s; 
new stage creation took 0.029328 s.
  - job2: HadoopRDD.getPartitions took 0.464713 + 0.022568 s; 
new stage creation took 0.001773 s.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68436420
  
  [Test build #24957 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24957/consoleFull)
 for   PR 3794 at commit 
[`fd87518`](https://github.com/apache/spark/commit/fd87518d7f81de1a122cfad25a88956a596ccd4f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-31 Thread YanTangZhai
Github user YanTangZhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r22376680
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -178,7 +178,7 @@ abstract class RDD[T: ClassTag](
   // Our dependencies and partitions will be gotten by calling subclass's 
methods below, and will
   // be overwritten when we're checkpointed
   private var dependencies_ : Seq[Dependency[_]] = null
-  @transient private var partitions_ : Array[Partition] = null
+  @transient private var partitions_ : Array[Partition] = getPartitions
--- End diff --

Sorry, this approach may cause an error like the following:
```
Exception in thread "main" java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:191)
at com.google.common.collect.MapMakerInternalMap.put(MapMakerInternalMap.java:3499)
at org.apache.spark.rdd.HadoopRDD$.putCachedMetadata(HadoopRDD.scala:273)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:151)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:173)
at org.apache.spark.rdd.RDD.<init>(RDD.scala:181)
at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:97)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:561)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:471)
```
because `jobConfCacheKey` has not yet been initialized at that point.
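The failure mode described above is a classic JVM initialization-order trap: the superclass constructor forces a method that reads a subclass field before that field is assigned. A minimal sketch in plain Scala, with no Spark dependency and purely illustrative names:

```scala
// Minimal sketch of the initialization-order trap described above:
// Base's constructor calls an overridden method before Child's fields
// are assigned, so the subclass field is still null at that point.
abstract class Base {
  // Runs during Base's constructor, i.e. before Child's body executes.
  val computed: String = compute()
  def compute(): String
}

class Child extends Base {
  // Assigned only after Base's constructor has finished.
  val cacheKey: String = "jobConfCacheKey"
  override def compute(): String =
    if (cacheKey == null) "cacheKey was null" else cacheKey
}

object InitOrderDemo extends App {
  println(new Child().computed)  // prints "cacheKey was null"
}
```

This is exactly why assigning `partitions_ = getPartitions` at field-initialization time in the abstract `RDD` class can observe uninitialized state in subclasses like `HadoopRDD`.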





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68385773
  
@JoshRosen If you've given it a look over and don't see a conflict, then 
you are probably right about the orthogonality.  I was asking out of caution 
rather than knowledge or a definite expectation of a conflict.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68382391
  
@markhamstra 

> How would this interact with the idea of @erikerlandson to defer 
partition computation?
#3079

Maybe I'm overlooking something, but #3079 seems kind of orthogonal.  It 
seems like that issue is concerned with making the `sortByKey` transformation 
lazy so that it does not eagerly trigger a Spark job to compute the range 
partition boundaries, whereas this pull request is related to eager vs. lazy 
evaluation of what's effectively a Hadoop filesystem metadata call.

Maybe eager vs. lazy is the wrong way to think about this PR's issue, 
though, since I guess we're more concerned with _where_ the call is performed 
(blocking DAGScheduler's event loop vs. a driver user-code thread) than when 
it's performed.  I suppose that maybe you could contrive an example where this 
patch changes the behavior of a user job, since maybe someone defines some 
transformations up-front, runs jobs to generate output, then reads it back in 
another RDD, in which case the data to be read might not exist at the time that 
the RDD is defined but will exist when the first action on it is invoked.  So, 
maybe we should consider moving the first `partitions` call closer to the 
DAGScheduler's job submission methods, but not inside of the actor (e.g. don't 
change any code in `RDD`, but just add a call that traverses the lineage chain 
and calls `partitions` on each RDD, making sure that this call occurs before 
the job submitter sends a message into the DAGScheduler actor).
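The last suggestion — traversing the lineage chain and forcing `partitions` on each RDD before the submitter messages the scheduler — might look roughly like this sketch. `SketchRDD` and its members are illustrative stand-ins, not Spark's real classes:

```scala
// Sketch of the proposal above: walk an RDD-like lineage graph and force
// partition computation on every node before job submission, keeping the
// expensive metadata calls out of the scheduler's event loop.
class SketchRDD(val name: String, val deps: Seq[SketchRDD]) {
  var partitionsForced = false
  def partitions: Array[Int] = {
    partitionsForced = true      // stand-in for the expensive metadata call
    Array.range(0, 2)
  }
}

object ForceLineage {
  // Depth-first traversal; a real version would also guard against
  // revisiting shared ancestors in a diamond-shaped lineage.
  def force(rdd: SketchRDD): Unit = {
    rdd.partitions
    rdd.deps.foreach(force)
  }
}
```

Calling `ForceLineage.force(finalRdd)` just before sending the `JobSubmitted` message would pay the cost on the submitting user-code thread.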





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r22357840
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala ---
@@ -46,6 +47,7 @@ private[spark] class BinaryFileRDD[T](
 for (i <- 0 until rawSplits.size) {
   result(i) = new NewHadoopPartition(id, i, 
rawSplits(i).asInstanceOf[InputSplit with Writable])
 }
+logDebug("Get these partitions took %f s".format((System.nanoTime - 
start) / 1e9))
--- End diff --

Since this `getPartitions` method is guaranteed to only be called once, I 
think we can just move this logging to its call site in `RDD.scala` (e.g. add a 
block near where we assign to `partitions_`).
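The call-site timing suggested here can be sketched as a small reusable helper that wraps the one place the computation happens, instead of instrumenting every subclass. `timed` is an illustrative helper, not Spark's API:

```scala
// Sketch of call-site timing: measure around the single call site rather
// than repeating System.nanoTime bookkeeping in each getPartitions override.
object Timing {
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime
    val result = body            // run the (possibly expensive) computation
    println("%s took %f s".format(label, (System.nanoTime - start) / 1e9))
    result
  }
}
```

For example, the assignment site could become `partitions_ = Timing.timed("getPartitions")(getPartitions)`.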





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r22357607
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -202,9 +202,6 @@ abstract class RDD[T: ClassTag](
*/
   final def partitions: Array[Partition] = {
 checkpointRDD.map(_.partitions).getOrElse {
-  if (partitions_ == null) {
--- End diff --

Won't this now throw an NPE if we call `partitions` from a worker, since now 
this will return `null` after the RDD is serialized and deserialized?  I guess 
maybe we never do that?





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68361347
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24892/
Test FAILed.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68361338
  
**[Test build #24892 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24892/consoleFull)**
 for PR 3794 at commit 
[`6e95955`](https://github.com/apache/spark/commit/6e95955c9c67ce509372fe08f9ced962eb251593)
 after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68353372
  
  [Test build #24892 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24892/consoleFull)
 for   PR 3794 at commit 
[`6e95955`](https://github.com/apache/spark/commit/6e95955c9c67ce509372fe08f9ced962eb251593).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-29 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r22336168
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -203,9 +204,27 @@ class HadoopRDD[K, V](
 for (i <- 0 until inputSplits.size) {
   array(i) = new HadoopPartition(id, i, inputSplits(i))
 }
+logDebug("Get these partitions took %f s".format((System.nanoTime - 
start) / 1e9))
 array
   }
 
+  @transient private var thesePartitions_ : Array[Partition] = {
+try {
+  getThesePartitions()
+} catch {
+  case e: Exception => 
+logDebug("Error initializing HadoopRDD's partitions", e)
+null
--- End diff --

> It seems like the fix in this patch is to force partitions to be 
eagerly-computed in the driver thread that defines the RDD. This seems like a 
good idea

How would this interact with the idea of @erikerlandson to defer partition 
computation?
https://github.com/apache/spark/pull/3079





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-29 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3794#discussion_r22325585
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -203,9 +204,27 @@ class HadoopRDD[K, V](
 for (i <- 0 until inputSplits.size) {
   array(i) = new HadoopPartition(id, i, inputSplits(i))
 }
+logDebug("Get these partitions took %f s".format((System.nanoTime - 
start) / 1e9))
 array
   }
 
+  @transient private var thesePartitions_ : Array[Partition] = {
+try {
+  getThesePartitions()
+} catch {
+  case e: Exception => 
+logDebug("Error initializing HadoopRDD's partitions", e)
+null
--- End diff --

(This comment is kind of moot since I proposed a more general fix in a 
top-level comment, but I'll still post it anyways:)

I don't think that logging an exception at debug level then returning 
`null` is a good error-handling strategy; this is likely to cause a confusing 
NPE somewhere else with no obvious cause since most users won't have 
debug-level logging enabled.
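As a contrast to the log-and-return-`null` pattern criticized above, a sketch of the more conventional handling: log visibly and rethrow, so the failure surfaces at its real cause rather than as a later mystery NPE. Names are illustrative, not Spark's API:

```scala
// Sketch contrasting the two error-handling strategies discussed above.
// Swallowing the exception hides the failure behind a null; rethrowing
// keeps the stack trace at the point where partition computation failed.
object PartitionInit {
  def swallow(compute: () => Array[Int]): Array[Int] =
    try compute() catch {
      case _: Exception => null        // caller gets a confusing NPE later
    }

  def propagate(compute: () => Array[Int]): Array[Int] =
    try compute() catch {
      case e: Exception =>
        System.err.println(s"Error computing partitions: ${e.getMessage}")
        throw e                        // fail fast with the real cause
    }
}
```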





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68295291
  
To maybe summarize the motivation a bit more succinctly, it seems like the 
problem here is that the first call to `rdd.partitions` might be expensive and 
might occur inside the DAGScheduler event loop, blocking the entire scheduler.  
I guess this is an unfortunate side-effect of laziness: we might have expensive 
lazy initialization but it can be hard to reason about when/where it will 
occur, causing difficult-to-diagnose performance bottlenecks.

It seems like the fix in this patch is to force `partitions` to be 
eagerly-computed in the driver thread that defines the RDD.  This seems like a 
good idea, but I have a few minor nits with the fix as it's currently 
implemented:

- I understand that the motivation for this is HadoopRDD's expensive 
`getPartitions` method, but it seems like the problem is potentially more 
general.  Is there any way to handle this in `RDD` instead?  I understand that we 
can't just make `partitions` into a `val`, but it looks like the `@transient 
partitions_` logic is already there in `RDD`, so maybe we could just toss a 
`self.partitions()` call into the `RDD` constructor to force eager evaluation 
on the driver?
- If there's some reason that we can't implement my proposal in `RDD`, then 
I think we can just add a call to `self.partitions()` at the end of `HadoopRDD`; 
this would eliminate the need for a bunch of the confusing variable names added 
here.
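The lazy-versus-eager distinction in the bullets above can be sketched without Spark; the only change is where, and on whose thread, the expensive call happens. All names here are illustrative:

```scala
// Sketch of the two initialization strategies discussed above. With
// eager = true, the expensive partition computation runs in the thread
// that constructs the RDD-like object; otherwise it runs on first access
// (which, in Spark's case, could be inside the DAGScheduler event loop).
class LazyOrEagerRDD(computePartitions: () => Array[Int], eager: Boolean) {
  @transient private var partitions_ : Array[Int] = null

  final def partitions: Array[Int] = {
    if (partitions_ == null) partitions_ = computePartitions()
    partitions_
  }

  if (eager) {
    // Equivalent to tossing a `partitions` call into the constructor.
    partitions
  }
}
```

With `eager = true` the metadata cost is paid at definition time on the driver's user-code thread; with `eager = false` it is deferred to whichever thread first touches `partitions`.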





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68294158
  
To reformat the PR description to make it a little easier to read:

> `HadoopRDD.getPartitions` is lazily deferred until `DAGScheduler` processes the
> `JobSubmitted` event. If the input directory is large, `getPartitions` may take a
> long time; for example, in our cluster it takes anywhere from 0.029 s to 766.699 s.
> While one JobSubmitted event is being processed, the others must wait. Thus, we
> want to move `HadoopRDD.getPartitions` earlier to reduce DAGScheduler's
> JobSubmitted processing time, so that other JobSubmitted events don't have to
> wait as long. A HadoopRDD object could compute its partitions when it is
> instantiated.
> 
> We could analyse and compare the execution time before and after optimization.
> ```
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_]
> ```
> (1) The app has only one job:
> ```
> The execution time of the job before optimization is [time1__][time2_][time3___][time4_].
> The execution time of the job after optimization is [time1__][time3___][time2_][time4_].
> ```
> In summary, if the app has only one job, the total execution time is the same
> before and after optimization.
> (2) The app has 4 jobs. Before optimization:
> ```
> job1 execution time is [time2_][time3___][time4_]
> job2 execution time is [time2__][time3___][time4_]
> job3 execution time is [time2][time3___][time4_]
> job4 execution time is [time2_][time3___][time4_]
> ```
> After optimization:
> ```
> job1 execution time is [time3___][time2_][time4_]
> job2 execution time is [time3___][time2__][time4_]
> job3 execution time is [time3___][time2_][time4_]
> job4 execution time is [time3___][time2__][time4_]
> ```
> In summary, if the app has multiple jobs, the average execution time after
> optimization is less than before.






[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68091858
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24810/
Test PASSed.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68091852
  
  [Test build #24810 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24810/consoleFull)
 for   PR 3794 at commit 
[`af5abda`](https://github.com/apache/spark/commit/af5abdacaf5637e05200216bd8dfcdcfe15b4e17).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68089096
  
  [Test build #24810 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24810/consoleFull)
 for   PR 3794 at commit 
[`af5abda`](https://github.com/apache/spark/commit/af5abdacaf5637e05200216bd8dfcdcfe15b4e17).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68086560
  
This looks like a legitimate test failure.




[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68086434
  
[Test build #24805 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24805/consoleFull) for PR 3794 at commit [`5601a8b`](https://github.com/apache/spark/commit/5601a8b1458c9a7317a2e4e0463358f0a054c181).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68086437
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24805/
Test FAILed.




[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3794#issuecomment-68084876
  
[Test build #24805 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24805/consoleFull) for PR 3794 at commit [`5601a8b`](https://github.com/apache/spark/commit/5601a8b1458c9a7317a2e4e0463358f0a054c181).
 * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

2014-12-24 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/3794

[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce 
DAGScheduler.JobSubmitted processing time

HadoopRDD.getPartitions is evaluated lazily, during DAGScheduler.JobSubmitted processing. If the input directory is large, getPartitions can take a long time; in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, all other events must wait. We therefore want to move HadoopRDD.getPartitions forward, out of the event handler, to reduce DAGScheduler.JobSubmitted processing time so that other JobSubmitted events do not have to wait as long. The HadoopRDD object could compute its partitions when it is instantiated.
We can analyse and compare the execution times before and after this optimization.
TaskScheduler.start execution time: [time1]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions and TaskScheduler.start) execution time: [time2]
HadoopRDD.getPartitions execution time: [time3]
Stages execution time: [time4]
(1) The app has only one job.
The execution time of the job before optimization is [time1][time2][time3][time4].
The execution time of the job after optimization is [time1][time3][time2][time4].
In summary, if the app has only one job, the total execution time is the same before and after the optimization.
(2) The app has 4 jobs.
Before optimization, each job's execution time is [time2][time3][time4], and because JobSubmitted events are handled serially, each later job's [time2][time3] handling also waits behind all of the earlier jobs'.
After optimization, each job's execution time is [time3][time2][time4]; the [time3] phases happen before submission and can overlap across jobs, so each later job only waits behind the earlier jobs' much shorter [time2] handling.
In summary, if the app has multiple jobs, the average execution time after optimization is less than before.
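The timeline argument above can be checked with a toy model (plain Python, not Spark code; the function and all parameter names are illustrative, and the model assumes all jobs are submitted at once): JobSubmitted events are handled one at a time by a serial event loop, and the expensive getPartitions work ([time3]) is either done inside the handler (lazy, before this patch) or up front and concurrently across jobs before submission (eager, after this patch).

```python
def avg_job_time(n_jobs, t2, t3, t4, eager):
    """Average completion time for n_jobs submitted simultaneously.

    t2: JobSubmitted handling (excluding getPartitions)
    t3: HadoopRDD.getPartitions
    t4: stage execution
    The DAGScheduler event loop is serial: one JobSubmitted at a time.
    """
    # Eager: every job finishes getPartitions concurrently by time t3,
    # so its JobSubmitted event only costs t2 inside the serial loop.
    ready = t3 if eager else 0.0
    handler = t2 if eager else t2 + t3
    loop_free, total = ready, 0.0
    for _ in range(n_jobs):
        loop_free = max(loop_free, ready) + handler  # wait for the loop, then handle
        total += loop_free + t4                      # stages run after submission
    return total / n_jobs

# One job: the total is the same either way (t2 + t3 + t4).
print(avg_job_time(1, t2=1, t3=10, t4=5, eager=False))  # 16.0
print(avg_job_time(1, t2=1, t3=10, t4=5, eager=True))   # 16.0
# Four jobs: the eager variant's average completion time is much lower.
print(avg_job_time(4, t2=1, t3=10, t4=5, eager=False))  # 32.5
print(avg_job_time(4, t2=1, t3=10, t4=5, eager=True))   # 17.5
```

This ignores [time1] and other scheduler overheads; it only illustrates why serializing [time3] inside the event loop hurts multi-job apps while leaving single-job apps unchanged.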

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-4961

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3794.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3794


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai 
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai 
Date:   2014-08-20T13:14:08Z

Merge pull request #3 from apache/master

Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai 
Date:   2014-09-12T06:54:58Z

Merge pull request #6 from apache/master

Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai 
Date:   2014-09-16T12:03:22Z

Merge pull request #7 from apache/master

Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai 
Date:   2014-10-20T12:52:22Z

Merge pull request #8 from apache/master

update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai 
Date:   2014-11-04T09:00:31Z

Merge pull request #9 from apache/master

Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai 
Date:   2014-11-11T03:18:24Z

Merge pull request #10 from apache/master

Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai 
Date:   2014-12-01T11:23:56Z

Merge pull request #11 from apache/master

Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai 
Date:   2014-12-05T11:08:31Z

Merge pull request #12 from apache/master

update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai 
Date:   2014-12-24T03:15:22Z

Merge pull request #15 from apache/master

update

commit 5601a8b1458c9a7317a2e4e0463358f0a054c181
Author: yantangzhai 
Date:   2014-12-25T03:17:57Z

[SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce 
DAGScheduler.JobSubmitted processing time



