[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2015-01-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1697


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2015-01-19 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-70468626
  
Let's close this issue pending an update from @shivaram (just doing some 
JIRA clean-up).





[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-12-23 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-68015465
  
Sure. I'll bring this up to date, put it behind a config flag this week and 
ping the PR.





[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-12-23 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-68011432
  
I agree with @rxin; I'd be totally fine with including this as an 
experimental feature, perhaps opt-in while we test it (like we did with 
sort-based shuffle).





[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-11-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-61425923
  
Can we bring this up to date, and:

1. Add a switch to turn it on / off

2. Add a config option to disable this automatically when num reduce / map 
tasks is greater than a certain threshold?
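
The two requests above could be wired up roughly as follows. This is a minimal sketch, not code from the PR: a plain Map stands in for SparkConf, and the keys `example.reduceLocality.*`, the defaults, and the names `Settings` and `shouldComputeLocality` are all made up for illustration.

```scala
// Hypothetical sketch: a plain Map stands in for SparkConf, and every key name is invented.
object ReduceLocalityConfig {
  final case class Settings(enabled: Boolean, maxMapTasks: Int, maxReduceTasks: Int)

  // Read the on/off switch and the two thresholds, falling back to assumed defaults.
  def fromConf(conf: Map[String, String]): Settings = Settings(
    enabled = conf.getOrElse("example.reduceLocality.enabled", "false").toBoolean,
    maxMapTasks = conf.getOrElse("example.reduceLocality.maxMapTasks", "1000").toInt,
    maxReduceTasks = conf.getOrElse("example.reduceLocality.maxReduceTasks", "1000").toInt
  )

  // True only when the feature is switched on and both task counts are within the thresholds.
  def shouldComputeLocality(s: Settings, numMapTasks: Int, numReduceTasks: Int): Boolean =
    s.enabled && numMapTasks <= s.maxMapTasks && numReduceTasks <= s.maxReduceTasks
}
```

Defaulting the switch to off matches the opt-in approach discussed elsewhere in this thread; the threshold defaults here are placeholders.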





[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-10-27 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-60624723
  
Ping @rxin -- Any thoughts on this? I can merge to upstream, and it'll be 
great to have this in 1.2.





[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50948548
  
QA results for PR 1697:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17715/consoleFull




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-08-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50945636
  
QA tests have started for PR 1697. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17715/consoleFull




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-08-01 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50930865
  
One more thing we can do is coalesce sizes from all tasks on a machine 
and only do node-level locality. As map outputs are on disk, there shouldn't be 
any difference between node-level and process-level locality?
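
A minimal sketch of the coalescing idea, assuming each map output is reported as an (executor location, size) pair. `ExecutorLoc` is a hypothetical stand-in for BlockManagerId and `coalesceByHost` is an invented name; neither comes from the PR.

```scala
// Hypothetical sketch: ExecutorLoc is a stand-in for BlockManagerId, not Spark's class.
object NodeLevelLocality {
  final case class ExecutorLoc(executorId: String, host: String)

  // Sum the output sizes of all executors that share a host, so ranking happens per node.
  def coalesceByHost(sizes: Seq[(ExecutorLoc, Long)]): Map[String, Long] =
    sizes.groupBy(_._1.host).map { case (host, entries) => host -> entries.map(_._2).sum }
}
```

Grouping by host also shrinks the number of candidates to rank from one per map task to one per machine.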




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-08-01 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50930229
  
I ran some microbenchmarks as outlined at 
https://gist.github.com/shivaram/63620c47f0ad50106e0a
The comments below the gist have some numbers that I got on my laptop.

Overall I think we should just use an upper bound on the number of map tasks 
and not return any preferred locations if we have more than, say, 1000 map tasks. 
There might be some more optimization we can do in terms of filtering out zeros 
etc., but a simple heuristic might be a good and safe start for now.

@rxin Thoughts ? 
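
The heuristic proposed above might look roughly like this. It is a sketch under stated assumptions: locations are plain strings, `candidateLocations` is an invented name, and the default cap of 1000 is taken from the comment rather than from the patch.

```scala
// Hypothetical sketch; the names are invented and the cap of 1000 comes from the comment above.
object BoundedLocality {
  // sizesByLocation: one (location, bytes) entry per map output destined for a single reducer.
  def candidateLocations(
      sizesByLocation: Seq[(String, Long)],
      maxMapTasks: Int = 1000): Seq[(String, Long)] =
    if (sizesByLocation.length > maxMapTasks) Seq.empty // too many map tasks: no preference
    else sizesByLocation.filter(_._2 > 0L)              // drop empty blocks before ranking
}
```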




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50847052
  
QA results for PR 1697:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17631/consoleFull




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50844964
  
QA tests have started for PR 1697. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17631/consoleFull




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50844818
  
I switched to using Guava's ordering function now and added another unit 
test for that. I plan to do a microbenchmark to see how long it takes to get 
top 5 from a list of Longs.

@kayousterhout -- Is there a way to benchmark the scheduler to see if a 
change introduces any performance regressions?




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/1697#discussion_r15681031
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -284,6 +290,24 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
   }
 
+  // Return the list of locations and blockSizes for each reducer.
+  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(BlockManagerId, Long)]]] = {
--- End diff --

Added comments -- This method is not thread-safe because TimeStampedHashMap is 
not thread-safe. However, we only call this from DAGScheduler, which is 
single-threaded AFAIK.
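
To illustrate the lazily populated, single-caller cache described here, a simplified sketch follows. It is not the PR's implementation: `ReducerStatusCache` and its constructor argument are hypothetical, String stands in for BlockManagerId, and a plain mutable HashMap stands in for TimeStampedHashMap.

```scala
import scala.collection.mutable

// Hypothetical sketch: String stands in for BlockManagerId, a plain HashMap for TimeStampedHashMap.
// Not thread-safe: it assumes a single caller thread, as the comment above says of DAGScheduler.
class ReducerStatusCache(mapOutputs: Int => Option[Array[(String, Array[Long])]]) {
  private val statusByReducer = mutable.HashMap.empty[Int, Map[Int, Array[(String, Long)]]]

  // For each reducer, returns one (location, size) entry per map task, so each array
  // has as many elements as the shuffle has map outputs.
  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(String, Long)]]] =
    statusByReducer.get(shuffleId).orElse {
      mapOutputs(shuffleId).map { statuses =>
        val numReducers = statuses.headOption.map(_._2.length).getOrElse(0)
        val byReducer = (0 until numReducers).map { r =>
          r -> statuses.map { case (loc, sizes) => (loc, sizes(r)) }
        }.toMap
        statusByReducer(shuffleId) = byReducer // populate lazily, on first request only
        byReducer
      }
    }
}
```

If a second thread ever called this method, the check-then-update on the map would need external synchronization.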




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/1697#discussion_r15681022
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1152,6 +1155,18 @@ class DAGScheduler(
 return locs
   }
 }
+  case s: ShuffleDependency[_, _, _] =>
--- End diff --

Done




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50831954
  
Thanks for taking a look -- One thing I realized is that we only need top-5 
and don't need to sort the data. I'll try to use the Guava Ordering class and 
do some benchmarks 
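
As an illustration of selecting the top k without a full sort, here is a sketch that uses only the Scala standard library rather than Guava (Guava's `Ordering.greatestOf` serves a similar purpose). The name `TopK.largest` and the use of String for locations are assumptions made for this example.

```scala
import scala.collection.mutable

// Hypothetical sketch using only the Scala standard library; the PR itself refers to Guava's Ordering.
object TopK {
  // Return the k largest (location, size) pairs, largest first, in O(n log k) time.
  def largest(sizes: Iterable[(String, Long)], k: Int): Seq[(String, Long)] = {
    // PriorityQueue dequeues the maximum under its Ordering, so reverse it to get a min-heap on size.
    val bySizeAscending = Ordering.by[(String, Long), Long](_._2).reverse
    val heap = mutable.PriorityQueue.empty[(String, Long)](bySizeAscending)
    sizes.foreach { entry =>
      heap.enqueue(entry)
      if (heap.size > k) heap.dequeue() // evict the current smallest
    }
    // Drain smallest-first, then reverse so the largest comes first.
    val out = mutable.ArrayBuffer.empty[(String, Long)]
    while (heap.nonEmpty) out += heap.dequeue()
    out.reverse.toList
  }
}
```

The heap never holds more than k + 1 entries, so memory stays constant in the number of map tasks.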




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50831248
  
I have some concern (maybe unfounded) about runtime. If we have 50k map 
tasks and 10k reduce tasks, this would mean doing 10k sorts, each on 50k items, 
right?
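
A back-of-the-envelope way to size this concern is sketched below. The cost model (about n * log2(n) comparisons for a full sort, about n * log2(k) for a bounded top-k selection) ignores constant factors, and the object and method names are invented for this example.

```scala
// Rough estimate only; constant factors are ignored and the cost model is an assumption.
object SortCostEstimate {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // Comparisons for a full sort per reducer (roughly n * log2(n)), summed over all reducers.
  def fullSort(numMaps: Long, numReducers: Long): Double =
    numReducers * numMaps * log2(numMaps.toDouble)

  // Comparisons for a bounded top-k selection per reducer (roughly n * log2(k)), over all reducers.
  def topK(numMaps: Long, numReducers: Long, k: Int): Double =
    numReducers * numMaps * log2(k.toDouble)
}
```

For 50k map tasks and 10k reduce tasks, `fullSort(50000L, 10000L)` comes out on the order of 10^9 to 10^10 comparisons, and `topK(50000L, 10000L, 5)` is several times smaller.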




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1697#discussion_r15675968
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -232,6 +232,11 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
   protected val mapStatuses = new TimeStampedHashMap[Int, Array[MapStatus]]()
   private val cachedSerializedStatuses = new TimeStampedHashMap[Int, Array[Byte]]()
 
+  // For each shuffleId we also maintain a Map from reducerId -> (location, size)
+  // Lazily populated whenever the statuses are requested from DAGScheduler
+  private val statusByReducer =
+    new TimeStampedHashMap[Int, HashMap[Int, Array[(BlockManagerId, Long)]]]()
--- End diff --

should we consider sampling the map tasks to speed up the sort?
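
One way the sampling suggestion could be realized is sketched here. Systematic (every-nth) sampling is an assumption, since the comment does not name a scheme, and `SampledLocality.sample` is a hypothetical helper.

```scala
// Hypothetical sketch: every-nth sampling is an assumption; the thread does not specify a scheme.
object SampledLocality {
  // Keep at most `sampleSize` map outputs by taking every `step`-th entry.
  def sample[T](items: IndexedSeq[T], sampleSize: Int): IndexedSeq[T] = {
    require(sampleSize > 0, "sampleSize must be positive")
    if (items.length <= sampleSize) items
    else {
      val step = math.ceil(items.length.toDouble / sampleSize).toInt
      items.indices.by(step).map(items)
    }
  }
}
```

Ranking locations on the sample bounds the work per reducer, at the cost of possibly missing a few very large map outputs when sizes are skewed.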




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1697#discussion_r15675840
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1152,6 +1155,18 @@ class DAGScheduler(
 return locs
   }
 }
+  case s: ShuffleDependency[_, _, _] =>
--- End diff --

add some inline comment explaining this case




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1697#discussion_r15675815
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -284,6 +290,24 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
   }
 
+  // Return the list of locations and blockSizes for each reducer.
+  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(BlockManagerId, Long)]]] = {
--- End diff --

also comment on how large the array is




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1697#discussion_r15675784
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -284,6 +290,24 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
   }
 
+  // Return the list of locations and blockSizes for each reducer.
+  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(BlockManagerId, Long)]]] = {
--- End diff --

comment on the thread safety




[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1697#issuecomment-50801223
  
QA tests have started for PR 1697. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17592/consoleFull

