[GitHub] spark pull request: [SPARK-2774] - Set preferred locations for red...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1697 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-70468626

Let's close this issue pending an update from @shivaram (just doing some JIRA clean-up).
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-68015465

Sure. I'll bring this up to date, put it behind a config flag this week and ping the PR.
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-68011432

I agree with @rxin; I'd be totally fine with including this as an experimental feature, perhaps opt-in while we test it (like we did with sort-based shuffle).
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-61425923

Can we bring this up to date, and:
1. Add a switch to turn it on / off
2. Add a config option to disable this automatically when the number of reduce / map tasks is greater than a certain threshold?
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-60624723

Ping @rxin -- Any thoughts on this? I can merge to upstream and it'll be great to have this in 1.2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50948548

QA results for PR 1697:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17715/consoleFull
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50945636

QA tests have started for PR 1697. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17715/consoleFull
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50930865

One more thing we can do is to coalesce sizes from all tasks on a machine and only do node-level locality. As map outputs are on disk, there shouldn't be any difference between node-level and process-level locality?
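The per-machine coalescing described above could look roughly like the following. This is a minimal, hypothetical Java sketch, not the PR's Scala code; the method name `coalesceByHost` and the flat arrays of hosts and sizes are assumptions standing in for the real per-reducer MapStatus data.

```java
import java.util.HashMap;
import java.util.Map;

public class HostCoalesce {
    // Collapse per-executor map-output sizes for one reducer into per-host
    // totals, so locality preferences can be computed at node level only.
    static Map<String, Long> coalesceByHost(String[] hosts, long[] sizes) {
        Map<String, Long> totals = new HashMap<>();
        for (int i = 0; i < hosts.length; i++) {
            // merge() adds the size to any total already recorded for this host
            totals.merge(hosts[i], sizes[i], Long::sum);
        }
        return totals;
    }
}
```

Two executors on the same host then contribute a single combined entry, which is what makes node-level locality equivalent to process-level locality when the map outputs are read from disk anyway.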
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50930229

I ran some microbenchmarks as outlined at https://gist.github.com/shivaram/63620c47f0ad50106e0a and the comments below the gist have some numbers that I got on my laptop. Overall I think we should just use an upper bound on the number of map tasks and not return any preferred locations if we have more than, say, 1000 map tasks. There might be some more optimization we can do in terms of filtering out zeros etc., but a simple heuristic might be a good and safe start for now. @rxin Thoughts?
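The proposed heuristic amounts to a simple guard before any locality work is done. An illustrative Java sketch, not the PR's code; the constant, the method name, and the per-host size map are all assumptions:

```java
import java.util.Map;
import java.util.Optional;

public class LocalityHeuristic {
    // Hypothetical cutoff mirroring the ~1000-map-task bound discussed above.
    static final int MAX_MAP_TASKS_FOR_LOCALITY = 1000;

    // Skip the preferred-location computation entirely when the shuffle has
    // too many map outputs for the per-reducer aggregation to stay cheap.
    static Optional<Map<String, Long>> preferredLocations(
            int numMapTasks, Map<String, Long> sizesByHost) {
        if (numMapTasks > MAX_MAP_TASKS_FOR_LOCALITY) {
            return Optional.empty(); // fall back to no locality preference
        }
        return Optional.of(sizesByHost);
    }
}
```

Returning "no preference" keeps the scheduler's fast path unchanged for large shuffles, which is why a coarse bound like this is a safe first step.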
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50847052

QA results for PR 1697:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17631/consoleFull
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50844964

QA tests have started for PR 1697. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17631/consoleFull
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50844818

I switched to using Guava's ordering function now and added another unit test for that. I plan to do a microbenchmark to see how long it takes to get the top 5 from a list of Longs. @kayousterhout -- Is there a way to benchmark the scheduler to see if a change introduces any performance regressions?
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/1697#discussion_r15681031

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -284,6 +290,24 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
   }
+  // Return the list of locations and blockSizes for each reducer.
+  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(BlockManagerId, Long)]]] = {
--- End diff --

Added comments -- This method is not thread safe as TimeStampedHashMap is not thread safe. However we only call this from the DAGScheduler, which is single-threaded AFAIK.
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/1697#discussion_r15681022

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1152,6 +1155,18 @@ class DAGScheduler(
         return locs
       }
     }
+      case s: ShuffleDependency[_, _, _] =>
--- End diff --

Done
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50831954

Thanks for taking a look -- One thing I realized is that we only need top-5 and don't need to sort the data. I'll try to use the Guava Ordering class and do some benchmarks
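The top-5-without-a-full-sort idea can be implemented with a bounded min-heap in O(n log k) time instead of an O(n log n) sort; Guava's `Ordering.natural().greatestOf(iterable, k)` provides the same operation as a one-liner. A stdlib-only Java sketch of the heap approach (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopK {
    // Keep the k largest sizes using a min-heap of capacity k: each of the
    // n offers costs O(log k), so the total is O(n log k) instead of the
    // O(n log n) a full sort would cost.
    static List<Long> topK(long[] sizes, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap
        for (long s : sizes) {
            heap.offer(s);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest of the current candidates
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }
}
```

For k = 5 the per-element cost is effectively constant, which is why this scales to very large map-task counts where a full sort per reducer would not.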
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50831248

I have some concern (maybe unfounded) about runtime. If we have 50k map tasks and 10k reduce tasks, this would require doing 10k sorts, each on 50k items, right?
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1697#discussion_r15675968

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -232,6 +232,11 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
   protected val mapStatuses = new TimeStampedHashMap[Int, Array[MapStatus]]()
   private val cachedSerializedStatuses = new TimeStampedHashMap[Int, Array[Byte]]()
+  // For each shuffleId we also maintain a Map from reducerId -> (location, size)
+  // Lazily populated whenever the statuses are requested from DAGScheduler
+  private val statusByReducer =
+    new TimeStampedHashMap[Int, HashMap[Int, Array[(BlockManagerId, Long)]]]()
--- End diff --

should we consider sampling the map tasks to speed up the sort?
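The sampling suggestion could work roughly like this: estimate per-location totals from a random subset of the map statuses, then scale the sampled sum back up by the sampling fraction. A hypothetical Java sketch under stated assumptions; the names and the flat sizes array are made up, and real code would sample MapStatus entries per reducer rather than raw longs:

```java
import java.util.Random;

public class MapStatusSampling {
    // Estimate the total output bytes for one location by summing only a
    // random fraction of the map statuses and scaling the result back up.
    static double estimateTotal(long[] sizes, double fraction, long seed) {
        Random rng = new Random(seed); // fixed seed for reproducibility
        long sampled = 0;
        int count = 0;
        for (long s : sizes) {
            if (rng.nextDouble() < fraction) { // include with prob. ~fraction
                sampled += s;
                count++;
            }
        }
        if (count == 0) {
            return 0.0; // nothing sampled; caller should treat as unknown
        }
        return sampled / fraction; // unbiased scale-up of the sampled sum
    }
}
```

The estimate only needs to rank locations relative to each other, so sampling error is tolerable as long as it does not reorder the top few hosts.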
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1697#discussion_r15675840

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1152,6 +1155,18 @@ class DAGScheduler(
         return locs
       }
     }
+      case s: ShuffleDependency[_, _, _] =>
--- End diff --

add some inline comment explaining this case
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1697#discussion_r15675815

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -284,6 +290,24 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
   }
+  // Return the list of locations and blockSizes for each reducer.
+  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(BlockManagerId, Long)]]] = {
--- End diff --

also comment on how large the array is
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1697#discussion_r15675784

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -284,6 +290,24 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
     cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
   }
+  // Return the list of locations and blockSizes for each reducer.
+  def getStatusByReducer(shuffleId: Int): Option[Map[Int, Array[(BlockManagerId, Long)]]] = {
--- End diff --

comment on the thread safety
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1697#issuecomment-50801223

QA tests have started for PR 1697. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17592/consoleFull