Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/126#discussion_r10582800

    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -181,15 +178,50 @@ private[spark] class MapOutputTracker(conf: SparkConf) extends Logging {
       }
     }
    +/**
    + * MapOutputTracker for the workers. This uses BoundedHashMap to keep track of
    + * a limited number of most recently used map output information.
    + */
    +private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
    +
    +  /**
    +   * Bounded HashMap for storing serialized statuses in the worker. This allows
    +   * the HashMap stay bounded in memory-usage. Things dropped from this HashMap will be
    +   * automatically repopulated by fetching them again from the driver. Its okay to
    +   * keep the cache size small as it unlikely that there will be a very large number of
    +   * stages active simultaneously in the worker.
    +   */
    +  protected val mapStatuses = new BoundedHashMap[Int, Array[MapStatus]](
    --- End diff --

    Andrew and I did some pair programming and I think we figured out how to remove all of the BoundedHashMaps:
    https://github.com/pwendell/spark/commit/dc42db62426fddc8cbe961d9c2b3af1bf1ad14c5