[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304114#comment-14304114 ]
Sean Owen commented on SPARK-5140:
----------------------------------

Is this not substantially answered by just materializing the cached RDD right after you cache it? Then anything that happens afterwards already sees a cached RDD. Is the request basically to automatically persist and unpersist RDDs to implement this? I suppose the issue is simply that this is hard to figure out. Even if you can figure out that two RDDs can be computed in parallel, and need to be, and depend on one parent, it's not obvious that you can just persist the parent RDD automatically. I guess the question is: what, specifically, would this change look like?

> Two RDDs which are scheduled concurrently should be able to wait on parent in
> all cases
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-5140
>                 URL: https://issues.apache.org/jira/browse/SPARK-5140
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Corey J. Nolet
>              Labels: features
>             Fix For: 1.3.0, 1.2.1
>
> Not sure if this would change too much of the internals to be included in
> 1.2.1, but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for the other guy to
> finish caching" logic is on a per-task basis, and it only works on tasks that
> happen to be executing on the same machine.
> bq. Once a partition is cached, we will schedule tasks that touch that
> partition on that executor. The problem here, though, is that the caching is in
> progress, and so the tasks are still scheduled randomly (or with whatever
> locality the data source has), so tasks which end up on different machines
> will not see that the cache is already in progress.
> bq. Here was my test, by the way:
> {code}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
>
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.seconds)
> {code}
> bq. Note that I ran the future 4 times in parallel. I found that the first
> run had all tasks take 10 seconds. The second had about 50% of its tasks take
> 10 seconds, and the rest just waited for the first stage to finish. The last
> two runs had no tasks that took 10 seconds; all waited for the first two
> stages to finish.
> What we want is the ability to fire off a job and have the DAG scheduler figure
> out that two RDDs depend on the same parent, so that when the children are
> scheduled concurrently, the first one to start will activate the parent and
> both will wait on the parent. When the parent is done, they will both be able
> to finish their work concurrently. We are trying to use this pattern by
> having the parent cache its results.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
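Sean's suggested workaround above — materializing the cache eagerly, right after `cache()` — can be sketched as follows. This is only an illustrative sketch, assuming `spark-core` on the classpath and a local master; the object name and timings are made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

object EagerCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("eager-cache-sketch"))

    // cache() is lazy: it only marks the RDD for caching.
    val rdd = sc.parallelize(0 until 8).map { i => Thread.sleep(1000); i }.cache()

    // Force the cache with a single action *before* launching concurrent jobs,
    // so every later job sees a fully cached parent instead of an
    // in-progress cache on some other executor.
    rdd.count()

    val futures = (0 until 4).map(_ => Future { rdd.count })
    Await.result(Future.sequence(futures), 120.seconds)
    sc.stop()
  }
}
```

With the eager `count()`, only the first (sequential) action pays the computation cost; the four concurrent counts all read from the cache.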
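The "first one to start activates the parent, and both wait on it" behavior the ticket asks for is, within a single JVM, exactly what synchronized memoization provides. A minimal stdlib-only analogue of those semantics (names are illustrative; this models the desired scheduling behavior, not Spark's actual scheduler):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object SharedParentDemo {
  val computations = new AtomicInteger(0)

  // A lazy val is initialized under a lock: the first caller computes it,
  // every concurrent caller blocks until the value is ready. This is the
  // per-process analogue of "children wait on the shared parent".
  lazy val parent: Int = {
    computations.incrementAndGet()
    Thread.sleep(200) // stand-in for an expensive parent computation
    42
  }

  def main(args: Array[String]): Unit = {
    // Four "child jobs" race for the shared parent concurrently.
    val futures = (0 until 4).map(_ => Future(parent))
    val results = Await.result(Future.sequence(futures), 30.seconds)

    assert(results.forall(_ == 42))
    // Despite four concurrent callers, the parent was computed exactly once.
    assert(computations.get == 1)
    println("ok")
  }
}
```

Cluster-wide, Spark has no such lock: each task only discovers a finished cached partition on its own executor, which is why the ticket asks the DAG scheduler to coordinate this across machines.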