Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/14039

I haven't got anything more concrete to offer at this time than the descriptions in the relevant JIRAs, but I do have this running in production with 1.6, and it does work. Essentially, you build a cache in your application whose keys are canonicalized query fragments and whose values are the RDDs associated with those fragments of the logical plan, which in turn produce the shuffle files. For as long as you hold references to those RDDs in your cache, Spark won't remove the shuffle files.

For as long as you have sufficient memory available to the OS, those shuffle files will be accessed via the OS buffer cache, which is actually pretty quick and doesn't require any Java heap management or garbage collection. That was the original motivation behind using shuffle files in this way, before off-heap caching and unified memory management were available. It's less necessary now (at least once I figure out how to do the mapping between logical plan fragments and tables cached off-heap), but it is still a valid alternative caching mechanism.
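A minimal sketch of the fragment cache described above, under the assumption that the key is some canonical string form of a logical-plan fragment and the value is the RDD whose shuffle output should stay alive; the class and method names here (FragmentCache, getOrRegister, release) are hypothetical, not anything from the PR itself:

    import scala.collection.concurrent.TrieMap
    import org.apache.spark.rdd.RDD

    // Sketch only: a thread-safe map from a canonicalized plan fragment to the
    // RDD built for it. Holding a strong reference to the RDD here is what keeps
    // Spark from cleaning up its shuffle files.
    class FragmentCache {
      private val cache = new TrieMap[String, RDD[_]]()

      // Return the RDD already registered for this fragment (so its existing
      // shuffle files are reused), or build and register a new one.
      def getOrRegister(canonicalFragment: String, build: => RDD[_]): RDD[_] =
        cache.getOrElseUpdate(canonicalFragment, build)

      // Dropping the reference makes the RDD, and eventually its shuffle files,
      // eligible for cleanup again.
      def release(canonicalFragment: String): Unit = cache.remove(canonicalFragment)
    }

The design choice is simply that the application, not Spark, decides how long a fragment's shuffle output stays reusable, by controlling when the reference is dropped.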