JarroVGIT opened a new issue, #1829: URL: https://github.com/apache/datafusion-ballista/issues/1829
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** Single node DatafFusion creates a `MemTable` when you call [`DataFrame.cache()`](https://docs.rs/datafusion/53.1.0/src/datafusion/dataframe/mod.rs.html#2368). It essentially executes the plan, and stores the resulting partitions as a `MemTable` in the SessionState, returning a `DataFrame` that will just read from that `MemTable`. This is currently not supported in Ballista, and supporting it presents some challenges due to the distributed nature of the execution. For example, where would the cache "live" and how would executioners get their hands on it during execution? Another problem (I think, could be wrong!) is that `CacheFactory::create()` is a synchronous function. Ballista already implements some of the extension points to handle a call to `.cache()`. It has an implementation of `CacheProvider` (`BallistaCacheFactory` in the core crate), which either returns the `LogicalPlan` unchanged or wraps it in a `BallistaCacheNode`. The behaviour is steered by a configuration option (`ballista.cache.noop` which defaults to true), and if you were to enable this option today, the execution fails at physical planning because no `ExtensionPlanner` knows how to turn it into an `ExecutionPlan`. The basic wiring is in place, now we need to figure out what "caching" means in an executed context and what is stored where. I am doing some reading and research on what approaches would be possible and will post here if I find anything worth discussing. **Describe alternatives you've considered** I think we should be able to re-use the Shuffle mechanics to store partitions on executioners and leverage the same mechanics to get the data to the right executioner. The previous sentence is based fully on my (limited) mental model of Ballista, so if it doesn't make sense I will soon find out I guess. **Additional context** (copied from Datafusion example docs) There is an example in Datafusion (`datafusion-examples`) which demonstrates how to leverage `CacheFactory` to implement custom caching strategies for dataframes in DataFusion. By default, `DataFrame::cache` in Datafusion is eager and creates an in-memory table. This example shows a basic alternative implementation for lazy caching. Specifically, it implements: - A `CustomCacheFactory` that creates a logical node `CacheNode` representing the cache operation. - A `CacheNodePlanner` (an `ExtensionPlanner`) that understands `CacheNode` and performs caching. - A `CacheNodeQueryPlanner` that installs `CacheNodePlanner`. - A simple in-memory `CacheManager` that stores cached `RecordBatch`es. Note that the implementation for this example is very naive and only implements put, but for real production use cases cache eviction and drop should also be implemented. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
