JarroVGIT opened a new issue, #1829:
URL: https://github.com/apache/datafusion-ballista/issues/1829

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Single node DatafFusion creates a `MemTable` when you call 
[`DataFrame.cache()`](https://docs.rs/datafusion/53.1.0/src/datafusion/dataframe/mod.rs.html#2368).
 It essentially executes the plan, and stores the resulting partitions as a 
`MemTable` in the SessionState, returning a `DataFrame` that will just read 
from that `MemTable`.
   
   This is currently not supported in Ballista, and supporting it presents some 
challenges due to the distributed nature of the execution. For example, where 
would the cache "live" and how would executioners get their hands on it during 
execution? Another problem (I think, could be wrong!) is that 
`CacheFactory::create()` is a synchronous function.
   
   Ballista already implements some of the extension points to handle a call to 
`.cache()`. It has an implementation of `CacheProvider` (`BallistaCacheFactory` 
in the core crate), which either returns the `LogicalPlan` unchanged or wraps 
it in a `BallistaCacheNode`. The behaviour is steered by a configuration option 
(`ballista.cache.noop` which defaults to true), and if you were to enable this 
option today, the execution fails at physical planning because no 
`ExtensionPlanner` knows how to turn it into an `ExecutionPlan`. 
   
   The basic wiring is in place, now we need to figure out what "caching" means 
in an executed context and what is stored where. I am doing some reading and 
research on what approaches would be possible and will post here if I find 
anything worth discussing.
   
   
   **Describe alternatives you've considered**
   I think we should be able to re-use the Shuffle mechanics to store 
partitions on executioners and leverage the same mechanics to get the data to 
the right executioner. The previous sentence is based fully on my (limited) 
mental model of Ballista, so if it doesn't make sense I will soon find out I 
guess.
   
   **Additional context**
   (copied from Datafusion example docs)
   There is an example in Datafusion (`datafusion-examples`) which demonstrates 
how to leverage `CacheFactory` to implement custom caching strategies for 
dataframes in DataFusion. By default, `DataFrame::cache` in Datafusion is eager 
and creates an in-memory table. This example shows a basic alternative 
implementation for lazy caching. Specifically, it implements:
   - A `CustomCacheFactory` that creates a logical node `CacheNode` 
representing the cache operation.
   - A `CacheNodePlanner` (an `ExtensionPlanner`) that understands `CacheNode` 
and performs caching.
   - A `CacheNodeQueryPlanner` that installs `CacheNodePlanner`.
   - A simple in-memory `CacheManager` that stores cached `RecordBatch`es. Note 
that the implementation for this example is very naive and only implements put, 
but for real production use cases cache eviction and drop should also be 
implemented.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to