Re: [I] Add support for `DataFrame.cache()` to Ballista [datafusion-ballista]

via GitHub Sun, 07 Jun 2026 08:39:15 -0700


JarroVGIT commented on issue #1829:
URL: 
https://github.com/apache/datafusion-ballista/issues/1829#issuecomment-4643126276


   > Breaking down into smaller deliverables would be appreciated
   
   Yes definitely will try, but first the design/idea. My current line of 
thinking (with a fair bit of support from my friendly neighbourhood LLM) is 
something along the line of this:
   - A cached dataset is "just" shuffle output that is kept instead of cleaned 
up: the materialized partitions stay as Arrow IPC files on the executors that 
produced them, in a pinned directory excluded from normal job/TTL cleanup.
   - The scheduler holds a cache manager (a cross-job registry mapping 
`cache_id` -> `Vec<Vec<PartitionLocation>>`, where `cache_id` is coming from 
the client in `BallistaCacheNode`) that records where each cached partition 
lives. Important to note that it outlives the job that created it.
   - On a cache miss the planner (an `ExtensionPlanner` for the existing 
`BallistaCacheNode`) inserts a new `BallistaCacheWriterExec` that runs the 
subplan as a pinned stage and writes the cache files; when the stage completes, 
its `PartitionLocations` are stored in the registry. This is very similar to 
the how `ShuffleWriterExec` returns, if I am not mistaking.
   - On a cache hit the planner skips the subplan entirely and inserts a 
`ShuffleReaderExec` over the cached `PartitionLocations`, so the data is read 
back over the existing local-file / Arrow Flight path.
   
   I don't want to implement this through AI, but it does help me quite a bit 
to explore and understand the current codebase a lot quicker. Not all details 
are ironed out but the above does some up where my head is right now.
   
   On your points on what to consider, @milenkovicm :
   - _Scheduling tasks reading cache on executors having cache partition_
     - In this way, it would be very similar to normal shuffle file mechanics, 
which of course could be optimised later maybe in the planner/execution such 
that there is more deterministic partitioning?
   - _executor failures and cache reconciliation_
     - I was hoping if it was okay to basically go with a "if executer dies, 
cache is invalidated" approach. The least resillient, but I think the easiest 
to implement. I haven't spend any time yet on how Ballista handles this with 
normal shuffle files though, so maybe I think this is way harder than it 
actually is. 
   - _Cache spilling_
     - Apologies, I am not sure what you mean by this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Add support for `DataFrame.cache()` to Ballista [datafusion-ballista]

Reply via email to