milenkovicm commented on issue #17297: URL: https://github.com/apache/datafusion/issues/17297#issuecomment-3508815981
Hi @jizezhang, thanks for your interest.

> Hi [@milenkovicm](https://github.com/milenkovicm) , I am interested in this task and would like to understand the proposal better. Is the idea that
>
> * `DataFrame::cache` will create a new logical plan with a `TableScan` node as the root and current logical plan of the dataframe (via `self.plan`) as its child?

Yes, a new logical plan node would be created. The open question is which kind of node to create.

> * To capture the lineage, `TableScan` would be constructed with a custom `InMemoryTableSource` that overrides the `get_logical_plan` method from the trait `TableSource`?

I have tried that and it did not work as expected. I don't remember the reason why.

> * Would `DataFrame::cache` returns `ctx.execute_logical_plan(new_plan)`?

Not necessarily. There is no need to execute the logical plan right then; it should just return the logical plan node and let the user decide when to run the plan.

I believe the simplest approach would be to create a factory method which the user can configure. Something like:

```rust
pub async fn cache(self) -> Result<DataFrame> {
    if let Some(cache_producer) = self.state().cache_producer() {
        // A user-configured factory decides how the DataFrame is cached.
        cache_producer(&self)
    } else {
        // This is the current behaviour: materialize the results into a MemTable.
        let context = SessionContext::new_with_state((*self.session_state).clone());
        // The schema is consistent with the output
        let plan = self.clone().create_physical_plan().await?;
        let schema = plan.schema();
        let task_ctx = Arc::new(self.task_ctx());
        let partitions = collect_partitioned(plan, task_ctx).await?;
        let mem_table = MemTable::try_new(schema, partitions)?;
        context.read_table(Arc::new(mem_table))
    }
}
```

`cache_producer` would be a function accepting a DataFrame and returning a DataFrame, most likely one that wraps the plan in a logical plan extension.

Does it make sense? Let me know what you think.
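To make the factory idea a bit more concrete, here is a minimal, self-contained sketch of the dispatch logic. It deliberately uses stand-in types instead of the real DataFusion API: `Plan`, `SessionState`, `DataFrame`, the `CacheProducer` alias, the `Plan::Cached` node and `lazy_cache_producer` are all hypothetical names introduced for illustration, and the real version would be async and return DataFusion's `Result`.

```rust
use std::sync::Arc;

/// Stand-in for a logical plan (hypothetical, not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A plain scan of a named table.
    Scan(String),
    /// Hypothetical extension node: marks its input as cached and keeps the lineage.
    Cached(Box<Plan>),
    /// Models the current behaviour: results already materialized in memory.
    MemTable,
}

/// Assumed factory signature: takes a DataFrame, returns a DataFrame.
type CacheProducer = Arc<dyn Fn(&DataFrame) -> Result<DataFrame, String> + Send + Sync>;

/// Stand-in session state holding the optional, user-configured factory.
#[derive(Clone, Default)]
struct SessionState {
    cache_producer: Option<CacheProducer>,
}

/// Stand-in DataFrame: a plan plus the state it was created with.
#[derive(Clone)]
struct DataFrame {
    state: SessionState,
    plan: Plan,
}

impl DataFrame {
    /// Delegates to the configured producer, or falls back to eager materialization.
    fn cache(self) -> Result<DataFrame, String> {
        match self.state.cache_producer.clone() {
            Some(producer) => producer(&self),
            // Fallback: pretend the plan was executed and collected into memory.
            None => Ok(DataFrame { state: self.state, plan: Plan::MemTable }),
        }
    }
}

/// Example producer that only wraps the plan, so nothing runs until the user decides.
fn lazy_cache_producer() -> CacheProducer {
    Arc::new(|df: &DataFrame| -> Result<DataFrame, String> {
        Ok(DataFrame {
            state: df.state.clone(),
            plan: Plan::Cached(Box::new(df.plan.clone())),
        })
    })
}
```

With no producer configured, `cache` behaves as it does today; with `lazy_cache_producer` installed, it only returns a new plan node and leaves execution to the caller.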
