waynexia commented on PR #17242: URL: https://github.com/apache/datafusion/pull/17242#issuecomment-3243522092
Thank you for this innovative discussion! Learned a lot from it 💯 After digging a bit into related code, I got a few different opinions. First, I was wondering how `DataSourceExec` yields `RecordBatchStream` after all. As mentioned by @comphead, details of this part from previous diagram: >Probably need to reorganize recordbatch block, because as per diagram it doesn't participate in the flow, but the recordbatch it is the exact we waiting from the DataSourceExec <img width="169" height="268" alt="image" src="https://github.com/user-attachments/assets/296ec961-df33-41d3-a504-2cc08941c651" /> Here is a brief walk through of current implementation: ``` DataSourceExec --holds--> DataSource --use--> FileScanConfig::open --holds--> FileSource --call--> FileSource::create_file_opener(.., FileScanConfig) --get--> FileOpener --used by--> FileStream --get--> Future<RecordBatchStream> ``` Some unreasonable points to me are: - Too many abstractions, we have File{Format|Source|Stream|Opener|ScanConfig}. Some among them are very shallow, and even with duplications. Like batch, project, limit etc between`FileScanConfig` and `FileSource`. These duplications are removed in this PR (by the way, `partition` parameter is still duplicated). Those types are - `FileScanConfig`: general, base config, one of `DataSource` - `FileSource`: per-format, format-specific configs. Can be an empty struct for simple formats. - `FileOpener`: per-format, use static configs (`FileScanConfig` and `FileSource`) and other runtime info (like `FileMeta`) to get a *future of* `RecordBatchStream`. One-to-one associated with `FileSource` and `FileFormat` - `FileStream`: general, iterates over file segments (stored in `FileScanConfig`) got from `FileOpener` - `FileScanConfig` holds `FileSource`, but it's also a parameter of `FileSource::create` - `FileOpener` returns a future of future (`pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>` - `FileScanConfig` is acting as the top level entrance to end users, but it's actually the middle layer, a common part of config, between specific file format and actual data stream. And that might be the cause of the second point And the final one, `DataSourceExec` doesn't look like a good abstraction to end user to me because it - can't be used with where file and memory are both available (e.g., cached record batch). - is not the best choice when I just want to read some files. I made an example to prove this in #17369 - from functionality aspect should just be a stream merger (an exec plan that merges multiple `RecordBatchStream`s, either memory of file, like `CombinedRecordBatchStream`) and an adapter (proxy adjusting/property APIs on `ExecutionPlan` to record batch provider). ------------------------------- I'd like to propose a few changes based on current PR (as a discussion in this refactor topic, not as a review comment and might be too large for one PR). - Remove `FileSource` trait A fundamental thing we want to achieve from those types is `Base Config + Format Config + Runtime Info -> File Data`. `FileSource` as an *optional* format config doesn't need to be a general trait. Moving it to `FileFormat` looks more natural and closer to where it's being used. - Use `FileFormat` as the top-level entrance `FileFormat` takes `FilsScanConfig` and format config stored in it to produce the associated `FileOpener`. - Change the return type of `FileOpener::open` to just a stream to simplify the logic of `FileStream` This can likely simplify the logic of `FileStream` by a lot https://github.com/apache/datafusion/blob/3abee73318f9f422e1b31ff8bb6a42362837d71d/datafusion/datasource/src/file_stream.rs#L137-L324 - `FileFormat::create_physical_plan` can be simplified. Its implementations are identical -------------------------- A diagram covers above proposed changes (drawio xml embedded): <img width="2195" height="2555" alt="Untitled Diagram (3)" src="https://github.com/user-attachments/assets/74ec2a26-dcb6-49df-8960-58f1acf5f273" /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
