waynexia commented on PR #17242:
URL: https://github.com/apache/datafusion/pull/17242#issuecomment-3243522092

   Thank you for this innovative discussion! Learned a lot from it 💯 
   
   After digging a bit into related code, I got a few different opinions. 
First, I was wondering how `DataSourceExec` yields `RecordBatchStream` after 
all. As mentioned by @comphead, details of this part from previous diagram:
   
   >Probably need to reorganize recordbatch block, because as per diagram it 
doesn't participate in the flow, but the recordbatch it is the exact we waiting 
from the DataSourceExec
   
   <img width="169" height="268" alt="image" 
src="https://github.com/user-attachments/assets/296ec961-df33-41d3-a504-2cc08941c651";
 />
   
   
   Here is a brief walk through of current implementation:
   ```
   DataSourceExec --holds--> DataSource
     --use--> FileScanConfig::open
     --holds--> FileSource --call--> FileSource::create_file_opener(.., 
FileScanConfig)
     --get--> FileOpener
     --used by--> FileStream
     --get--> Future<RecordBatchStream>
   ```
   
   Some unreasonable points to me are:
   - Too many abstractions, we have 
File{Format|Source|Stream|Opener|ScanConfig}. Some among them are very shallow, 
and even with duplications. Like batch, project, limit etc 
between`FileScanConfig` and `FileSource`. These duplications are removed in 
this PR (by the way, `partition` parameter is still duplicated). Those types are
     - `FileScanConfig`: general, base config, one of `DataSource`
     - `FileSource`: per-format, format-specific configs. Can be an empty 
struct for simple formats.
     - `FileOpener`: per-format, use static configs (`FileScanConfig` and 
`FileSource`) and other runtime info (like `FileMeta`) to get a *future of* 
`RecordBatchStream`. One-to-one associated with `FileSource` and `FileFormat`
     - `FileStream`: general, iterates over file segments (stored in 
`FileScanConfig`) got from `FileOpener`
   - `FileScanConfig` holds `FileSource`, but it's also a parameter of 
`FileSource::create`
   - `FileOpener` returns a future of future (`pub type FileOpenFuture = 
BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>`
   - `FileScanConfig` is acting as the top level entrance to end users, but 
it's actually the middle layer, a common part of config, between specific file 
format and actual data stream. And that might be the cause of the second point
   
   And the final one, `DataSourceExec` doesn't look like a good abstraction to 
end user to me because it
   - can't be used with where file and memory are both available (e.g., cached 
record batch).
   - is not the best choice when I just want to read some files. I made an 
example to prove this in #17369 
   - from functionality aspect should just be a stream merger (an exec plan 
that merges multiple `RecordBatchStream`s, either memory of file, like 
`CombinedRecordBatchStream`) and an adapter (proxy adjusting/property APIs on 
`ExecutionPlan` to record batch provider).
   
   -------------------------------
   
   I'd like to propose a few changes based on current PR (as a discussion in 
this refactor topic, not as a review comment and might be too large for one PR).
   
   - Remove `FileSource` trait
   
     A fundamental thing we want to achieve from those types is `Base Config + 
Format Config + Runtime Info -> File Data`. `FileSource` as an *optional* 
format config doesn't need to be a general trait. Moving it to `FileFormat` 
looks more natural and closer to where it's being used. 
   
   - Use `FileFormat` as the top-level entrance
   
     `FileFormat` takes `FilsScanConfig` and format config stored in it to 
produce the associated `FileOpener`.
   
   - Change the return type of `FileOpener::open` to just a stream to simplify 
the logic of `FileStream`
   
     This can likely simplify the logic of `FileStream` by a lot 
https://github.com/apache/datafusion/blob/3abee73318f9f422e1b31ff8bb6a42362837d71d/datafusion/datasource/src/file_stream.rs#L137-L324
   
   - `FileFormat::create_physical_plan` can be simplified. Its implementations 
are identical
   
   --------------------------
   
   A diagram covers above proposed changes (drawio xml embedded):
   
   <img width="2195" height="2555" alt="Untitled Diagram (3)" 
src="https://github.com/user-attachments/assets/74ec2a26-dcb6-49df-8960-58f1acf5f273";
 />
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to