cloud-fan commented on PR #49961: URL: https://github.com/apache/spark/pull/49961#issuecomment-2705841733
For DS v2, the scan workflow is:

1. The analyzer gets a `Table` from the `TableProvider` and puts it in `DataSourceV2Relation` (for batch scans) or `StreamingRelationV2` (for streaming scans). The `Table` reports its schema because the relation logical plan needs it.
2. The optimizer gets a `Scan` from the `Table` by doing operator pushdown and turns the relation into `DataSourceV2ScanRelation`. (Streaming does not support filter pushdown yet.)
3. The planner gets a `Batch` or `Stream` from the `Scan` to build the physical scan node.

The Python data source API is a bit different from DS v2 and there is no 1-1 mapping. I think the current mapping is:

1. The analyzer gets the pickled Python `DataSource` instance and keeps it in `PythonDataSourceV2`, which extends `TableProvider`. `PythonTable` (which extends `Table`) simply references `PythonDataSourceV2`.
2. The optimizer gets a `PythonScan` (which extends `Scan`) from the `PythonTable`. `PythonScan` also simply references `PythonTable` and does nothing else.
3. The planner gets a `PythonBatch` (which extends `Batch`) from the `PythonScan`, and this step does the real work: it launches a Python worker to create the Python batch reader and pickle it. The pickled instance is kept in `PythonBatch`.

Now, to push down filters, we need to create the Python batch reader earlier, which means one more round of Python worker communication in the optimizer. I'm wondering: once we finish pushdown, shall we do the planning work immediately and keep `PythonDataSourceReadInfo` in `PythonScan`?
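To make the trade-off concrete, here is a minimal mock of the optimizer/planner handoff in plain Python. It is a sketch only: the class and method names (`PythonWorker`, `push_filters`, `to_batch`, `ReadInfo`) are illustrative stand-ins, not the actual Spark or Python data source APIs. It models the proposal above, where pushdown triggers the single worker round trip and the resulting read info is cached on the scan, so the planner reuses it instead of launching the worker again.

```python
from dataclasses import dataclass


@dataclass
class ReadInfo:
    """Stand-in for PythonDataSourceReadInfo: the pickled reader plus partitions."""
    pickled_reader: bytes
    partitions: list


class PythonWorker:
    """Mock of the out-of-process Python worker; counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def create_reader(self, pushed_filters):
        # One round trip: the worker builds the batch reader with the
        # pushed filters applied and returns it pickled.
        self.round_trips += 1
        return ReadInfo(pickled_reader=b"<pickled>", partitions=[0, 1])


class Scan:
    """Models the proposed PythonScan: do the planning work right after
    pushdown and keep the read info, so to_batch() is free."""

    def __init__(self, worker):
        self.worker = worker
        self.read_info = None

    def push_filters(self, filters):
        # Optimizer phase: the single worker round trip happens here.
        self.read_info = self.worker.create_reader(filters)

    def to_batch(self):
        # Planner phase: reuse the cached read info, no extra round trip.
        assert self.read_info is not None, "pushdown must run before planning"
        return self.read_info


worker = PythonWorker()
scan = Scan(worker)
scan.push_filters(["x > 1"])   # optimizer: one worker round trip
batch = scan.to_batch()        # planner: zero additional round trips
print(worker.round_trips)      # 1
```

Without the caching, the optimizer would need one round trip for pushdown and the planner a second one to create the reader; caching `ReadInfo` on the scan keeps it to one.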
