cloud-fan commented on PR #49961: URL: https://github.com/apache/spark/pull/49961#issuecomment-2705841733
For DS v2, the scan workflow is:

1. The analyzer gets a `Table` from the `TableProvider` and puts it in `DataSourceV2Relation` (for batch scans) or `StreamingRelationV2` (for streaming scans). The `Table` reports its schema because the relation logical plan needs it.
2. The optimizer gets a `Scan` from the `Table` by doing operator pushdown and turns the relation into `DataSourceV2ScanRelation`. (Streaming does not support filter pushdown yet.)
3. The planner gets a `Batch` or `Stream` from the `Scan` to build the physical scan node.

The Python data source API is a bit different from DS v2 and there is no 1-1 mapping. I think the current mapping is:

1. The analyzer gets the pickled Python `DataSource` instance and keeps it in `PythonDataSourceV2`, which extends `TableProvider`. `PythonTable` (which extends `Table`) simply references `PythonDataSourceV2`.
2. The optimizer gets a `PythonScan` (which extends `Scan`) from the `PythonTable`. `PythonScan` also simply references `PythonTable` and does nothing else.
3. The planner gets a `PythonBatch` (which extends `Batch`) from the `PythonScan`, and this step does the real work: it launches a Python worker to create the Python batch reader and pickle it. The pickled instance is kept in `PythonBatch`.

Now, to push down filters, we need to create the Python batch reader earlier, which means one more round of Python worker communication in the optimizer. I'm wondering: once we finish pushdown, shall we do the planning work immediately and keep `PythonDataSourceReadInfo` in `PythonScan`?
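To make the trade-off concrete, here is a minimal mock of the optimizer/planner handoff in plain Python. It is a sketch only: the class and method names (`PythonWorker`, `push_filters`, `to_batch`, `ReadInfo`) are illustrative stand-ins, not the actual Spark or Python data source APIs. It models the proposal above, where pushdown triggers the single worker round trip and the resulting read info is cached on the scan, so the planner reuses it instead of launching the worker again.

```python
from dataclasses import dataclass


@dataclass
class ReadInfo:
    """Stand-in for PythonDataSourceReadInfo: the pickled reader plus partitions."""
    pickled_reader: bytes
    partitions: list


class PythonWorker:
    """Mock of the out-of-process Python worker; counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def create_reader(self, pushed_filters):
        # One round trip: the worker builds the batch reader with the
        # pushed filters applied and returns it pickled.
        self.round_trips += 1
        return ReadInfo(pickled_reader=b"<pickled>", partitions=[0, 1])


class Scan:
    """Models the proposed PythonScan: do the planning work right after
    pushdown and keep the read info, so to_batch() is free."""

    def __init__(self, worker):
        self.worker = worker
        self.read_info = None

    def push_filters(self, filters):
        # Optimizer phase: the single worker round trip happens here.
        self.read_info = self.worker.create_reader(filters)

    def to_batch(self):
        # Planner phase: reuse the cached read info, no extra round trip.
        assert self.read_info is not None, "pushdown must run before planning"
        return self.read_info


worker = PythonWorker()
scan = Scan(worker)
scan.push_filters(["x > 1"])   # optimizer: one worker round trip
batch = scan.to_batch()        # planner: zero additional round trips
print(worker.round_trips)      # 1
```

Without the caching, the optimizer would need one round trip for pushdown and the planner a second one to create the reader; caching `ReadInfo` on the scan keeps it to one.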
