[ 
https://issues.apache.org/jira/browse/SPARK-47793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-47793:
------------------------------------

    Assignee: Chaoqin Li

> Implement SimpleDataSourceStreamReader for python streaming data source
> -----------------------------------------------------------------------
>
>                 Key: SPARK-47793
>                 URL: https://issues.apache.org/jira/browse/SPARK-47793
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SS
>    Affects Versions: 3.5.1
>            Reporter: Chaoqin Li
>            Assignee: Chaoqin Li
>            Priority: Major
>              Labels: pull-request-available
>
>  SimpleDataSourceStreamReader is a simplified version of the DataStreamReader 
> interface.
>  # It doesn’t require developers to reason about data partitioning.
>  # It doesn’t require getting the latest offset before reading data.
> There are 3 functions that needs to be defined 
> 1. Read data and return the end offset.
> _def read(self, start: Offset) -> (Iterator[Tuple], Offset)_
> 2. Read data between start and end offset, this is required for exactly once 
> read.
> _def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]_
> 3. initial start offset of the streaming query.
> def initialOffset() -> dict
> Implementation: Wrap the SimpleDataSourceStreamReader instance in a 
> DataSourceStreamReader internally and make the prefetching and caching 
> transparent to the data source developer. The record prefetched in python 
> process will be sent to JVM as arrow record batches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to