[ https://issues.apache.org/jira/browse/SPARK-47793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim reassigned SPARK-47793: ------------------------------------ Assignee: Chaoqin Li > Implement SimpleDataSourceStreamReader for python streaming data source > ----------------------------------------------------------------------- > > Key: SPARK-47793 > URL: https://issues.apache.org/jira/browse/SPARK-47793 > Project: Spark > Issue Type: New Feature > Components: PySpark, SS > Affects Versions: 3.5.1 > Reporter: Chaoqin Li > Assignee: Chaoqin Li > Priority: Major > Labels: pull-request-available > > SimpleDataSourceStreamReader is a simplified version of the DataStreamReader > interface. > # It doesn’t require developers to reason about data partitioning. > # It doesn’t require getting the latest offset before reading data. > There are 3 functions that needs to be defined > 1. Read data and return the end offset. > _def read(self, start: Offset) -> (Iterator[Tuple], Offset)_ > 2. Read data between start and end offset, this is required for exactly once > read. > _def read2(self, start: Offset, end: Offset) -> Iterator[Tuple]_ > 3. initial start offset of the streaming query. > def initialOffset() -> dict > Implementation: Wrap the SimpleDataSourceStreamReader instance in a > DataSourceStreamReader internally and make the prefetching and caching > transparent to the data source developer. The record prefetched in python > process will be sent to JVM as arrow record batches. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org