HeartSaVioR commented on code in PR #45977:
URL: https://github.com/apache/spark/pull/45977#discussion_r1580636552


##########
python/pyspark/sql/datasource.py:
##########
@@ -469,6 +501,188 @@ def stop(self) -> None:
         ...
 
 
+class SimpleInputPartition(InputPartition):
+    def __init__(self, start: dict, end: dict):
+        self.start = start
+        self.end = end
+
+
+class SimpleDataSourceStreamReader(ABC):
+    """
+    A base class for simplified streaming data source readers. Compared to
+    DataSourceStreamReader, SimpleDataSourceStreamReader doesn't require planning
+    data partitioning. Also, the read API of SimpleDataSourceStreamReader allows
+    reading data and planning the latest offset at the same time.
+
+    .. versionadded:: 4.0.0
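+
+    Examples
+    --------
+    A minimal sketch of a subclass (the class name, offsets, and rows are
+    illustrative only):
+
+    >>> class MySimpleStreamReader(SimpleDataSourceStreamReader):
+    ...     def initialOffset(self):
+    ...         return {"index": 0}
+    ...     def read(self, start):
+    ...         rows = iter([(start["index"], "row")])
+    ...         return (rows, {"index": start["index"] + 1})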
+    """
+
+    def initialOffset(self) -> dict:
+        """
+        Return the initial offset of the streaming data source.
+        A new streaming query starts reading data from the initial offset.
+        If Spark is restarting an existing query, it will restart from the
+        checkpointed offset rather than the initial one.
+
+        Returns
+        -------
+        dict
+            A dict or recursive dict whose keys and values are primitive types,
+            which include Integer, String and Boolean.
+
+        Examples
+        --------
+        >>> def initialOffset(self):
+        ...     return {"parititon-1": {"index": 3, "closed": True}, 
"partition-2": {"index": 5}}
+        """
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "initialOffset"},
+        )
+
+    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
+        """
+        Read all available data from the specified start offset and return the
+        offset that the next read attempt starts from.
+
+        Parameters
+        ----------
+        start : dict
+            The start offset to start reading from.
+
+        Returns
+        -------
+        A tuple of an iterator of :class:`Tuple` and a dict
+            The iterator contains all the available records after the start offset.
+            The dict is the end offset of this read attempt and the start offset
+            of the next read attempt.
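+
+        Examples
+        --------
+        A hypothetical sketch; the rows and offset layout are illustrative only:
+
+        >>> def read(self, start):
+        ...     it = iter([(1, "v1"), (2, "v2")])
+        ...     return (it, {"index": start["index"] + 2})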
+        """
+        raise PySparkNotImplementedError(
+            error_class="NOT_IMPLEMENTED",
+            message_parameters={"feature": "read"},
+        )
+
+    def read2(self, start: dict, end: dict) -> Iterator[Tuple]:

Review Comment:
   The name itself might be OK. Maybe we have the option to make both method
   names self-descriptive (not just "read"), but if we prefer a shorter name,
   it may be OK for either one to be "read". For illustration, see the sketch
   below.
   
   I see a bigger issue in the implementation. Let's address that first.
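   
   For illustration only, a sketch of the self-descriptive option (the names
   readFrom and readBetweenOffsets are hypothetical placeholders, not a
   concrete proposal for this PR):
   
       from typing import Iterator, Tuple
   
       from pyspark.sql.datasource import SimpleDataSourceStreamReader
   
       class MyReader(SimpleDataSourceStreamReader):
           # Hypothetical names; either one could be shortened back to "read".
           def readFrom(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
               ...
   
           def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]:
               ...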


