bogao007 commented on code in PR #47933:
URL: https://github.com/apache/spark/pull/47933#discussion_r1773807929
##########
python/pyspark/sql/streaming/stateful_processor.py:
##########
@@ -99,25 +99,25 @@ def exists(self) -> bool:
         """
         return self._list_state_client.exists(self._state_name)

-    def get(self) -> Iterator[Row]:
+    def get(self) -> Iterator[Tuple]:

Review Comment:
   Yep, we store the state as `Row` in the state store. As for the pickling, I followed what `ApplyInPandasWithStateWriter` does for group state: it serializes the `Row` state to bytes on the JVM side:
   https://github.com/apache/spark/blob/55d0233d19cc52bee91a9619057d9b6f33165a0a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ApplyInPandasWithStateWriter.scala#L182
   and then deserializes it with pickle on the Python side:
   https://github.com/apache/spark/blob/55d0233d19cc52bee91a9619057d9b6f33165a0a/python/pyspark/sql/pandas/serializers.py#L837

   Are you saying we shouldn't use `pickleSer.loads` in the value state `get()`? If so, could you share your concerns here?

   An alternative I can think of is to have value state also use Arrow to transmit the state row from the JVM to the Python side and deserialize it there, just like list state. Let me know your thoughts on this, thanks!
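To make the pickle path above concrete, here is a minimal sketch of the Python-side deserialization, assuming the JVM has already pickled the state `Row` to bytes (as `ApplyInPandasWithStateWriter` does for group state). `deserialize_value_state` and `_pickle_ser` are illustrative names only, not part of the PR or the actual state client API:

```python
from typing import Optional

from pyspark.serializers import CPickleSerializer
from pyspark.sql import Row

_pickle_ser = CPickleSerializer()


def deserialize_value_state(raw_bytes: Optional[bytes]) -> Optional[Row]:
    """Turn pickled bytes written on the JVM side back into a Row.

    Hypothetical helper mirroring the `pickleSer.loads` step discussed above;
    it assumes the JVM has already pickled the state Row to bytes.
    """
    if raw_bytes is None:
        return None
    return _pickle_ser.loads(raw_bytes)
```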
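And a minimal sketch of the Arrow alternative mentioned at the end, assuming the JVM would ship the value state as a one-row Arrow record batch, the same way list state rows are transferred. `arrow_batch_to_row` is likewise only an illustration:

```python
from typing import Optional

import pyarrow as pa

from pyspark.sql import Row


def arrow_batch_to_row(batch: pa.RecordBatch) -> Optional[Row]:
    """Rebuild a single state Row from a one-row Arrow record batch.

    Hypothetical helper: assumes the value state arrives as an Arrow batch
    rather than as pickled bytes.
    """
    if batch.num_rows == 0:
        return None
    values = {
        name: column[0].as_py()
        for name, column in zip(batch.schema.names, batch.columns)
    }
    return Row(**values)
```

The trade-off being discussed is essentially pickle's simplicity for a single state row versus keeping value state on the same Arrow transfer path that list state already uses.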