[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874538#comment-16874538 ]

Jungtaek Lim commented on SPARK-28192:
--------------------------------------

I realized the new DSv2 (maybe the old DSv2 too?) requires the DataFrame to be 
partitioned correctly before it is handed to the sink. That assumption doesn't 
hold for the state writer, as there's no storage coordinating this: it has to 
repartition by key itself, which was possible with DSv1 (since it provides the 
DataFrame to write) but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]
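For illustration, a minimal sketch of what such key-based repartitioning could 
look like before writing (the helper name, key columns, and partition count are 
assumptions for illustration, not the actual StateStoreWriter code):

{code:scala}
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: shuffle the batch DataFrame by its state key columns so
// that each task writes exactly one state store partition. Streaming state is
// hash-partitioned by the grouping key, so the batch data has to match that layout.
def repartitionByStateKey(df: DataFrame,
                          keyCols: Seq[String],
                          numStatePartitions: Int): DataFrame = {
  df.repartition(numStatePartitions, keyCols.map(df.col): _*)
}
{code}

With DSv1 the writer could apply something like this to the DataFrame it 
receives, but with DSv2 there is no obvious place to inject that shuffle.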

[~rdblue] [~cloud_fan] What would be the best way to address this? Would I need 
to wrap this with some method that handles the repartitioning before adding to 
the sink?

> Data Source - State - Write side
> --------------------------------
>
>                 Key: SPARK-28192
>                 URL: https://issues.apache.org/jira/browse/SPARK-28192
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> This issue tracks the efforts on addressing batch write on state data source.
> It could include "state repartition" if it doesn't require huge effort with the 
> new DSv2, but it can also be moved out to a separate issue.


