[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874538#comment-16874538 ]
Jungtaek Lim commented on SPARK-28192:
--------------------------------------

I realized that the new DSv2 (maybe the old DSv2 as well?) requires the DataFrame to be partitioned correctly before it is handed to the sink. The state writer cannot rely on this, as there is no storage layer coordinating the partitioning. It has to repartition by key itself, which was possible with DSv1 (since DSv1 exposes the DataFrame being written) but is no longer possible with DSv2.

[https://github.com/HeartSaVioR/spark-state-tools/blob/2f97f264186e852144e7ec3f9b2ab3dda4e45179/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L63-L75]

[~rdblue] [~cloud_fan] What would be the best way to address this? Would I need to wrap the write with some method that handles repartitioning before handing the data to the sink?

> Data Source - State - Write side
> --------------------------------
>
>                 Key: SPARK-28192
>                 URL: https://issues.apache.org/jira/browse/SPARK-28192
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> This issue tracks the effort to support batch writes in the state data source.
> It could include "state repartition" if that doesn't require huge effort under
> the new DSv2, but it can also be moved out to a separate issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
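
[Editor's sketch] The DSv1-era approach referenced in the comment — repartitioning the DataFrame by its key columns before passing it to the state writer — might look roughly like the following. This is an illustrative sketch only, not the actual spark-state-tools code; the object name, column names, and `numShufflePartitions` value are hypothetical, and it assumes `Dataset.repartition(numPartitions, partitionExprs*)` produces a partitioning compatible with how the state store laid out its keys.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object StateRepartitionSketch {

  // Repartition the DataFrame by the state key columns so that each write
  // task touches only a single state store partition. In DSv1 the writer
  // receives the DataFrame and can do this itself; under DSv2 the sink only
  // sees per-partition rows, so this step would have to happen beforehand.
  def repartitionForStateWrite(
      df: DataFrame,
      keyColumns: Seq[String],          // hypothetical: the state grouping keys
      numShufflePartitions: Int         // must match the state's partition count
  ): DataFrame = {
    df.repartition(numShufflePartitions, keyColumns.map(col): _*)
  }
}
```

A DSv2-side answer could instead let the source declare a required distribution so Spark performs this shuffle automatically, which is essentially what the question to [~rdblue] and [~cloud_fan] is probing.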