[ 
https://issues.apache.org/jira/browse/SPARK-38684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-38684:
------------------------------------

    Assignee: Jungtaek Lim

> Stream-stream outer join has a possible correctness issue due to weakly read 
> consistent on outer iterators
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38684
>                 URL: https://issues.apache.org/jira/browse/SPARK-38684
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.2.1, 3.3.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Blocker
>              Labels: correctness
>
> We figured out stream-stream join has the same issue with SPARK-38320 on the 
> appended iterators. Since the root cause is same as SPARK-38320, this is only 
> reproducible with RocksDB state store provider, but even with HDFS backed 
> state store provider, it is not guaranteed by interface contract hence may 
> depend on the JVM vendor, version, etc.
> I can easily construct the scenario of “data loss” in state store.
> Condition follows:
>  * Use stream-stream time interval outer join
>  ** left outer join has an issue on left side, right outer join has an issue 
> on right side, full outer join has an issue on both sides
>  * At batch N, produce row(s) on the problematic side which are non-late
>  * At the same batch (batch N), some row(s) on the problematic side should be 
> evicted by watermark condition
> When the condition is fulfilled, out of sync happens with keyToNumValues 
> between state and the iterator in evict phase. If eviction of the row happens 
> for the grouping key (updating keyToNumValues), the eviction phase 
> “overwrites” keyToNumValues in the state as the value it calculates.
> Given that the eviction phase “do not know” about the new rows 
> (keyToNumValues is out of sync), effectively discarding all rows from the 
> state being added in the batch N.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to