Jungtaek Lim created SPARK-38684:
------------------------------------

             Summary: Stream-stream outer join has a possible correctness issue 
due to weakly read consistent on outer iterators
                 Key: SPARK-38684
                 URL: https://issues.apache.org/jira/browse/SPARK-38684
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.2.1, 3.3.0
            Reporter: Jungtaek Lim


We figured out stream-stream join has the same issue with SPARK-38320 on the 
appended iterators. Since the root cause is same as SPARK-38320, this is only 
reproducible with RocksDB state store provider, but even with HDFS backed state 
store provider, it is not guaranteed by interface contract hence may depend on 
the JVM vendor, version, etc.

I can easily construct the scenario of “data loss” in state store.

Condition follows:
 * Use stream-stream time interval outer join

 ** left outer join has an issue on left side, right outer join has an issue on 
right side, full outer join has an issue on both sides

 * At batch N, produce row(s) on the problematic side which are non-late

 * At the same batch (batch N), some row(s) on the problematic side should be 
evicted by watermark condition

When the condition is fulfilled, out of sync happens with keyToNumValues 
between state and the iterator in evict phase. If eviction of the row happens 
for the grouping key (updating keyToNumValues), the eviction phase “overwrites” 
keyToNumValues in the state as the value it calculates.

Given that the eviction phase “do not know” about the new rows (keyToNumValues 
is out of sync), effectively discarding all rows from the state being added in 
the batch N.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to