Jungtaek Lim created SPARK-35611:
------------------------------------

             Summary: Introduce the strategy on mismatched offset for start 
offset timestamp on Kafka data source
                 Key: SPARK-35611
                 URL: https://issues.apache.org/jira/browse/SPARK-35611
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.1.1, 3.0.2
            Reporter: Jungtaek Lim


1. Rationalization

We encountered a real-world case Spark fails the query if some of the 
partitions don't have matching offset by timestamp.

This is intended behavior to avoid bring unintended output for some cases like:

* timestamp 2 is presented as timestamp-offset, but the some of partitions 
don't have the record yet
* record with timestamp 1 comes "later" in the following micro-batch

which is possible since Kafka allows to specify the timestamp in record.

Here the unintended output we talked about was the risk of reading record with 
timestamp 1 in the next micro-batch despite the option specifying timestamp 2.

But for many cases end users just suppose timestamp is increasing 
monotonically, and current behavior blocks these cases to make progress.

2. Proposal

For the cases the timestamp is supposed to increase monotonically, it's safe to 
consider the offset to be latest (technically, offset for latest record + 1) if 
there's no matching record via timestamp.

This would be pretty much helpful for the case where there's a skew between 
partitions and some partitions have older records.

* AS-IS: Spark simply fails the query and end users have to deal with 
workarounds requiring manual steps.
* TO-BE: Spark will assign the latest offset for these partitions, so that 
Spark can read newer records from these partitions in further micro-batches.

To retain the existing behavior and also give some help for the proposed 
"TO-BE" behavior, we'd like to introduce the strategy on mismatched offset for 
start offset timestamp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to