Eric Marnadi created SPARK-55795:
------------------------------------
Summary: Add automatic V1 to V2 offset log upgrade for streaming
queries with named sources
Key: SPARK-55795
URL: https://issues.apache.org/jira/browse/SPARK-55795
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.2.0
Reporter: Eric Marnadi
Introduce an automatic offset log upgrade mechanism that allows streaming
queries to migrate from V1 (positional) offset tracking to V2 (named) offset
tracking when users add {{.name()}} to their streaming sources.
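The difference between positional and named tracking can be illustrated with a toy model. This is only a sketch: the function {{upgrade_v1_to_v2}}, the source names, and the offset strings are hypothetical, and it does not reflect Spark's on-disk offset log format or the actual upgrade code. It assumes the names are supplied in the same order as the sources' V1 positions.

```python
from typing import Dict, List


def upgrade_v1_to_v2(v1_offsets: List[str], source_names: List[str]) -> Dict[str, str]:
    """Toy model: re-key positional (V1) offsets by source name (V2).

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    # Every positional slot needs exactly one name, otherwise the mapping is ambiguous.
    if len(v1_offsets) != len(source_names):
        raise ValueError("each source needs exactly one name")
    # Names act as keys in the V2 model, so they must be unique.
    if len(set(source_names)) != len(source_names):
        raise ValueError("source names must be unique")
    # Assumption: source_names[i] is the name given to the source at V1 index i.
    return dict(zip(source_names, v1_offsets))


# V1: an offset is identified only by its position in the list.
v1 = ['{"topic-a":{"0":42}}', '{"topic-b":{"0":7}}']
# V2: the same offsets, identified by the (made-up) source names.
v2 = upgrade_v1_to_v2(v1, ["orders", "payments"])
print(v2)
```

In this model the offset values themselves are unchanged; only the way they are looked up moves from an index to a name.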
Currently, when users want to migrate from V1 (index-based) to V2 (name-based)
offset tracking, they must:
# Delete their checkpoint directory (losing all state)
# Start fresh
This is problematic because:
* *State loss*: All stateful operators (aggregations, joins,
deduplication) lose their state
* *Data reprocessing*: The query must reprocess all historical data from the
beginning
* *Downtime*: The migration requires stopping the query and careful coordination
With this change, users can safely migrate existing V1 offset logs to V2 format
by:
# Adding {{.name()}} to all streaming sources
# Setting {{spark.sql.streaming.offsetLog.formatVersion=2}}
# Setting {{spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled=true}}
# Restarting the query
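Steps 2 and 3 could be supplied at restart time as {{--conf}} arguments to {{spark-submit}}. The sketch below only builds that argument list. The helper {{to_submit_args}} is hypothetical. The {{autoUpgrade}} key is the one proposed in this ticket and may not exist in released builds. Whether these settings can also be changed on a running session is not stated here.

```python
from typing import Dict, List

# Settings named in this ticket. The autoUpgrade key is proposed by SPARK-55795,
# so it is an assumption that a given Spark build recognizes it.
UPGRADE_CONFS: Dict[str, str] = {
    "spark.sql.streaming.offsetLog.formatVersion": "2",
    "spark.sql.streaming.offsetLog.v1ToV2.autoUpgrade.enabled": "true",
}


def to_submit_args(confs: Dict[str, str]) -> List[str]:
    """Render settings as spark-submit '--conf key=value' arguments (hypothetical helper)."""
    args: List[str] = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Print the arguments one could append to the spark-submit command on restart.
print(" ".join(to_submit_args(UPGRADE_CONFS)))
```

Step 1, adding {{.name()}} to every streaming source, is a change to the query code itself and is not covered by this sketch.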
The upgrade preserves all state and offset positions, enabling a seamless
transition to the more flexible V2 format, which supports source evolution
(adding/removing sources by name).