[ 
https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613073#comment-17613073
 ] 

Mohan Parthasarathy commented on SPARK-40659:
---------------------------------------------

Okay, I see what you are saying. There are two issues. Learning the new schema 
when schema changes and a new version of the application that understands the 
new field.

Learning the new schema is simple as you indicate in the description. When you 
are deserializing, you use the schema-id from the message to get the schema and 
if it is available in the local cache, that is returned. Otherwise, you fetch 
the schema from the registry. (at least that's how confluent schema APIs work i 
think). So, when the schema changes, ID would change and that would 
automatically fetch the new schema. 

Regarding application restart, why should Spark take on that responsibility ? 
Customers could roll out the new service at their discretion. Why do we want to 
induce an error to cause restart ? It is possible that some service is not 
ready to handle the new schema yet ? Trying to understand the motivation here. 

> Schema evolution for protobuf (and Avro too?)
> ---------------------------------------------
>
>                 Key: SPARK-40659
>                 URL: https://issues.apache.org/jira/browse/SPARK-40659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.3.0
>            Reporter: Raghu Angadi
>            Priority: Major
>
> Protobuf & Avro should support schema evolution in streaming. We need to 
> throw a specific error message when we detect newer version of the the schema 
> in schema registry.
> A couple of options for detecting version change at runtime:
>  * How do we detect newer version from schema registry? It is contacted only 
> during planning currently.
>  * We could detect version id in coming messages.
>  ** What if the id in the incoming message is newer than what our 
> schema-registry reports after the restart?
>  *** This indicates delayed syncs between customers schema-registry servers 
> (should be rare). We can keep erroring out until it is fixed.
>  *** Make sure we log the schema id used during planning.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to