[ 
https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613099#comment-17613099
 ] 

Raghu Angadi commented on SPARK-40659:
--------------------------------------

{quote}Regarding application restart, why should Spark take on that 
responsibility ? Customers could roll out the new service at their discretion. 
Why do we want to induce an error to cause restart ?
{quote}
It is too much hassle for users to think about schema changes and re-deploy the 
pipeline. Imagine they detect the change a couple of days late: should they go 
back and reprocess that data? Often the owners of the pipeline don't even know 
if and when the schema changes, since another part of the engineering org might 
have updated it. 

This is a very common feature request from customers. It is quite impractical for 
them to think about these issues, let alone take remedial action. With the 
protobuf support we are adding, most users who write the pipeline don't even 
need to know anything about Protobuf and its schema; they just use the built-in 
function and run their Structured Streaming pipeline, thinking only in terms of 
the Spark schema. 
{quote} It is possible that some service is not ready to handle the new schema 
yet ? Trying to understand the motivation here. 
{quote}
Possible, but that should be rare. We could offer a way to disable the feature. 
Note, though, that if the user restarts the pipeline for a different reason, 
they will still pick up the new schema, and schema evolution, at that time. 

In most cases downstream systems might just ignore the extra columns. 

 

> Schema evolution for protobuf (and Avro too?)
> ---------------------------------------------
>
>                 Key: SPARK-40659
>                 URL: https://issues.apache.org/jira/browse/SPARK-40659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.3.0
>            Reporter: Raghu Angadi
>            Priority: Major
>
> Protobuf & Avro should support schema evolution in streaming. We need to 
> throw a specific error message when we detect a newer version of the schema 
> in the schema registry.
> A couple of options for detecting version change at runtime:
>  * How do we detect a newer version from the schema registry? It is currently 
> contacted only during planning.
>  * We could detect the version id in incoming messages.
>  ** What if the id in the incoming message is newer than what our 
> schema registry reports after the restart?
>  *** This indicates delayed syncs between the customer's schema-registry 
> servers (should be rare). We can keep erroring out until it is fixed.
>  *** Make sure we log the schema id used during planning.
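The check sketched in the bullets above could look roughly like the following. This is an illustrative sketch only, not Spark code: the names `check_schema_id` and `SchemaEvolutionError` are hypothetical, and it assumes each incoming message carries a monotonically increasing schema-registry id that can be compared against the id captured at planning time.

```python
# Hypothetical sketch of the runtime schema-version check described above.
# `SchemaEvolutionError` and `check_schema_id` are illustrative names, not
# actual Spark APIs.

class SchemaEvolutionError(Exception):
    """Raised to fail the query so a restart can pick up the new schema."""


def check_schema_id(planning_schema_id: int, incoming_schema_id: int) -> None:
    # Assumes registry ids increase monotonically. The planning-time id
    # should also be logged, per the last bullet above, so restarts can
    # be debugged.
    if incoming_schema_id > planning_schema_id:
        # A newer id means the registry has evolved since planning: fail
        # loudly so the restarted query re-fetches the latest schema.
        raise SchemaEvolutionError(
            f"message uses schema id {incoming_schema_id}, but the query "
            f"was planned with id {planning_schema_id}; restart the query "
            f"to pick up the new schema"
        )


# Messages with the planning-time id (or an older one) pass through.
check_schema_id(planning_schema_id=7, incoming_schema_id=7)
```

Under this design, a delayed registry sync (an incoming id newer than what the registry reports even after restart) simply keeps triggering the same error until the registry catches up, which matches the "keep erroring out until it is fixed" behavior above.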



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
