[jira] [Commented] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

Yixue (Andrew) Zhu (Jira) Sun, 23 Feb 2020 10:31:28 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043016#comment-17043016
 ]


Yixue (Andrew) Zhu commented on HUDI-603:
-----------------------------------------

I think the following approach could work:
 # Introduce a SchemaProvider subclass that retrieves the latest schema from 
the Confluent Schema Registry when needed.
 # Enhance AvroSource (or another Source subclass) to record the Avro schema id 
used for serialization, as assigned by the Confluent Schema Registry. When 
records are deserialized for compaction, translate them to the refreshed schema 
(skipping the translation when the schema ids match).
 # Register a custom Spark serializer for GenericRecord that uses the schema 
id.
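
A minimal sketch of steps 1 and 2, to make the idea concrete. Everything here is 
hypothetical and for illustration only: RegistryClient, InMemoryRegistry, 
RefreshingSchemaProvider, TaggedRecord and RecordTranslator are not Hudi or 
Confluent classes, and a "schema" is simplified to an ordered list of field 
names instead of an Avro Schema. It only shows the shape of "re-check the schema 
id on every sync round, and translate a record only when its id differs".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a schema registry client (not a real Confluent or Hudi API).
interface RegistryClient {
    int latestSchemaId(String subject);
    List<String> schemaById(int id);
}

// In-memory registry, present only so the sketch can run without external services.
class InMemoryRegistry implements RegistryClient {
    private final List<List<String>> schemas = new ArrayList<>();
    private final Map<String, Integer> latest = new HashMap<>();

    // Registers a new schema version for the subject and returns its id.
    int register(String subject, List<String> fields) {
        schemas.add(new ArrayList<>(fields));
        int id = schemas.size() - 1;
        latest.put(subject, id);
        return id;
    }

    public int latestSchemaId(String subject) { return latest.get(subject); }
    public List<String> schemaById(int id) { return schemas.get(id); }
}

// Step 1 (assumed design): a provider that asks the registry for the latest id on
// every call and re-fetches the schema body only when that id has changed.
class RefreshingSchemaProvider {
    private final RegistryClient client;
    private final String subject;
    private int cachedId = -1;
    private List<String> cachedSchema;

    RefreshingSchemaProvider(RegistryClient client, String subject) {
        this.client = client;
        this.subject = subject;
    }

    private void refresh() {
        int latestId = client.latestSchemaId(subject);
        if (latestId != cachedId) {
            cachedSchema = client.schemaById(latestId);
            cachedId = latestId;
        }
    }

    int getSchemaId() { refresh(); return cachedId; }
    List<String> getSchema() { refresh(); return cachedSchema; }
}

// Step 2 (assumed design): a record that carries the id of the schema it was written with.
class TaggedRecord {
    final int schemaId;
    final Map<String, Object> fields;

    TaggedRecord(int schemaId, Map<String, Object> fields) {
        this.schemaId = schemaId;
        this.fields = fields;
    }
}

class RecordTranslator {
    // Returns the same record when the ids match (the shortcut); otherwise projects
    // the record onto the target fields, leaving newly added fields as null.
    static TaggedRecord translate(TaggedRecord rec, int targetId, List<String> targetFields) {
        if (rec.schemaId == targetId) {
            return rec;
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : targetFields) {
            out.put(field, rec.fields.get(field));
        }
        return new TaggedRecord(targetId, out);
    }
}
```

In a real implementation the translation would presumably rely on Avro schema 
resolution (defaults, type promotion) rather than this name-based projection, 
and the sync loop would call the provider before each round instead of once at 
writeClient creation.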

> HoodieDeltaStreamer should periodically fetch table schema update
> -----------------------------------------------------------------
>
>                 Key: HUDI-603
>                 URL: https://issues.apache.org/jira/browse/HUDI-603
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Yixue Zhu
>            Priority: Major
>              Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates to 
> DeltaSync for periodic sync. However, the default implementation of 
> SchemaProvider does not refresh the schema, which can change due to schema 
> evolution. DeltaSync snapshots the schema when it creates the writeClient, 
> using the SchemaProvider instance or picking it up from the source, and the 
> schema for the writeClient is not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
