santoshsb opened a new issue, #5452:
URL: https://github.com/apache/hudi/issues/5452

   Hi Team,
   
   We are currently evaluating Hudi for our analytical use cases, and as part 
of this exercise we are facing a few issues with schema evolution and data 
loss. The issue we have currently run into occurs while updating a record. We 
first inserted a single record with the following schema:
   `
   root
    |-- birthDate: string (nullable = true)
    |-- gender: string (nullable = true)
    |-- id: string (nullable = true)
    |-- lastUpdated: string (nullable = true)
    |-- maritalStatus: struct (nullable = true)
    |    |-- coding: array (nullable = true)
    |    |    |-- element: struct (containsNull = true)
    |    |    |    |-- code: string (nullable = true)
    |    |    |    |-- display: string (nullable = true)
    |    |    |    |-- system: string (nullable = true)
    |    |-- text: string (nullable = true)
    |-- resourceType: string (nullable = true)
    |-- source: string (nullable = true)`
   
   We then insert new data with the following schema:
   
   `root
    |-- birthDate: string (nullable = true)
    |-- gender: string (nullable = true)
    |-- id: string (nullable = true)
    |-- lastUpdated: string (nullable = true)
    |-- multipleBirthBoolean: boolean (nullable = true)
    |-- resourceType: string (nullable = true)
    |-- source: string (nullable = true)`
   
   The update succeeds, but the resulting table schema is missing the
   ` |-- maritalStatus: struct (nullable = true)
    |    |-- coding: array (nullable = true)
    |    |    |-- element: struct (containsNull = true)
    |    |    |    |-- code: string (nullable = true)
    |    |    |    |-- display: string (nullable = true)
    |    |    |    |-- system: string (nullable = true)
    |    |-- text: string (nullable = true)`
   
   field. We expected that after adding the second entry, the new column 
"multipleBirthBoolean" would be added to the overall schema, and the existing 
"maritalStatus" struct column would be retained and be null for the second 
entry. Instead, the final schema looks like this:
   `root
    |-- _hoodie_commit_time: string (nullable = true)
    |-- _hoodie_commit_seqno: string (nullable = true)
    |-- _hoodie_record_key: string (nullable = true)
    |-- _hoodie_partition_path: string (nullable = true)
    |-- _hoodie_file_name: string (nullable = true)
    |-- birthDate: string (nullable = true)
    |-- gender: string (nullable = true)
    |-- id: string (nullable = true)
    |-- lastUpdated: string (nullable = true)
    |-- multipleBirthBoolean: boolean (nullable = true)
    |-- resourceType: string (nullable = true)
    |-- source: string (nullable = true)`
   
   In short, when a new entry is added that is missing a column from the 
destination schema, the update succeeds and the missing column vanishes from 
the previous entries as well. Let us know if we are missing any configuration 
options. We cannot control the schema, as it is defined by the FHIR standard 
(https://www.hl7.org/fhir/patient.html#resource). Most of the fields there are 
optional, so the incoming data from our customers will often be missing 
certain columns.
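
To make the expected behaviour concrete, here is a minimal plain-Python sketch of the merge we had in mind. It is only an illustration and does not use Hudi or Spark: the helper names `merge_field_names` and `align_record` and the sample record values are hypothetical, and nested struct fields are treated as opaque values.

```python
# Illustrative model of the expected schema merge; NOT Hudi or Spark code.
# Helper names and sample values are hypothetical.
from typing import Any, Dict, List

Record = Dict[str, Any]


def merge_field_names(table_fields: List[str], incoming_fields: List[str]) -> List[str]:
    """Union of field names: existing table fields first, then any new ones."""
    merged = list(table_fields)
    for name in incoming_fields:
        if name not in merged:
            merged.append(name)
    return merged


def align_record(record: Record, fields: List[str]) -> Record:
    """Project a record onto `fields`, filling absent fields with None (null)."""
    return {name: record.get(name) for name in fields}


# Top-level fields of the first insert (existing table schema).
table_fields = ["birthDate", "gender", "id", "lastUpdated",
                "maritalStatus", "resourceType", "source"]
# Top-level fields of the second insert (incoming batch).
incoming_fields = ["birthDate", "gender", "id", "lastUpdated",
                   "multipleBirthBoolean", "resourceType", "source"]

merged_fields = merge_field_names(table_fields, incoming_fields)

# Made-up sample records, one per insert.
first_entry = {"id": "1", "gender": "male", "maritalStatus": {"text": "Married"}}
second_entry = {"id": "2", "gender": "female", "multipleBirthBoolean": False}

rows = [align_record(r, merged_fields) for r in (first_entry, second_entry)]
```

In this model the merged schema keeps "maritalStatus" and gains "multipleBirthBoolean", and each entry simply holds null for the column it did not supply. That is the outcome we expected from the update.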
   
   **Environment Description**
   
   * Hudi version : 0.12.0-SNAPSHOT
   
   * Spark version : 3.2.1
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : Local
   
   * Running on Docker? (yes/no) : no
   
   Thanks for the help.
   

