[ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-797:
-----------------------------------

    Assignee: Prashant Wason

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-797
>                 URL: https://issues.apache.org/jira/browse/HUDI-797
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as 
> AVRO-encoded records. These records have a 
> [schema|https://avro.apache.org/docs/current/spec.html] which is determined 
> by the dataset user and provided to HUDI during the writing process (as part 
> of HoodieWriteConfig). The records are finally saved in 
> [parquet|https://parquet.apache.org/] files, which include the schema (in 
> parquet format) in the footer of each file.
>  
> HUDI's design requires adding some metadata fields to every incoming record 
> to aid in book-keeping and indexing. To achieve this, the incoming schema is 
> extended with the HUDI metadata fields; the result is called the HUDI schema 
> for the dataset. Each incoming record is then re-written to translate it 
> from the incoming schema into the HUDI schema. Re-writing a single record to 
> the new schema is reasonably fast, as it looks up all fields in the incoming 
> record and adds them to a new record, but this takes place for each and 
> every incoming record. When ingesting large datasets (billions of records) 
> or a large number of datasets, even small improvements in this CPU-bound 
> conversion translate into notable gains in compute efficiency. 
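> A minimal sketch of what this per-record rewrite involves, written against 
> the plain Avro API (the class and field handling below are illustrative 
> assumptions, not the actual HoodieAvroUtils code):
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericRecord;
> 
> public class RewriteSketch {
>   public static GenericRecord rewrite(GenericRecord record, Schema hudiSchema) {
>     // Build an empty record against the wider HUDI schema
>     // (the incoming fields plus the HUDI metadata fields).
>     GenericRecord newRecord = new GenericData.Record(hudiSchema);
>     for (Schema.Field field : hudiSchema.getFields()) {
>       // Name-based lookup, repeated for every field of every record.
>       if (record.getSchema().getField(field.name()) != null) {
>         newRecord.put(field.name(), record.get(field.name()));
>       }
>       // The HUDI metadata fields are absent from the incoming record
>       // and are left unset here, to be populated later.
>     }
>     return newRecord;
>   }
> }
> {code}
> Every field is resolved by name for every record; resolving field positions 
> once per schema pair and reusing them across records is one general way to 
> reduce this repeated CPU cost.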



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
