[ https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prashant Wason reassigned HUDI-797:
-----------------------------------

    Assignee: Prashant Wason

> Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-797
>                 URL: https://issues.apache.org/jira/browse/HUDI-797
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>
> Data is ingested into a [HUDI|https://hudi.apache.org/] dataset as AVRO-encoded records. These records have a [schema|https://avro.apache.org/docs/current/spec.html] which is determined by the dataset user and provided to HUDI during the writing process (as part of HoodieWriteConfig). The records are finally saved in [parquet|https://parquet.apache.org/] files, which embed the schema (in parquet format) in the footer of each file.
>
> HUDI's design requires adding some metadata fields to every incoming record to aid in book-keeping and indexing. To achieve this, the incoming schema is extended with the HUDI metadata fields; the result is called the HUDI schema for the dataset. Each incoming record is then re-written to translate it from the incoming schema into the HUDI schema. Re-writing a single record to the new schema is reasonably fast, as it looks up every field in the incoming record and adds it to a new record, but this work is repeated for each and every incoming record.
>
> When ingesting large datasets (billions of records) or a large number of datasets, even small improvements in this CPU-bound conversion can translate into notable gains in compute efficiency.
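> For illustration, below is a minimal sketch of the kind of per-record rewrite described above, written against Avro's GenericRecord API. The metadata field subset and the helper names (RewriteSketch, addMetadataFields, the exact rewriteRecord signature) are assumptions for this sketch, not the actual HoodieAvroUtils implementation:
> {code:java}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.avro.JsonProperties;
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericRecord;
>
> public class RewriteSketch {
>
>   // Illustrative subset of the HUDI metadata fields; the real list is
>   // defined in the Hudi codebase.
>   private static final List<String> METADATA_FIELDS =
>       Arrays.asList("_hoodie_commit_time", "_hoodie_record_key");
>
>   // Build the "HUDI schema": the incoming schema with nullable string
>   // metadata fields prepended. This runs once per write, not per record.
>   public static Schema addMetadataFields(Schema incoming) {
>     Schema nullableString =
>         Schema.createUnion(Schema.create(Schema.Type.NULL), Schema.create(Schema.Type.STRING));
>     List<Schema.Field> fields = new ArrayList<>();
>     for (String name : METADATA_FIELDS) {
>       fields.add(new Schema.Field(name, nullableString, "", JsonProperties.NULL_VALUE));
>     }
>     for (Schema.Field f : incoming.getFields()) {
>       fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
>     }
>     return Schema.createRecord(incoming.getName(), incoming.getDoc(),
>         incoming.getNamespace(), false, fields);
>   }
>
>   // Per-record rewrite: copy every field of the old record into a new record
>   // using the HUDI schema. The name-based getField()/get() lookups repeat for
>   // every field of every record, which is the CPU-bound hot path this ticket
>   // targets.
>   public static GenericRecord rewriteRecord(GenericRecord oldRecord, Schema hudiSchema) {
>     GenericRecord newRecord = new GenericData.Record(hudiSchema);
>     for (Schema.Field f : hudiSchema.getFields()) {
>       if (oldRecord.getSchema().getField(f.name()) != null) {
>         newRecord.put(f.name(), oldRecord.get(f.name()));
>       }
>     }
>     return newRecord;
>   }
> }
> {code}
> Since the field layout is fixed once the HUDI schema is known, one obvious direction is to resolve each field name to its position a single time up front and copy by index for all subsequent records, instead of paying the name lookup for every field of every record.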