[ https://issues.apache.org/jira/browse/HUDI-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-3217:
----------------------------
Description:
* [P0] HUDI-6702 Extend merge API to support all merging operations (inserts, updates, and deletes, including customized getInsertValue); a hedged sketch of the proposed signature follows this list
** Option<Pair<HoodieRecord, Schema>> merge(Option<HoodieRecord> older, Schema oldSchema, Option<HoodieRecord> newer, Schema newSchema, TypedProperties props)
* [P0] HUDI-6765 Add merge mode to allow differentiation of dedup logic
** Add a new merge-mode argument (pre-combine, or update) to the merge API for customized dedup (or merging of log records?), instead of relying on OperationModeAwareness
* [P0?] HUDI-6767 Simplify compatibility of HoodieRecord conversion
** HoodieRecordCompatibilityInterface provides adaptation among the representation types (Avro, Row, etc.)
** Guarantee one type end-to-end: Avro, or Row for Spark (RowData for Flink). For Avro log blocks, a conversion from Avro to Row is needed for Spark
* [P0] HUDI-6768 Revisit HoodieRecord design and how it affects e2e row writing
** HoodieRecord does not merely wrap an engine-specific data structure; it also contains Java objects to store the record key, location, etc.
** For end-to-end row writing, could we use the engine-specific type InternalRow directly, instead of HoodieRecord<InternalRow>, by appending the key, location, etc. as row fields, to better leverage Spark's optimizations on DataFrames of InternalRow?
* [P0] Bug fixes
** HUDI-5807 HoodieSparkParquetReader is not appending partition-path values

The following are nice-to-haves but not on the critical path:
* [P1] Make merge logic engine-agnostic
** Different engines currently need to implement the merging logic on the engine-specific data structure (Spark's InternalRow, Flink's RowData, etc.) in separate HoodieRecordMerger implementation classes. Providing a getField API on HoodieRecord could allow engine-agnostic merge logic.
* [P1] HUDI-5249, HUDI-5282 Implement the MDT payload using the new merge API
** Only necessary if we use parquet as both the base and log file format in the MDT
* [P1] HUDI-3354 Migrate existing engine-specific readers to use HoodieRecord
** As we will implement new file-group readers and writers, we do not need to fix the existing readers now
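To make the first two P0 items concrete, here is a minimal Java sketch of a merger implementing the proposed HUDI-6702 signature with a HUDI-6765 merge-mode argument. The MergeMode enum, the class name, and the parameter order are illustrative assumptions, not the final Hudi API; only the Option, Pair, and TypedProperties utility types are existing Hudi classes.

{code:java}
import org.apache.avro.Schema;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.collection.Pair;

// Illustrative only: sketches the proposed HUDI-6702/HUDI-6765 merge API.
public class OverwriteWithLatestSketchMerger {

  // HUDI-6765 (assumption): the caller states whether it is pre-combining
  // incoming records or updating a record already on storage.
  public enum MergeMode { PRE_COMBINE, UPDATE }

  // HUDI-6702: Option-wrapped inputs/outputs cover all operations in one
  // API — insert (older absent), delete (empty result), update (both present).
  public Option<Pair<HoodieRecord, Schema>> merge(
      MergeMode mode,
      Option<HoodieRecord> older, Schema oldSchema,
      Option<HoodieRecord> newer, Schema newSchema,
      TypedProperties props) {
    if (!newer.isPresent()) {
      // The incoming side is a delete; the merged result is a delete too.
      return Option.empty();
    }
    if (!older.isPresent()) {
      // Insert: nothing to merge against. This is where a customized
      // getInsertValue-style transformation would hook in.
      return Option.of(Pair.of(newer.get(), newSchema));
    }
    // PRE_COMBINE dedups two incoming records; UPDATE merges against the
    // record on storage. This trivial merger treats both as "latest wins",
    // but a custom merger could branch on the mode.
    return Option.of(Pair.of(newer.get(), newSchema));
  }
}
{code}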
— OLD PLAN —

Currently Hudi is biased toward the assumption of a particular payload representation (Avro); long-term we would like to steer away from this and keep the record payload completely opaque, so that:
# We can keep the record payload representation engine-specific
# We avoid unnecessary serde loops (Engine-specific > Avro > Engine-specific > Binary)

h2. *Proposal*

*Phase 2: Revisiting Record Handling*
{_}T-shirt{_}: 2-2.5 weeks
{_}Goal{_}: Avoid tight coupling with a particular record representation on the Read Path (currently Avro)

* Revisit RecordPayload APIs (a hypothetical sketch of the reworked interface follows this section)
** Deprecate the {{getInsertValue}} and {{combineAndGetUpdateValue}} APIs, replacing them with new “opaque” APIs (not returning Avro payloads)
** Rebase the RecordPayload hierarchy to be engine-specific:
*** A common engine-specific base abstracting common functionality (Spark, Flink, Java)
*** Each feature-specific semantic will have to be implemented for all engines
** Introduce new APIs:
*** To access keys (record, partition)
*** To convert the record to Avro (for BWC)
* Revisit RecordPayload handling
** In WriteHandles
*** The API will accept an opaque RecordPayload (no Avro conversion)
*** Can do (opaque) record merging if necessary
*** Passes the RecordPayload as is to the FileWriter
** In FileWriters
*** Will accept the RecordPayload interface
*** Should be engine-specific (to handle the internal record representation)
** In RecordReaders
*** The API will provide an opaque RecordPayload (no Avro conversion)
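For the Phase 2 items above, here is one hypothetical shape the reworked payload interface could take, assuming the deprecations and new APIs listed in this proposal. The interface and method names (OpaqueRecordPayload, getData, combineWith, toAvro) are invented for illustration and are not Hudi's actual API.

{code:java}
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.util.Option;

// Illustrative only: one possible shape for the proposed "opaque" payload
// APIs. T is the engine-native representation (e.g. Spark's InternalRow or
// Flink's RowData), so WriteHandles, FileWriters, and RecordReaders can
// pass records through without any Avro round trip.
public interface OpaqueRecordPayload<T> {

  // Opaque replacements for the deprecated getInsertValue /
  // combineAndGetUpdateValue APIs: no Avro payload is returned.
  Option<T> getData(Schema schema);
  Option<T> combineWith(T currentValue, Schema schema);

  // New accessors so callers can read keys without an Avro conversion.
  String getRecordKey();
  String getPartitionPath();

  // Explicit Avro conversion, retained only for backward compatibility.
  Option<IndexedRecord> toAvro(Schema schema) throws IOException;
}
{code}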
> RFC-46: Optimize Record Payload handling
> ----------------------------------------
>
>                 Key: HUDI-3217
>                 URL: https://issues.apache.org/jira/browse/HUDI-3217
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: storage-management, writer-core
>            Reporter: Alexey Kudinkin
>            Assignee: Ethan Guo
>            Priority: Critical
>              Labels: hudi-umbrellas, pull-request-available
>             Fix For: 1.0.0
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)