[ https://issues.apache.org/jira/browse/HUDI-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prashant Wason closed HUDI-6098.
--------------------------------
    Resolution: Abandoned

https://github.com/apache/hudi/pull/8684

> Initial commit in MDT should use bulk insert for performance
> ------------------------------------------------------------
>
>                 Key: HUDI-6098
>                 URL: https://issues.apache.org/jira/browse/HUDI-6098
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> The initial commit into the MDT writes a very large number of records. With indexes
> like the record index (to be committed), the number of written records is on the
> order of the total number of records in the dataset itself (potentially billions).
> If we use upsertPrepped to initialize the indexes, then:
> # The initial commit will write data into log files.
> # Due to the large amount of data, the write will be split into a very large
> number of log blocks.
> # Performance of lookups from the MDT will suffer greatly until a compaction
> is run.
> # Compaction will take all the log data and write it into base files (HFiles),
> doubling the read/write IO.
> By directly writing the initial commit into base files using the
> bulkInsertPrepped API, we can avoid all the issues listed above.
> This is a critical requirement for large-scale indexes like the record index.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
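The write-amplification argument in the issue can be sketched with a simple cost model. This is not Hudi code; it is an illustrative Python sketch, and the two functions below are hypothetical names standing in for the upsertPrepped path (log files, then compaction rewrites everything into base files) and the bulkInsertPrepped path (base files written once).

```python
# Illustrative cost model of the IO described in the issue (not Hudi code).
# Counts total records written during MDT initialization under each path.

def upsert_prepped_io(num_records: int) -> int:
    """upsertPrepped path: every record lands in a log block first,
    then compaction rewrites all of it into base files (HFiles),
    roughly doubling the write IO."""
    log_file_writes = num_records     # initial commit goes into log files
    compaction_writes = num_records   # compaction rewrites logs into base files
    return log_file_writes + compaction_writes

def bulk_insert_prepped_io(num_records: int) -> int:
    """bulkInsertPrepped path: the initial commit goes straight
    into base files, written exactly once."""
    return num_records

# With a record index, num_records can be on the order of the dataset size.
records = 1_000_000_000
assert upsert_prepped_io(records) == 2 * bulk_insert_prepped_io(records)
```

Under this simplified model the upsert path writes every record twice, which is the "doubling the read/write IO" cost the issue attributes to compaction; the model ignores the additional lookup penalty of scanning many log blocks before compaction runs.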