pengxianzi commented on issue #12589: URL: https://github.com/apache/hudi/issues/12589#issuecomment-2574618976
> For migration, maybe you can use the bulk_insert operation to write the history dataset from Kudu in batch execution mode; you can then ingest into this Hudi table switching to the upsert operation.
>
> The write is slow for COW because each checkpoint triggers a whole-table rewrite; the same is true for MOR compaction.
>
> So maybe you can migrate the existing dataset from Kudu using the bulk_insert operation, and do streaming upserts with the incremental inputs. If the dataset itself is huge, partitioning the table by datetime should also help, because that would significantly reduce the scope of each rewrite.

Thank you for your suggestions! Our current approach for large tables aligns with your recommendations:

- We use the bulk_insert operation to migrate historical data from Kudu to the Hudi table.
- We then switch to the upsert operation for incremental data writes.

However, we encountered the following issue during implementation:

**Necessity of bucketing:** without bucketing, data duplication occurs during Flink writes. Only after enabling bucketing is the duplication resolved.
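The duplication symptom is consistent with the bulk_insert step not populating the index state that the subsequent Flink upsert relies on, so incoming keys are treated as new inserts; a bucket index avoids this because the key-to-file-group mapping is computed deterministically by hashing rather than looked up from state. Below is a hedged Flink SQL sketch of such a table definition. The table name, columns, path, and bucket count are illustrative only; the option keys (`index.type`, `hoodie.bucket.index.num.buckets`, `write.operation`) should be verified against the Hudi version in use.

```sql
-- Illustrative sketch, not the issue reporter's actual DDL.
CREATE TABLE hudi_orders (
  order_id BIGINT,
  order_ts TIMESTAMP(3),
  amount   DECIMAL(10, 2),
  dt       STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///warehouse/hudi_orders',
  'table.type' = 'MERGE_ON_READ',
  -- Bucket index: record keys are hashed to a fixed file group,
  -- so dedup does not depend on any index state surviving the
  -- switch from bulk_insert to upsert.
  'index.type' = 'BUCKET',
  'hoodie.bucket.index.num.buckets' = '64',
  -- 'bulk_insert' for the initial Kudu backfill,
  -- then 'upsert' for the incremental stream.
  'write.operation' = 'upsert'
);
```

Note that the bucket count cannot be changed for a simple bucket index after data is written, so it should be sized up front for the expected data volume per partition.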
