abhisheksahani91 commented on issue #10270: URL: https://github.com/apache/hudi/issues/10270#issuecomment-1847465601
@ad1happy2go Here is how we conducted the Hudi performance test in our pre-production environment:

1. **Bootstrapping the table:** We ingested data through Kafka using the Kafka MongoDB source connector, which copied the entire contents of a Mongo collection containing 100 million records. Hudi treated this data as inserts.
2. **Full re-ingestion:** Once the data was loaded into the Hudi table, we ran the connector again to copy the complete collection a second time; Hudi treated these records as updates.

We suspect that in the second step the connector ingests data in the same order as in the first step, so the updates produce delta Avro log files concentrated on a limited set of base files. In production, by contrast, updates are distributed much more randomly across the table.

We are investigating whether we can run compaction as a separate Glue job, independent of the ingestion pipeline, and will update you with the results.
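For reference, Hudi ships an offline compaction utility (`org.apache.hudi.utilities.HoodieCompactor` in the `hudi-utilities` bundle) that can be submitted as a standalone Spark job, which is one way to run compaction outside the ingestion pipeline. A minimal sketch follows; the bundle version, bucket path, table name, and resource settings are placeholders, not values from this issue, and the writer side would need inline compaction disabled (`hoodie.compact.inline=false`) for this split to make sense.

```shell
# Sketch of an offline compaction run as a separate Spark job.
# All paths/versions below are placeholders -- adapt to your Glue/EMR setup.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  hudi-utilities-bundle_2.12-0.14.0.jar \
  --base-path s3://your-bucket/hudi/your_table \
  --table-name your_table \
  --mode scheduleAndExecute \
  --parallelism 200
```

On Glue specifically, the equivalent would be packaging the utilities bundle as a job dependency and invoking the same class; the `scheduleAndExecute` mode both schedules a new compaction plan and executes any pending ones in a single run.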