abhisheksahani91 commented on issue #10270:
URL: https://github.com/apache/hudi/issues/10270#issuecomment-1847465601

   @ad1happy2go 
   The way we conducted the performance test for Hudi in our pre-production 
environment is as follows:
   
   1. Bootstrapping the table: we ingested data over Kafka using the Kafka Mongo source connector, which copied the full contents of a Mongo collection containing 100 million records. In this step, Hudi treated the records as inserts.
   2. Once the data was loaded into the Hudi table, we ran the Kafka Mongo source connector again to re-copy the complete collection, and Hudi treated those records as updates (see the sketch after this list).
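   Below is a minimal PySpark sketch of the two passes against a merge-on-read table. The S3 paths, record key (`_id`), precombine field (`ts`), and partition field (`region`) are assumptions for illustration only; substitute whatever the Mongo source connector actually emits.

```python
# Sketch of the two ingestion passes described above (paths/fields are assumptions).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-mongo-ingest-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

base_path = "s3://my-bucket/hudi/mongo_collection"  # hypothetical table path

common_opts = {
    "hoodie.table.name": "mongo_collection",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "_id",        # Mongo document id
    "hoodie.datasource.write.precombine.field": "ts",        # assumed ordering field
    "hoodie.datasource.write.partitionpath.field": "region", # hypothetical partition column
}

# Pass 1: bootstrap - the full collection lands as inserts (new base files).
bootstrap_df = spark.read.format("json").load("s3://my-bucket/raw/full_dump_1/")
(bootstrap_df.write.format("hudi")
    .options(**common_opts)
    .option("hoodie.datasource.write.operation", "insert")
    .mode("overwrite")
    .save(base_path))

# Pass 2: the connector re-copies the same documents; identical record keys mean
# Hudi routes them to the existing file groups as updates, which on a MOR table
# are written as delta Avro log files against those base files.
second_df = spark.read.format("json").load("s3://my-bucket/raw/full_dump_2/")
(second_df.write.format("hudi")
    .options(**common_opts)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))
```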
   
   In this scenario, we suspect that during the second step the Kafka Mongo source connector ingests the data in the same order as in the first step. Because of that, the updates target a limited set of base files and produce delta Avro log files concentrated on them, in contrast to our production environment, where updates are spread far more randomly across file groups.
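   One way we plan to check this (sketched below, with a placeholder begin instant time and a hypothetical table path) is to run an incremental query over the second commit and count updated records per `_hoodie_file_name`, one of Hudi's metadata columns. A handful of file groups absorbing most of the records would confirm that the re-copied data is hitting a limited set of base files.

```python
# Rough skew check: count records of the update commit per file group.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-update-skew-check").getOrCreate()

base_path = "s3://my-bucket/hudi/mongo_collection"  # hypothetical table path

updates = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20231201000000000")  # placeholder; take it from the commit timeline
    .load(base_path)
)

# Few distinct file groups with very high counts would indicate that the
# re-copied data concentrates its updates on a limited set of base files.
(updates.groupBy("_hoodie_file_name")
    .agg(F.count("*").alias("updated_records"))
    .orderBy(F.desc("updated_records"))
    .show(50, truncate=False))
```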
   
   We are investigating whether we can run compaction as a separate Glue job, decoupled from the ingestion pipeline, and we will update you with the results.
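   For reference, here is a rough sketch of what that standalone compaction job might look like, assuming a Hudi version (0.11+) where the `run_compaction` Spark SQL procedure is available and with inline compaction disabled (`hoodie.compact.inline=false`) on the ingestion side; the table path is hypothetical.

```python
# Sketch of a standalone compaction job (e.g. a separate Glue job).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-offline-compaction")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.hudi.spark.sql.HoodieSparkSessionExtension")
    .getOrCreate()
)

base_path = "s3://my-bucket/hudi/mongo_collection"  # hypothetical table path

# Schedule a compaction plan over the accumulated log files, then execute it.
spark.sql(f"CALL run_compaction(op => 'schedule', path => '{base_path}')")
spark.sql(f"CALL run_compaction(op => 'run', path => '{base_path}')")
```

   If the SQL procedure turns out not to be available in our Hudi version, the fallback would be submitting the `org.apache.hudi.utilities.HoodieCompactor` utility as its own Spark job.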
   
   

