rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663927931
Hi again. 👍 When I changed the write operation from insert to upsert, performance got worse.

**Cluster**
- 1 master node: m5.xlarge (4 vCPU, 16 GB RAM)
- 1 core node: r5.xlarge (4 vCPU, 32 GB RAM)
- 4 task nodes: r5.xlarge (4 vCPU, 32 GB RAM)
- spark.yarn.executor.memoryOverhead: 2048

I'm reading 10 files on each trigger, and at the beginning each file is about 1 GB.

**Hudi options**

```python
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.parquet.small.file.limit': 500000000,
    'hoodie.parquet.max.file.size': 900000000,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.database': 'datalake_raw',
    'hoodie.datasource.hive_sync.partition_fields': 'event_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://ip-10-0-53-190.us-west-2.compute.internal:10000'
}
```

I completely understand what you said about Hudi metadata and the ordering operations, but I'm only trying to process about 25 GB of data, and the task nodes alone have more than 100 GB of RAM, so I'm probably doing something wrong hehehe. The whole run took 1 hour and 40 minutes.

<img width="1680" alt="Captura de Tela 2020-07-25 às 22 03 23" src="https://user-images.githubusercontent.com/36298331/88469957-3c175100-cecd-11ea-9857-a82e9f249fa1.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 22 04 03" src="https://user-images.githubusercontent.com/36298331/88469959-40436e80-cecd-11ea-98f6-225b9b30f01d.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 22 03 34" src="https://user-images.githubusercontent.com/36298331/88469961-42a5c880-cecd-11ea-8c28-996afe0e1547.png">

I tried the same load in batch mode with the insert operation and it took 46 minutes; overall performance seems much better in batch mode, as you can see in the following image.

<img width="1680" alt="Captura de Tela 2020-07-25 às 23 22 30" src="https://user-images.githubusercontent.com/36298331/88470018-d37ca400-cecd-11ea-817c-ac28f62a2276.png">

However, this batch execution created a lot of ~50 MB files. Is there a way to make that better? (I list the sizing options I plan to experiment with at the end of this comment.)

I think processing big workloads in batch mode with the insert operation could be much more scalable. What do you think? My situation is this: I have some datasets where I need to process all the historical data, which has to be deduplicated because it is CDC data, and after that I need to keep the data updated with a streaming job. These new datasets will be the source for many other tables in the company. Could you advise me on which approach would be better? My idea is to load all the data in batch first and then keep a streaming job running to keep it updated (a rough sketch of that plan is below).

Last question: when I run the insert operation in a streaming job with foreachBatch, will Hudi deduplicate only the data that exists inside that specific micro-batch? For example, since I'm reading 10 files on each trigger, if the next trigger contains data that already exists in a previous trigger's batch, that data won't be deduplicated, right?

Thank you so much, and sorry for so many questions, but I need to put Hudi into production ASAP.
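For context, here is a rough, simplified sketch of the "batch backfill first, then streaming" plan I have in mind. This is not my real job: the paths, bucket names, schema handling and parallelism are placeholders, and I have left out the Hive sync options for brevity.

```python
# Sketch only: backfill the historical CDC data once with bulk_insert,
# then keep the table updated with a streaming foreachBatch upsert.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-backfill-and-stream").getOrCreate()

tableName = "orders"                                   # placeholder
basePath = "s3://my-datalake/raw/orders"               # placeholder

backfill_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.operation': 'bulk_insert',   # fast initial load
    'hoodie.datasource.write.hive_style_partitioning': 'true',
}

# 1) Backfill: read all historical files once and write them in one batch.
historical_df = spark.read.parquet("s3://my-landing-zone/orders/")      # placeholder
(historical_df.write.format("hudi")
    .options(**backfill_options)
    .mode("overwrite")
    .save(basePath))

# 2) Streaming: keep the table updated. Here I use upsert; if this were
#    'insert' instead, my understanding (the question above) is that only
#    duplicates inside each micro-batch would be combined.
streaming_options = dict(backfill_options,
                         **{'hoodie.datasource.write.operation': 'upsert'})

def write_hudi(batch_df, batch_id):
    # called once per micro-batch by foreachBatch
    (batch_df.write.format("hudi")
        .options(**streaming_options)
        .mode("append")
        .save(basePath))

stream_df = (spark.readStream
    .schema(historical_df.schema)
    .option("maxFilesPerTrigger", 10)                   # 10 files per trigger, as in my current job
    .parquet("s3://my-landing-zone/orders/"))           # placeholder

(stream_df.writeStream
    .foreachBatch(write_hudi)
    .option("checkpointLocation", "s3://my-datalake/checkpoints/orders")  # placeholder
    .start())
```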
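And about the many ~50 MB files from the batch insert run: these are the sizing and parallelism options from the docs that I was planning to experiment with next. The values below are just my guesses, nothing I have validated, so please correct me if these are not the right knobs.

```python
# Options I plan to try for the batch insert run (values are guesses, not validated)
sizing_options = {
    'hoodie.parquet.small.file.limit': 500000000,      # same small-file threshold as my streaming job
    'hoodie.parquet.max.file.size': 900000000,         # same target max file size
    'hoodie.copyonwrite.record.size.estimate': 512,    # rough average record size in bytes (guess)
    'hoodie.insert.shuffle.parallelism': 50,           # fewer write tasks -> fewer, larger files?
    'hoodie.bulkinsert.shuffle.parallelism': 50,
}
```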