rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663927931
Hi again. 👍 When I changed the write operation from insert to upsert, performance got worse.

**Cluster**
- 1 master node: m5.xlarge (4 vCPU, 16 GB RAM)
- 1 core node: r5.xlarge (4 vCPU, 32 GB RAM)
- 4 task nodes: r5.xlarge (4 vCPU, 32 GB RAM)
- spark.yarn.executor.memoryOverhead: 2048

I'm reading 10 files on each trigger, and at the beginning each file is about 1 GB.

**Hudi options**

```python
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.parquet.small.file.limit': 500000000,
    'hoodie.parquet.max.file.size': 900000000,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.database': 'datalake_raw',
    'hoodie.datasource.hive_sync.partition_fields': 'event_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://ip-10-0-53-190.us-west-2.compute.internal:10000'
}
```

I completely understand what you said about Hudi metadata and the ordering operations, but I'm only trying to process about 25 GB of data, and the task nodes alone have more than 100 GB of RAM, so I'm probably doing something wrong hehehe. The whole run took 1 hour and 40 minutes.

<img width="1680" alt="Captura de Tela 2020-07-25 às 22 03 23" src="https://user-images.githubusercontent.com/36298331/88469957-3c175100-cecd-11ea-9857-a82e9f249fa1.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 22 04 03" src="https://user-images.githubusercontent.com/36298331/88469959-40436e80-cecd-11ea-98f6-225b9b30f01d.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 22 03 34" src="https://user-images.githubusercontent.com/36298331/88469961-42a5c880-cecd-11ea-8c28-996afe0e1547.png">

I tried the same load in batch mode with the insert operation and it took 46 minutes; overall performance seems much better in batch mode, as you can see in the following image.

<img width="1680" alt="Captura de Tela 2020-07-25 às 23 22 30" src="https://user-images.githubusercontent.com/36298331/88470018-d37ca400-cecd-11ea-817c-ac28f62a2276.png">

However, this batch execution created a lot of ~50 MB files. Is there a way to make that better? (I list the sizing options I plan to experiment with at the end of this comment.)

I think processing big workloads in batch mode with the insert operation could be much more scalable. What do you think? My situation is this: I have some datasets where I need to process all the historical data, which has to be deduplicated because it is CDC data, and after that I need to keep the data updated with a streaming job. These new datasets will be the source for many other tables in the company. Could you advise me on which approach would be better? My idea is to load all the data in batch first and then keep a streaming job running to keep it updated (a rough sketch of that plan is below).

Last question: when I run the insert operation in a streaming job with foreachBatch, will Hudi deduplicate only the data that exists inside that specific micro-batch? For example, since I'm reading 10 files on each trigger, if the next trigger contains data that already exists in a previous trigger's batch, that data won't be deduplicated, right?

Thank you so much, and sorry for so many questions, but I need to put Hudi into production ASAP.
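For context, here is a rough, simplified sketch of the "batch backfill first, then streaming" plan I have in mind. This is not my real job: the paths, bucket names, schema handling and parallelism are placeholders, and I have left out the Hive sync options for brevity.

```python
# Sketch only: backfill the historical CDC data once with bulk_insert,
# then keep the table updated with a streaming foreachBatch upsert.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-backfill-and-stream").getOrCreate()

tableName = "orders"                                   # placeholder
basePath = "s3://my-datalake/raw/orders"               # placeholder

backfill_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
    'hoodie.datasource.write.operation': 'bulk_insert',   # fast initial load
    'hoodie.datasource.write.hive_style_partitioning': 'true',
}

# 1) Backfill: read all historical files once and write them in one batch.
historical_df = spark.read.parquet("s3://my-landing-zone/orders/")      # placeholder
(historical_df.write.format("hudi")
    .options(**backfill_options)
    .mode("overwrite")
    .save(basePath))

# 2) Streaming: keep the table updated. Here I use upsert; if this were
#    'insert' instead, my understanding (the question above) is that only
#    duplicates inside each micro-batch would be combined.
streaming_options = dict(backfill_options,
                         **{'hoodie.datasource.write.operation': 'upsert'})

def write_hudi(batch_df, batch_id):
    # called once per micro-batch by foreachBatch
    (batch_df.write.format("hudi")
        .options(**streaming_options)
        .mode("append")
        .save(basePath))

stream_df = (spark.readStream
    .schema(historical_df.schema)
    .option("maxFilesPerTrigger", 10)                   # 10 files per trigger, as in my current job
    .parquet("s3://my-landing-zone/orders/"))           # placeholder

(stream_df.writeStream
    .foreachBatch(write_hudi)
    .option("checkpointLocation", "s3://my-datalake/checkpoints/orders")  # placeholder
    .start())
```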
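And about the many ~50 MB files from the batch insert run: these are the sizing and parallelism options from the docs that I was planning to experiment with next. The values below are just my guesses, nothing I have validated, so please correct me if these are not the right knobs.

```python
# Options I plan to try for the batch insert run (values are guesses, not validated)
sizing_options = {
    'hoodie.parquet.small.file.limit': 500000000,      # same small-file threshold as my streaming job
    'hoodie.parquet.max.file.size': 900000000,         # same target max file size
    'hoodie.copyonwrite.record.size.estimate': 512,    # rough average record size in bytes (guess)
    'hoodie.insert.shuffle.parallelism': 50,           # fewer write tasks -> fewer, larger files?
    'hoodie.bulkinsert.shuffle.parallelism': 50,
}
```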