bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-663813751


   This is a Spark tuning issue in general. The slowness is due to memory 
pressure and the node failures it causes. At least in one of the batches, I see 
task failures (and retries) while reading from the source parquet files themselves. 
   
   As mentioned in the suggestion "Consider boosting 
spark.yarn.executor.memoryOverhead or disabling 
yarn.nodemanager.vmem-check-enabled because of YARN-4714.", you need to 
increase spark.yarn.executor.memoryOverhead. You are running 2 executors per 
machine with 8GB for each, which may not leave a lot of headroom. If you are 
trying to compare a plain parquet write with Hudi, note that Hudi adds metadata 
fields which enable incremental pull, indexing and other benefits. If your 
original record size is very small and comparable to that metadata overhead, 
and your setup is already close to hitting the memory limit for the plain 
parquet write, then you would need to give it more resources. 
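   
   For illustration only, a rough sketch of where those knobs would go if you 
build the SparkSession yourself (the same settings can equally be passed as 
--conf to spark-submit); the memory numbers are placeholders, not 
recommendations, and need to be tuned to what your YARN node managers allow:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the memory values below are placeholders.
// Per executor, YARN must fit spark.executor.memory + spark.yarn.executor.memoryOverhead,
// so with 2 executors per node, both containers together must stay under the node's limit.
val spark = SparkSession.builder()
  .appName("hudi-bootstrap")
  .config("spark.executor.memory", "6g")                  // executor heap
  .config("spark.yarn.executor.memoryOverhead", "2048")   // off-heap headroom, in MB
  .config("spark.executor.cores", "2")
  .getOrCreate()
```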
   
   On a related note, since you are trying to use streaming to bootstrap 
from a fixed source, have you considered using bulk insert, or insert (for 
file-size handling), in batch mode, which would sort and write the data once? 
With the streaming mode of incremental inserting, Hudi will try to grow a 
small file generated in a previous batch. That means it has to read the small 
file, apply the new inserts and write a newer, bigger version of the file. 
As you can see, the more iterations there are, the more repeated reads 
happen. Hence, you would benefit from throwing more resources at the job for a 
potentially shorter time to do this migration. A rough sketch of the 
batch-mode write is below. 
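   
   To make that concrete, here is a minimal sketch of a one-time bulk_insert 
bootstrap via the Hudi datasource. The paths, table name and field names 
("id", "ts") are hypothetical placeholders; only the option keys and the 
"bulk_insert" operation value come from Hudi itself:

```scala
import org.apache.spark.sql.SaveMode

// One-time batch bootstrap: read the fixed source once, then sort and write once
// with bulk_insert, instead of growing small files across many streaming micro-batches.
val sourceDf = spark.read.parquet("s3://your-bucket/source-parquet/")   // hypothetical source path

sourceDf.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "my_table")                              // placeholder table name
  .option("hoodie.datasource.write.operation", "bulk_insert")           // or "insert" for small-file handling
  .option("hoodie.datasource.write.recordkey.field", "id")              // placeholder record key field
  .option("hoodie.datasource.write.precombine.field", "ts")             // placeholder precombine field
  .mode(SaveMode.Overwrite)
  .save("s3://your-bucket/hudi-table/")                                 // hypothetical target base path
```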
   
    
   
   

