[GitHub] [hudi] bvaradar commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-08-03 Thread GitBox


bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-668076474


   Set hoodie.combine.before.insert=true for deduping during bulk insert 
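   As a sketch only (table name, record key field, and path are placeholders, not from this issue), the flag would be passed through the Hudi datasource options on the write:

```scala
// Sketch, assuming an existing SparkSession and an input DataFrame `inputDf`;
// table name, key field, and target path are placeholders.
import org.apache.spark.sql.SaveMode

inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "orders")                     // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder key field
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.combine.before.insert", "true")            // dedupe on record key before writing
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/orders")                           // placeholder base path
```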



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-31 Thread GitBox


bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-667023700


   For a monotonically increasing id, you can use bulk-insert instead of insert 
for the first-time loading of files. This would nicely order records by the id, 
and your range pruning during index lookup would be efficient. The parallelism 
configuration 
https://hudi.apache.org/docs/configurations.html#withBulkInsertParallelism 
controls the number of files generated. 
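   A minimal sketch of such a write (the parallelism value and base path are illustrative, not from this issue):

```scala
// Sketch: bulk_insert with explicit shuffle parallelism; with a value of ~200
// the write produces on the order of 200 files. Tune to the input data size.
import org.apache.spark.sql.SaveMode

df.write                                                      // `df` is an assumed input DataFrame
  .format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")     // illustrative value
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/orders")                            // placeholder base path
```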
   
   `I will use AWS Athena to query all my tables, and this specific order table 
may be delayed up to 15 minutes. I saw that Athena only queries the Read 
Optimized view of MoR; how could MoR help me in this case?`
 ===> Would let @umehrot2 answer this question. But if your use-case allows, 
you can schedule compaction for the MOR table at a frequency that aligns with 
the SLA you want to maintain. This way you can still query that data using the 
RO view.
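   One way to sketch this, assuming inline compaction fits your pipeline (the cadence value is illustrative, not a recommendation from this thread):

```scala
// Sketch: MERGE_ON_READ table with inline compaction every few delta commits,
// keeping the read-optimized view that Athena queries reasonably fresh.
import org.apache.spark.sql.SaveMode

df.write                                                      // `df` is an assumed input DataFrame
  .format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.compact.inline", "true")                    // run compaction as part of the write
  .option("hoodie.compact.inline.max.delta.commits", "3")     // illustrative cadence
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/orders")                            // placeholder base path
```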
   
   For the insert operation, the same config as in upsert controls file sizing 
('hoodie.parquet.max.file.size')
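   As a sketch, the two file-sizing knobs together (the byte values are illustrative defaults-of-thumb, not from this issue):

```scala
// Sketch: target file size plus the small-file threshold. Files under the
// small-file limit are candidates for expansion on subsequent inserts/upserts.
df.write                                                        // `df` is an assumed input DataFrame
  .format("hudi")
  .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)    // ~120 MB target
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // ~100 MB threshold
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://bucket/hudi/orders")                              // placeholder base path
```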
   
   







[GitHub] [hudi] bvaradar commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-24 Thread GitBox


bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-663813751


   This is a Spark tuning issue in general. The slowness is due to memory 
pressure and the node failures it causes. In at least one of the batches, I see 
task failures (and retries) while reading from the source parquet files 
themselves. 
   
   As mentioned in the suggestion "Consider boosting 
spark.yarn.executor.memoryOverhead or disabling 
yarn.nodemanager.vmem-check-enabled because of YARN-4714.", you need to 
increase spark.yarn.executor.memoryOverhead. You are running 2 executors per 
machine with 8GB each, which may not leave much headroom. If you are comparing 
a plain parquet write with Hudi, note that Hudi adds metadata fields, which 
enable incremental pull, indexing, and other benefits. If your original record 
size is very small and comparable to the metadata overhead, and your setup is 
already close to the limit for a plain parquet write, then you would need to 
give more resources. 
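   For illustration only: these settings are normally passed at submit time (e.g. `spark-submit --conf spark.yarn.executor.memoryOverhead=2048 ...`); the sketch below shows the same keys via the session builder, with placeholder values:

```scala
// Sketch: raising executor off-heap headroom, the thing YARN's vmem check
// kills containers over. On newer Spark versions the key is
// spark.executor.memoryOverhead; the spark.yarn.* name is the older one.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-ingest")                                   // placeholder app name
  .config("spark.executor.memory", "8g")
  .config("spark.yarn.executor.memoryOverhead", "2048")     // MB, illustrative value
  .getOrCreate()
```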
   
   On a related note, since you are using streaming to bootstrap from a fixed 
source, have you considered using bulk insert, or insert (for file-size 
handling), in batch mode, which would sort and write the data once? With the 
incremental-insert mode you are using, Hudi tries to grow a small file 
generated in the previous batch. This means it has to read the small file, 
apply the new inserts, and write a newer, bigger version of the file. As you 
can see, the more iterations, the more repeated reads. Hence, you would benefit 
from throwing more resources at this migration for a potentially shorter time. 
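   The batch-mode bootstrap described above could be sketched as a one-shot job (source and target paths are placeholders): read the fixed source once, bulk-insert once, and only then switch to the streaming writer for ongoing data.

```scala
// Sketch: one-time batch bootstrap with bulk_insert instead of incremental
// streaming inserts. The data is sorted and written once, avoiding the
// repeated small-file read/rewrite cycle across micro-batches.
import org.apache.spark.sql.SaveMode

spark.read.parquet("s3://bucket/source/")                     // placeholder source path
  .write
  .format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")    // placeholder key field
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode(SaveMode.Overwrite)
  .save("s3://bucket/hudi/orders")                            // placeholder target path
```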
   

   
   


