[GitHub] [hudi] bvaradar commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer
bvaradar commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-668076474

Set `hoodie.combine.before.insert=true` for deduping during bulk insert.
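A minimal sketch of how that option might be passed through the Spark DataSource writer; `df`, the table name, path, and key/precombine fields below are placeholders for illustration, not part of the original report:

```scala
import org.apache.spark.sql.SaveMode

// Dedupe records on the precombine field before the bulk insert writes them out.
// All names/paths here are placeholders.
df.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder record key field
  .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder precombine field
  .option("hoodie.combine.before.insert", "true")            // drop duplicates before writing
  .option("hoodie.table.name", "my_table")                   // placeholder table name
  .mode(SaveMode.Append)
  .save("s3://bucket/path/to/table")                         // placeholder base path
```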
bvaradar commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-667023700

For a monotonically increasing id, you can use bulk-insert instead of insert for the first-time loading of files. This would order records nicely by the id, so your range pruning during index lookup would be efficient. The parallelism configuration https://hudi.apache.org/docs/configurations.html#withBulkInsertParallelism controls the number of files being generated.

`I will use AWS Athena to query all my tables, and this specific order table may be delayed up to 15 minutes. I saw that Athena only queries the Read Optimized view of MoR; how could MoR help me in this case?` ===> I will let @umehrot2 answer this question. But if your use case allows, you can schedule compaction for the MOR table at a frequency that aligns with the SLA you want to maintain. That way you can still query that data through the Read Optimized view.

For insert operations, the same config as in upsert controls file sizing (`hoodie.parquet.max.file.size`).
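A hedged sketch tying these knobs together for the initial load; the parallelism and file-size values are illustrative only, and the table name and path are placeholders:

```scala
import org.apache.spark.sql.SaveMode

// First-time load via bulk_insert: shuffle parallelism controls how many files get generated,
// and hoodie.parquet.max.file.size caps the target base file size (in bytes).
df.write
  .format("hudi")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", "200")                 // illustrative value
  .option("hoodie.parquet.max.file.size", (120L * 1024 * 1024).toString)  // ~120 MB, illustrative
  // For a MOR table, an inline compaction cadence can keep the Read Optimized view fresh, e.g.:
  // .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  // .option("hoodie.compact.inline", "true")
  // .option("hoodie.compact.inline.max.delta.commits", "4")
  .option("hoodie.table.name", "orders")                                  // placeholder table name
  .mode(SaveMode.Append)
  .save("s3://bucket/path/to/orders")                                     // placeholder base path
```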
bvaradar commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663813751

This is a Spark tuning issue in general. The slowness is due to memory pressure and the node failures it causes. In at least one of the batches, I see task failures (and retries) while reading from the source parquet files themselves. As the suggestion says, "Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.", you need to increase spark.yarn.executor.memoryOverhead. You are running 2 executors per machine with 8GB each, which may not leave much headroom.

If you are comparing a plain parquet write with Hudi, note that Hudi adds metadata fields which enable incremental pull, indexing, and other benefits. If your original record size is very small and comparable to the metadata overhead, and your setup is already close to the limit for the parquet write, then you will need to give it more resources.

On a related note, since you are using streaming to bootstrap from a fixed source, have you considered using bulk insert or insert (for file sizing) in batch mode, which would sort and write the data once? With incremental inserting, Hudi tries to grow a small file generated in the previous batch: it has to read the small file, apply the new inserts, and write a newer, bigger version of the file. As you can see, the more iterations there are, the more repeated reads happen. Hence, you would benefit from throwing more resources at the job for a potentially shorter time to do this migration.
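A rough sketch of raising the overhead mentioned in the YARN message above; the sizes are examples rather than recommendations, and the same settings can equally be passed as `--conf` flags to spark-submit:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: give each executor more off-heap headroom, per the YARN suggestion.
val spark = SparkSession.builder()
  .appName("hudi-bootstrap")                             // placeholder app name
  .config("spark.executor.memory", "8g")
  .config("spark.yarn.executor.memoryOverhead", "2048")  // MB; newer Spark releases use spark.executor.memoryOverhead
  .getOrCreate()
```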