[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-736998571 Thanks @bvaradar. Would you know when 0.7.0 is slated for release? The S3 listing time will continue to grow for us as we add more partitions, even with cleaning turned off. Also, since we are using a COW table with mostly inserts, would new file versions still be created, and would the old versions then need to be cleaned up? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-734453893 Is there a downside to keeping hoodie.clean.automatic=false?
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-734450744 Thanks @bvaradar. Setting hoodie.clean.automatic=false has helped reduce the processing time significantly: the 5 records now get inserted in less than a minute.
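For reference, the write configuration being discussed might look roughly like the following. This is a minimal sketch, not the poster's actual job: the table name, key fields, and path are hypothetical, and option keys follow the Hudi configuration reference. Note that with the cleaner disabled, old file versions accumulate until cleaning is re-enabled or run manually.

```python
# Sketch of Hudi writer options for a bulk insert with automatic
# cleaning disabled. Table name, fields, and path are hypothetical.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "bulk_insert",
    # Skip the cleaner (and its partition listing) on each write:
    "hoodie.clean.automatic": "false",
}

# In a Spark job this dict would be passed as writer options, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/events")
```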
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-733323629 Yes, this is a COW table.
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-733174238 Thanks @bvaradar. I tried to insert just 5 records into the existing table with ~300K partitions and it took ~5 hrs; inserting ~5 records into a new table takes less than 2 mins. Is this extra ~5 hrs all due to the cleaner and compaction processes? For our use case, we mostly get inserts. With that in mind, would it be beneficial to switch from COW to MOR and do async compaction? (I am most likely making an incorrect assumption that this huge extra processing time is only because of compaction.) Also, since our data does not have frequent record-level updates, would switching to MOR make any difference?
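If the switch to MOR were attempted, the options that would differ are roughly the table type and the inline-compaction flag (a hedged sketch, with keys per the Hudi configuration reference; as the question itself suspects, for an insert-mostly COW workload the listing/cleaning cost rather than compaction is the likely culprit, so this may not help):

```python
# Sketch of the writer options that would change when trying
# MERGE_ON_READ with compaction deferred to an async process.
# Whether this benefits an insert-mostly workload is the open
# question posed above, not a claim made here.
mor_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Defer compaction out of the synchronous write path:
    "hoodie.compact.inline": "false",
}
```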
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-731908622 Thanks @bvaradar for your response. I have a few more questions:

1) The reason we have kept the partition key we are using is that we wanted O(1) read performance on it. It is my understanding that this many partitions puts memory pressure on the executors, since each executor creates as many writers as there are partitions. (I am assuming the HDFS namenode would also be impacted, but since we are using S3, I am discounting that; do let me know if I am mistaken.) It is here that I wanted to confirm my understanding: every day our process will update ~12K partitions and insert ~33K new partitions. Will the executors doing the hudi table write create ~44K writers, contributing to the memory pressure? Or will the already existing ~300K partitions also be touched in some way by the write executors, leading to performance degradation as we continue to add more data to the hudi table?

2) Just to confirm my understanding: when you mentioned s3 listing as the bottleneck, you meant the s3 listing of all the partitions and files for the hudi table, not just the partitions updated and/or inserted by that specific process. In my case, that would imply that the hudi table write is doing an s3 listing of the already existing ~300K partitions and associated files, not just the ~44K partitions for the specific execution. This is probably in line with what we have observed: for the initial 15 daily runs, each hudi table write completed in around 4 hrs, and from the 16th day onwards it gradually increased from 4 to 5 to 6 and now to almost 9 hrs per day. Can you please confirm?

3) If the s3 listing requirement is made optional in hudi 0.7.0, can we continue to use the partition key we are using, given that every day our process will add/update ~44K partitions in the hudi table? I understand that it is not the best partition key because of its very high cardinality, but our read requirement is what is driving us toward it. This may be related to question 1, but is there any other downside you can see to our use of this partition key apart from the s3 listing dependency?

4) We are trying to see whether spark bucketing on the key would be a good middle ground between partitioning on the key and not partitioning at all. Does the hudi table write support bucketed writes, and consequently, are hudi table reads able to use the buckets for optimal read performance? Something like O(1) hash + O(log m) binary search, where m is the number of records in each bucketed file.
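The read-cost model in question 4 can be sketched in plain Python: a constant-time hash picks the bucket, then a binary search finds the record within that bucket's sorted contents. This is a toy in-memory stand-in for bucketed files, not Hudi or Spark code; the bucket count and keys are illustrative.

```python
import bisect

NUM_BUCKETS = 4  # illustrative; a real table would use far more

def bucket_for(key: int) -> int:
    return key % NUM_BUCKETS  # O(1) bucket selection via hashing

# Build toy "bucket files", each kept sorted on the key.
buckets = [[] for _ in range(NUM_BUCKETS)]
for key in range(100):
    buckets[bucket_for(key)].append(key)
for b in buckets:
    b.sort()

def lookup(key: int) -> bool:
    b = buckets[bucket_for(key)]       # O(1) hash to the bucket
    i = bisect.bisect_left(b, key)     # O(log m) within the bucket
    return i < len(b) and b[i] == key

print(lookup(42))   # True
print(lookup(500))  # False
```

Total cost per point lookup is O(1) + O(log m), matching the estimate in the question, with m the records per bucket file.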