Hello all,

Starting this discussion thread to get your thoughts on "Hudi Observability". When Hudi runs as a Spark job, collecting stats would a) give insight into bottlenecks and constraints around partitioning/distribution of work at the various Hudi stages, b) give insight for tuning the job properly to address memory/CPU constraints (with the potential to auto-tune for different workloads), and c) surface failures caused by outliers (e.g., a slow or bad executor/host being used by the job).
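To make (a)-(c) concrete, here is a minimal sketch (not the RFC's actual design) of the kind of per-task stats one could gather through a Spark listener and use to flag skewed stages or a slow executor. The class name HudiStageStatsListener and the 3x-median threshold are hypothetical, just for illustration:

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener that records (executorId, task duration) per stage.
class HudiStageStatsListener extends SparkListener {
  private val taskDurations = mutable.Map.empty[Int, mutable.Buffer[(String, Long)]]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val info = taskEnd.taskInfo
    if (info != null && info.finished) {
      taskDurations.getOrElseUpdate(taskEnd.stageId, mutable.Buffer.empty) +=
        ((info.executorId, info.duration))
    }
  }

  // Crude outlier signal: tasks taking more than 3x the stage's median duration.
  def outliers(): Map[Int, Seq[(String, Long)]] = synchronized {
    taskDurations.toMap.flatMap { case (stageId, samples) =>
      val sorted = samples.map(_._2).sorted
      if (sorted.isEmpty) None
      else {
        val median = sorted(sorted.size / 2)
        val slow = samples.filter(_._2 > 3 * median).toSeq
        if (slow.nonEmpty) Some(stageId -> slow) else None
      }
    }
  }
}

// Usage (hypothetical): register once per Hudi write job
// spark.sparkContext.addSparkListener(new HudiStageStatsListener)
```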
I have outlined the approach in the following cwiki document: RFC-23 - Hudi Observability metrics collection <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection>. I am sure there are many use cases/scenarios I haven't thought of where such metrics collection would be useful. Please chime in and share your thoughts.

Thanks,
Balajee
