Hello all,

Starting this discussion thread to get your thoughts on "Hudi
Observability". When Hudi runs as a Spark job, we could collect stats that:
a) provide insight into the bottlenecks and constraints around
partitioning/distribution of work at the various Hudi stages,
b) help tune the job properly to address memory/CPU related
constraints (with the potential to auto-tune for different workloads), and
c) surface failures caused by outliers (e.g., a slow/bad executor/host
being used for the job); see the sketch below for one way such stats
could be gathered.
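
To make (a) and (c) a bit more concrete, here is a minimal sketch (not
part of the RFC, just an illustration) of gathering per-host task stats
with a plain Spark listener. The class name HudiTaskStatsListener and the
choice of aggregating executor run time per host are my own assumptions:

    import org.apache.spark.SparkContext;
    import org.apache.spark.scheduler.SparkListener;
    import org.apache.spark.scheduler.SparkListenerTaskEnd;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Illustrative listener: aggregates task run time per host so that
    // skewed work distribution (a) and slow/outlier hosts (c) show up
    // in a single summary after a Hudi write finishes.
    public class HudiTaskStatsListener extends SparkListener {
      private final Map<String, LongAdder> runTimePerHost = new ConcurrentHashMap<>();

      @Override
      public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        if (taskEnd.taskMetrics() == null) {
          return; // task failed before metrics were reported
        }
        String host = taskEnd.taskInfo().host();
        runTimePerHost.computeIfAbsent(host, h -> new LongAdder())
            .add(taskEnd.taskMetrics().executorRunTime());
      }

      public Map<String, LongAdder> getRunTimePerHost() {
        return runTimePerHost;
      }

      // Register on the underlying SparkContext of the job.
      public static void register(SparkContext sc, HudiTaskStatsListener listener) {
        sc.addSparkListener(listener);
      }
    }

Something along these lines, registered via sc.addSparkListener(...),
would already surface skew across hosts; the cwiki document below
describes the actual approach being proposed.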

I have outlined the approach in the following cwiki document:
RFC-23: Hudi Observability metrics collection
<https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection>

I am sure there are many use-cases/scenarios that I haven't thought about,
where such metrics collection will be useful. Please chime in and share
your thoughts.

Thanks,
Balajee
