Hello Tao,

During our discussion with Wilfred yesterday, he mentioned that you folks at Alibaba have been running YuniKorn at a decent scale. We are also trying some big workloads (Spark batch jobs) with YuniKorn and would like better visibility into scheduling performance, as well as alerts that help us spot issues as soon as they happen. We found that the current list of metrics available in the core is not comprehensive, and some seem to be incorrectly computed.

We are reaching out to ask which metrics you have found most helpful. Did you add any new metrics? More generally, how have you been monitoring YuniKorn?

Many thanks in advance.
If anyone else on the mailing list has ideas to share, that would be awesome too.

Regards,
Chaoran