Thanks, Tao, for the details; that's very helpful. Really appreciate it! We'll look into the options you mentioned. To check our understanding, I've sketched below what we might try for your points (1) and (3).
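For (1), here is a rough sketch of how we might pull the key lifecycle timestamps out of every Pod with client-go. This is only our guess at the approach, not your actual setup; the "default" namespace and the choice of conditions are placeholders:

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// conditionTime returns when the given pod condition last became true,
// or the zero time if the condition is not present yet.
func conditionTime(pod *corev1.Pod, condType corev1.PodConditionType) time.Time {
	for _, c := range pod.Status.Conditions {
		if c.Type == condType && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime.Time
		}
	}
	return time.Time{}
}

func main() {
	// Placeholder kubeconfig loading; in-cluster config would work too.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		created := pod.CreationTimestamp.Time
		scheduled := conditionTime(pod, corev1.PodScheduled)
		ready := conditionTime(pod, corev1.PodReady)
		if scheduled.IsZero() || ready.IsZero() {
			continue // not yet scheduled or not yet ready, skip
		}
		// create->scheduled is the scheduling latency; scheduled->ready covers
		// image pulls, CNI/CSI setup and container start on the kubelet side.
		fmt.Printf("%s\tcreate->scheduled=%v\tscheduled->ready=%v\n",
			pod.Name, scheduled.Sub(created), ready.Sub(scheduled))
	}
}

We'd then aggregate these per-phase durations and chart them in Grafana, as you describe.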
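For (3), we're imagining the injectable profiling tool looks roughly like the sketch below: record named durations anywhere in the code and dump a statistics report periodically or on demand. The names here (Profiler, Track, Report) are made up by us, not taken from your tool:

package profiling

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Profiler collects named durations and summarizes them on demand or on a
// fixed interval. A hypothetical sketch, not Tao's actual tool.
type Profiler struct {
	mu      sync.Mutex
	samples map[string][]time.Duration
}

func New() *Profiler {
	return &Profiler{samples: make(map[string][]time.Duration)}
}

// Track records the elapsed time of the enclosing call site, e.g.:
//	defer p.Track("schedule-cycle")()
func (p *Profiler) Track(name string) func() {
	start := time.Now()
	return func() {
		p.mu.Lock()
		p.samples[name] = append(p.samples[name], time.Since(start))
		p.mu.Unlock()
	}
}

// Report prints count/avg/p99 for every tracked name, then resets the samples.
func (p *Profiler) Report() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for name, ds := range p.samples {
		sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
		var total time.Duration
		for _, d := range ds {
			total += d
		}
		fmt.Printf("%-24s n=%d avg=%v p99=%v\n",
			name, len(ds), total/time.Duration(len(ds)), ds[len(ds)*99/100])
	}
	p.samples = make(map[string][]time.Duration)
}

// Every reports on the given interval in a background goroutine.
func (p *Profiler) Every(interval time.Duration) {
	go func() {
		for range time.Tick(interval) {
			p.Report()
		}
	}()
}

Does that roughly match the shape of what you built?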
On Thu, Apr 15, 2021 at 11:49 AM Tao Yang <taoy...@apache.org> wrote:

> Hi, Chaoran
>
> Sorry for the late response. Yes, we did some performance tests early on
> and found that the scheduling process was far from transparent; just as
> you said, the internal metrics are not good enough to spot issues or
> locate bottlenecks. So we have explored several approaches to improve the
> visibility of the scheduling process:
> 1) Broaden the horizon: the scheduling process is just one part of the
> pod lifecycle, and we want to know every phase of that lifecycle and
> exactly where the biggest bottleneck is. We indeed found some much bigger
> bottlenecks elsewhere, in the APIServer, some CNI/CSI services, or the
> Kubelet, by parsing all the key times (e.g.
> create/scheduled/started/initialized/ready/containers-ready times) out of
> every Pod, aggregating the data, and charting it in Grafana. This helps a
> lot in quickly locating bottlenecks across the whole pod lifecycle.
> 2) Dig into the details: use an existing tracing framework (e.g.
> OpenTracing) to collect tracing information in a standardized format for
> scheduling and resource management. The traces follow the time and space
> sequence of the scheduling process and can be collected periodically or
> on demand to help spot issues. Please refer to YUNIKORN-387 for details;
> Weihao Zheng will keep working on this feature.
> 3) We also developed a simple profiling tool which can easily be injected
> anywhere and produces a statistics report periodically or on demand, so
> that we can clearly see the performance details of any process.
>
> Hope this helps. Thanks.
>
> Regards,
> Tao
>
> Chaoran Yu <yuchaoran2...@gmail.com> wrote on Thursday, April 15, 2021 at
> 4:04 AM:
>
>> Hello Tao,
>>
>> During our discussion with Wilfred yesterday, he mentioned that you folks
>> at Alibaba have been running YuniKorn at a decent scale. We are also
>> running some big workloads (Spark batch jobs) with YuniKorn and would
>> like better visibility into scheduling performance, as well as alerts to
>> help us spot issues as soon as they happen. We found that the current
>> list of metrics available in the core is not comprehensive, and some
>> metrics seem to be incorrectly computed. We are reaching out to kindly
>> ask: which metrics have you found most helpful? Did you add any new
>> metrics? A more general question is how you have been monitoring
>> YuniKorn. Many thanks in advance.
>>
>> If anyone else on the mailing list has ideas to chime in, that would be
>> awesome too.
>>
>> Regards,
>> Chaoran