Thanks, Tao, for the details; that's very helpful. Really appreciate it!
We'll look into the options you mentioned.

On Thu, Apr 15, 2021 at 11:49 AM Tao Yang <taoy...@apache.org> wrote:

> Hi, Chaoran
>
> Sorry for the late response. Yes, we did some performance tests and found
> that the scheduling process was far from transparent at the beginning;
> just as you said, the internal metrics are not good enough to spot issues
> or locate bottlenecks. So we have explored several approaches to improve
> the visibility of the scheduling process:
> 1) Broaden the horizon: the scheduling process is just one phase of the
> pod lifecycle, so we want to see every phase and know exactly where the
> biggest bottleneck is. By extracting all the key timestamps (e.g.
> create/scheduled/started/initialized/ready/containers-ready times) from
> every Pod, aggregating them, and charting them in Grafana, we indeed found
> bottlenecks that were much bigger elsewhere: in the APIServer, in some
> CNI/CSI services, or in the Kubelet. This helps a lot to quickly locate
> bottlenecks across the whole pod lifecycle (see the first sketch after
> this list).
> 2) Dig into more details: use an existing tracing framework (e.g.
> OpenTracing) to collect tracing information in a standardized format for
> scheduling and resource management. The traces follow the time and space
> sequence of the scheduling process and can be collected periodically or on
> demand to help spot issues (second sketch below). Please refer to
> YUNIKORN-387 for details; Weihao Zheng will keep working on this feature.
> 3) We also developed a simple profiling tool that can easily be injected
> at any place and emits a statistics report periodically or on demand, so
> that we can clearly see the performance details of any process (third
> sketch below).
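>
> For 1), a minimal sketch (illustrative only, not our production code),
> assuming client-go and a local kubeconfig, of pulling the key lifecycle
> timestamps out of every Pod so they can be aggregated and charted:
>
>   package main
>
>   import (
>       "context"
>       "fmt"
>       "path/filepath"
>
>       corev1 "k8s.io/api/core/v1"
>       metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
>       "k8s.io/client-go/kubernetes"
>       "k8s.io/client-go/tools/clientcmd"
>       "k8s.io/client-go/util/homedir"
>   )
>
>   // conditionTime returns when the given pod condition last changed,
>   // which is how the scheduled/initialized/ready times are exposed.
>   func conditionTime(pod *corev1.Pod, t corev1.PodConditionType) *metav1.Time {
>       for i := range pod.Status.Conditions {
>           if pod.Status.Conditions[i].Type == t {
>               return &pod.Status.Conditions[i].LastTransitionTime
>           }
>       }
>       return nil
>   }
>
>   func main() {
>       kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
>       config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
>       if err != nil {
>           panic(err)
>       }
>       client := kubernetes.NewForConfigOrDie(config)
>
>       pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
>       if err != nil {
>           panic(err)
>       }
>       // In a real setup these values would be pushed to a time-series
>       // store and charted in Grafana instead of printed.
>       for i := range pods.Items {
>           pod := &pods.Items[i]
>           fmt.Printf("%s/%s created=%v scheduled=%v initialized=%v ready=%v\n",
>               pod.Namespace, pod.Name, pod.CreationTimestamp,
>               conditionTime(pod, corev1.PodScheduled),
>               conditionTime(pod, corev1.PodInitialized),
>               conditionTime(pod, corev1.PodReady))
>       }
>   }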
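>
> For 2), a minimal sketch assuming the opentracing-go API; the span names
> and the schedulePod step are hypothetical, the real design is tracked in
> YUNIKORN-387:
>
>   package main
>
>   import (
>       "fmt"
>
>       opentracing "github.com/opentracing/opentracing-go"
>   )
>
>   // schedulePod stands in for one scheduling step; wrapping it in a child
>   // span makes the time and space sequence of the cycle visible in traces.
>   func schedulePod(parent opentracing.Span, podName string) {
>       span := opentracing.GlobalTracer().StartSpan(
>           "schedulePod", opentracing.ChildOf(parent.Context()))
>       defer span.Finish()
>
>       span.SetTag("pod", podName)
>       // ... predicates, scoring and node selection would happen here ...
>       span.SetTag("node", "node-1")
>   }
>
>   func main() {
>       // GlobalTracer defaults to a no-op tracer; a real deployment would
>       // register a concrete tracer (e.g. Jaeger) before scheduling starts.
>       root := opentracing.GlobalTracer().StartSpan("schedulingCycle")
>       defer root.Finish()
>
>       schedulePod(root, "spark-exec-1")
>       fmt.Println("cycle traced")
>   }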
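>
> For 3), a rough sketch (not the actual tool) of a profiler that can be
> dropped into any code path and prints a statistics report periodically or
> on demand:
>
>   package main
>
>   import (
>       "fmt"
>       "sync"
>       "time"
>   )
>
>   type profiler struct {
>       mu     sync.Mutex
>       counts map[string]int64
>       totals map[string]time.Duration
>   }
>
>   func newProfiler(reportEvery time.Duration) *profiler {
>       p := &profiler{counts: map[string]int64{}, totals: map[string]time.Duration{}}
>       // Periodic report; Report can also be called on demand.
>       go func() {
>           for range time.Tick(reportEvery) {
>               p.Report()
>           }
>       }()
>       return p
>   }
>
>   // Track times the enclosing section: defer p.Track("label")() at the top.
>   func (p *profiler) Track(label string) func() {
>       start := time.Now()
>       return func() {
>           p.mu.Lock()
>           defer p.mu.Unlock()
>           p.counts[label]++
>           p.totals[label] += time.Since(start)
>       }
>   }
>
>   func (p *profiler) Report() {
>       p.mu.Lock()
>       defer p.mu.Unlock()
>       for label, total := range p.totals {
>           n := p.counts[label]
>           fmt.Printf("%-12s calls=%d avg=%v\n", label, n, total/time.Duration(n))
>       }
>   }
>
>   func main() {
>       p := newProfiler(2 * time.Second)
>       for i := 0; i < 50; i++ {
>           func() {
>               defer p.Track("allocate")()
>               time.Sleep(5 * time.Millisecond)
>           }()
>       }
>       p.Report() // on-demand report
>   }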
>
> Hope this can help. Thanks.
>
> Regards,
> Tao
>
> On Thu, Apr 15, 2021 at 4:04 AM Chaoran Yu <yuchaoran2...@gmail.com> wrote:
>
>> Hello Tao,
>>
>> During our discussion with Wilfred yesterday, he mentioned that you folks
>> at Alibaba have been running YuniKorn at a decent scale. We are also
>> running some big workloads (Spark batch jobs) with YuniKorn and would like
>> better visibility into scheduling performance, as well as alerts to help
>> us spot issues as soon as they happen. We found that the current list of
>> metrics available in the core is not comprehensive, and some metrics seem
>> to be computed incorrectly. We are reaching out to kindly ask which
>> metrics you have found most helpful. Or did you add new metrics of your
>> own? A more general question: how have you been monitoring YuniKorn? Many
>> thanks in advance.
>>
>> If anyone else on the mailing list has ideas to chime in, that would be
>> awesome too.
>>
>> Regards,
>> Chaoran
>>
>
