[ https://issues.apache.org/jira/browse/YUNIKORN-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200663#comment-17200663 ]
Weihao Zheng commented on YUNIKORN-387:
---------------------------------------

Hi [~wilfreds] and [~adam.antal]!

Thanks for your further comments.

We agree that we don't need to implement a UI and REST API for showing traces. Existing tools are enough for our requirements. Currently we only keep the REST endpoint that controls on-demand tracing.

We must use external analysis tools to aggregate the traces we collect and derive metrics from them. I think metrics derived from tracing are more flexible than Prometheus metrics because we can change the configuration of the analysis tools dynamically, for example by changing the query to look at another aspect of the traces, and we can aggregate data based on the historical traces we have stored. So they can be used in some on-demand monitoring situations.

We think tracing the scheduling process will provide a lot of detail about the core, so we focus on it in the current design. Tracing objects and states in the core is also an important topic. We regard it as a natural continuation of tracing the shim's objects and states, because all of these resource requests and their corresponding traces begin in the shim, and the way we trace resources in the core will not differ much from the way we trace them in the shim. That is why we don't mention a resource tracer in the core.

We can set a span's SamplingPriority to 1 to force Jaeger to collect that trace. We can develop the on-demand feature based on that.

Sampling conflicts with counter metrics. We can use Prometheus to collect these counter metrics if we don't use the const sampler. Sampling is still useful for average metrics or for metrics used to draw distribution graphs.
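To make the last two points concrete, here is a minimal sketch in Go. It is a toy model built only on the standard library, not the Jaeger client or the YuniKorn code: the `span` type, `forceSample`, `keep`, `everyNth` and `countKept` are all hypothetical names invented for illustration. The only thing taken from the real world is the standard OpenTracing tag name `sampling.priority`; with opentracing-go the equivalent call would be `ext.SamplingPriority.Set(span, 1)`. The sketch shows that a priority above zero overrides the sampler's decision, and that a counter derived from sampled traces undercounts unless every span is kept.

```go
package main

import "fmt"

// samplingPriorityTag is the standard OpenTracing tag name.
const samplingPriorityTag = "sampling.priority"

// span is a hypothetical stand-in for a tracing span (not a Jaeger type).
type span struct {
	name string
	tags map[string]uint16
}

// forceSample marks a span so it is kept regardless of the sampler.
func forceSample(s *span) {
	s.tags[samplingPriorityTag] = 1
}

// keep reports whether a span is collected: a priority above zero
// overrides whatever the sampler decided.
func keep(s *span, samplerDecision bool) bool {
	if p, ok := s.tags[samplingPriorityTag]; ok && p > 0 {
		return true
	}
	return samplerDecision
}

// everyNth is a deterministic toy sampler that keeps one span in n.
func everyNth(i, n int) bool {
	return i%n == 0
}

// countKept simulates `total` scheduling cycles and returns how many
// spans are collected. With onDemand set, every span is force-sampled.
func countKept(total, n int, onDemand bool) int {
	kept := 0
	for i := 0; i < total; i++ {
		s := &span{name: "schedule", tags: map[string]uint16{}}
		if onDemand {
			forceSample(s)
		}
		if keep(s, everyNth(i, n)) {
			kept++
		}
	}
	return kept
}

func main() {
	// A counter derived from sampled traces sees only a fraction of the events.
	fmt.Println("sampled:", countKept(1000, 10, false))
	// With on-demand tracing enabled, every span is collected.
	fmt.Println("on-demand:", countKept(1000, 10, true))
}
```

In this model the sampled run reports only a tenth of the scheduling cycles, which is why counter metrics are better left to Prometheus whenever a non-const sampler is in use.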
> Use Tracing to Improve YuniKorn's Observability
> -----------------------------------------------
>
> Key: YUNIKORN-387
> URL: https://issues.apache.org/jira/browse/YUNIKORN-387
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: core - scheduler, shim - kubernetes
> Reporter: Weihao Zheng
> Priority: Major
>
> We can use existing tracing framework to collect tracing information in a
> standardized format for scheduling and resource management. It will improve
> YuniKorn's observability significantly with less work. Here are our design
> ideas:
> [https://docs.google.com/document/d/1MKL9SfTH8Pjw6kBM0vRnyv_ctnxBHAz-iuA7Zbux60E/edit?usp=sharing]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)