[
https://issues.apache.org/jira/browse/YUNIKORN-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188286#comment-17188286
]
Wilfred Spiegelenburg commented on YUNIKORN-387:
------------------------------------------------
I read through the design doc and I see at least one basic problem.
The design is based on a trace implementation in the shim. That goes against
the base design of YuniKorn. The whole reason we designed the core with shims
is to support more than one deployment model. The main principle is multiple
shims per core, but it also covers running multiple versions of the same shim
against one core. The current build has the core and shim integrated into one
binary. It is not supposed to be one binary; that is only a temporary
implementation. Having the shim and core in one binary is already giving us
issues around dependency management, upgrades and multi-version support.
If we now push tracing into the k8s shim we will break the core-shim design
completely. I therefore do not think that is the right way to go. We must be
able to have a deployment with one core and multiple shims. The core must be
the central point for its own tracing. If we have multiple shims, whether
multiple k8s shims or different shims, we cannot have one shim being the trace
point for the whole deployment.
One shim does not know what another shim does and has no knowledge of its
objects. There is no way to choose which shim will do the trace collection.
Tracing in the core would then need to be split to track the traces of
multiple shims: do we have one or more shims pulling trace info? How do we
handle a shim not pulling data? Linking the trace collection to the shim
registration would probably be needed too. While running, the core should
figure out whether a trace collector is available, and handle the trace
collector going down and data backing up for a period of time.
Shims and the core interact using protobuf and gRPC as defined in the
scheduler-interface. We should be able to link the tracing between the shim
and the core for the objects we trace. An application comes to mind, but it
might also be useful for a node, etc. I do not see that anywhere in the design.
An application that is traced on the shim side should really be linked to the
traces on the core side. We should therefore pass some trace context or tags
between a shim and the core.
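As a rough sketch of that idea, the shim could write the trace context into a
tag on the object it sends, and the core could read it back and continue the
same trace. Everything named here is an assumption for illustration:
TraceContext, Inject, Extract and the "traceparent" tag key are hypothetical
and not defined by the scheduler-interface. The value layout follows the W3C
Trace Context traceparent format (version, trace ID, span ID, flags), with the
flags simplified to sampled or not sampled.

```go
package main

import (
	"fmt"
	"strings"
)

// TraceContext is a hypothetical carrier for the identifiers that link
// spans created in a shim to spans created in the core.
type TraceContext struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool
}

// traceTagKey is an assumed tag name, not part of the scheduler-interface.
const traceTagKey = "traceparent"

// Inject is called on the shim side: it stores the context in the tags of
// the object (for example an application) that is sent to the core.
func Inject(tc TraceContext, tags map[string]string) {
	flags := "00"
	if tc.Sampled {
		flags = "01"
	}
	tags[traceTagKey] = fmt.Sprintf("00-%s-%s-%s", tc.TraceID, tc.SpanID, flags)
}

// Extract is called on the core side: it returns the context found in the
// tags, or false if the tag is missing or malformed.
func Extract(tags map[string]string) (TraceContext, bool) {
	parts := strings.Split(tags[traceTagKey], "-")
	if len(parts) != 4 || parts[0] != "00" ||
		len(parts[1]) != 32 || len(parts[2]) != 16 {
		return TraceContext{}, false
	}
	return TraceContext{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: parts[3] == "01", // simplification: only the sampled flag
	}, true
}

func main() {
	// Shim side: attach the context to the tags of an application.
	tags := map[string]string{}
	Inject(TraceContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	}, tags)

	// Core side: recover the context and continue the same trace.
	if tc, ok := Extract(tags); ok {
		fmt.Println("core continues trace", tc.TraceID, "with parent span", tc.SpanID)
	}
}
```

With something like this each side keeps its own tracer and reports to the
collector independently; only the identifiers cross the interface.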
I also thought that the whole purpose of the distributed tracing was to
implement it independently for all the services and share context where needed.
This design seems to be doing the complete opposite. It uses one interaction
point to push the traces for multiple services. It therefore creates a custom
interface between the two services for which there is no real justification.
> Use Tracing to Improve YuniKorn's Observability
> -----------------------------------------------
>
> Key: YUNIKORN-387
> URL: https://issues.apache.org/jira/browse/YUNIKORN-387
> Project: Apache YuniKorn
> Issue Type: New Feature
> Components: core - scheduler, shim - kubernetes
> Reporter: Weihao Zheng
> Priority: Major
>
> We can use an existing tracing framework to collect tracing information in a
> standardized format for scheduling and resource management. It will improve
> YuniKorn's observability significantly with less work. Here are our design
> ideas:
> [https://docs.google.com/document/d/1MKL9SfTH8Pjw6kBM0vRnyv_ctnxBHAz-iuA7Zbux60E/edit?usp=sharing]