[jira] [Commented] (YUNIKORN-387) Use Tracing to Improve YuniKorn's Observability

Wilfred Spiegelenburg (Jira) Tue, 01 Sep 2020 02:12:58 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188286#comment-17188286
 ]


Wilfred Spiegelenburg commented on YUNIKORN-387:
------------------------------------------------

I read through the design doc and I see at least one basic problem.

The design is based on a trace implementation in the shim. That goes against 
the base design of YuniKorn. The whole reason why we designed the core with 
shims is to allow us to do multiple things. The main principle is multiple 
shims per core, but it also covers running multiple versions of the same shim 
against one core. The current build integrates a core and a shim into one 
binary. It is not supposed to be one binary; that is only a temporary 
implementation. Having the shim and core in one binary is already giving us 
issues around dependency management, upgrades and multi-version support.

If we now push tracing into the k8shim we will break the core-shim design 
completely. I do not think that is the right way to go. We must be able to 
have a deployment with one core and multiple shims. The core must be the 
central point for its own tracing. If we have multiple shims, whether multiple 
k8s shims or different shims, we cannot have one shim being the trace point 
for the whole deployment.
One shim does not know what the other shims do and has no knowledge of their 
objects. There is no way to choose which shim will do the trace collection. 
Tracing in the core would then need to be split to track the traces of 
multiple shims: do we have one or more shims pulling trace info? How do we 
handle a shim not pulling data? Linking the trace collection to the shim 
registration would probably be needed too. The core should, while running, 
figure out whether a trace collector is available. It should also handle the 
trace collector going down and data backing up for a period of time.
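
To illustrate the last point, here is a minimal sketch of how the core could 
hold spans while the collector is down. It is not taken from the design doc: 
the names Span, Collector and SpanBuffer are hypothetical, the drop-oldest 
policy is just one possible choice, and the buffer is not safe for concurrent 
use as written.

```go
package main

import (
	"errors"
	"fmt"
)

// Span is a hypothetical, minimal stand-in for a finished trace span.
type Span struct {
	Name string
}

// Collector is a hypothetical interface for the backend that receives spans.
type Collector interface {
	Send(s Span) error
}

// SpanBuffer holds spans in the core while the collector is unreachable.
// It is bounded: once full, the oldest span is dropped (an assumed policy).
type SpanBuffer struct {
	spans   []Span
	limit   int
	dropped int
}

// NewSpanBuffer returns a buffer that keeps at most limit spans.
func NewSpanBuffer(limit int) *SpanBuffer {
	return &SpanBuffer{limit: limit}
}

// Add queues a span, evicting the oldest one if the buffer is full.
func (b *SpanBuffer) Add(s Span) {
	if b.limit <= 0 {
		b.dropped++
		return
	}
	if len(b.spans) == b.limit {
		b.spans = b.spans[1:]
		b.dropped++
	}
	b.spans = append(b.spans, s)
}

// Flush sends queued spans in order. It stops at the first failure and keeps
// the unsent spans for a later attempt. It returns the number of spans sent.
func (b *SpanBuffer) Flush(c Collector) (int, error) {
	sent := 0
	for len(b.spans) > 0 {
		if err := c.Send(b.spans[0]); err != nil {
			return sent, err
		}
		b.spans = b.spans[1:]
		sent++
	}
	return sent, nil
}

// Len reports how many spans are currently queued.
func (b *SpanBuffer) Len() int { return len(b.spans) }

// Dropped reports how many spans were discarded because the buffer was full.
func (b *SpanBuffer) Dropped() int { return b.dropped }

// flakyCollector simulates a collector that can go down and come back.
type flakyCollector struct {
	up       bool
	received []Span
}

func (f *flakyCollector) Send(s Span) error {
	if !f.up {
		return errors.New("collector unavailable")
	}
	f.received = append(f.received, s)
	return nil
}

func main() {
	buf := NewSpanBuffer(2)
	col := &flakyCollector{up: false}
	for _, name := range []string{"a", "b", "c"} {
		buf.Add(Span{Name: name})
	}
	_, err := buf.Flush(col)
	fmt.Println("flush while down:", err, "queued:", buf.Len(), "dropped:", buf.Dropped())

	col.up = true
	sent, _ := buf.Flush(col)
	fmt.Println("flush after recovery, sent:", sent)
}
```

The bound limits how much data can back up in the core. How long to keep 
spans and what to drop would still need to be decided in the design.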

Shims and the core interact using protobuf and gRPC as defined in the 
scheduler-interface. We should be able to link the tracing between the shim 
and the core for the objects we trace. An application comes to mind, but it 
might also be good for a node etc. I do not see that anywhere in the design.
An application that is traced on the shim side should really be linked to the 
traces on the core side. We should thus communicate some context or tags for 
the tracing context between a shim and the core.
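
As a rough sketch of what passing such context could look like, the example 
below carries a trace context in a string-to-string tag map, which a shim 
could attach to an application it sends to the core. Everything here is an 
assumption made for illustration: the tag key, the "traceID:spanID" encoding 
and the TraceContext, Inject and Extract names are hypothetical and are not 
part of the scheduler-interface.

```go
package main

import (
	"fmt"
	"strings"
)

// traceTagKey is a hypothetical tag name used to carry the trace context.
const traceTagKey = "example.trace/context"

// TraceContext is a hypothetical, minimal trace context: the trace the
// object belongs to and the span on the sending side that is the parent.
type TraceContext struct {
	TraceID string
	SpanID  string
}

// Inject stores the trace context in the tag map (shim side). A nil map is
// replaced by a new one so callers can pass tags that were never set.
func Inject(tags map[string]string, tc TraceContext) map[string]string {
	if tags == nil {
		tags = make(map[string]string)
	}
	tags[traceTagKey] = tc.TraceID + ":" + tc.SpanID
	return tags
}

// Extract reads the trace context back from the tag map (core side). It
// returns false if the tag is missing or malformed, in which case the core
// would start an unlinked trace of its own.
func Extract(tags map[string]string) (TraceContext, bool) {
	raw, ok := tags[traceTagKey]
	if !ok {
		return TraceContext{}, false
	}
	parts := strings.SplitN(raw, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return TraceContext{}, false
	}
	return TraceContext{TraceID: parts[0], SpanID: parts[1]}, true
}

func main() {
	// Shim side: attach the context to the tags sent with the application.
	tags := Inject(nil, TraceContext{TraceID: "trace-1", SpanID: "shim-span-7"})

	// Core side: continue the same trace with the shim span as parent.
	if tc, ok := Extract(tags); ok {
		fmt.Println("core continues trace", tc.TraceID, "with parent", tc.SpanID)
	}
}
```

With something along these lines each service keeps its own tracing and only 
the context crosses the interface, so an application traced in the shim can 
be linked to its traces in the core.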

I also thought that the whole purpose of distributed tracing was to 
implement it independently for all the services and share context where needed. 
This design seems to be doing the complete opposite. It uses one interaction 
point to push the traces for multiple services. It therefore creates a custom 
interface between the two services for which there is no real justification.


> Use Tracing to Improve YuniKorn's Observability
> -----------------------------------------------
>
>                 Key: YUNIKORN-387
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-387
>             Project: Apache YuniKorn
>          Issue Type: New Feature
>          Components: core - scheduler, shim - kubernetes
>            Reporter: Weihao Zheng
>            Priority: Major
>
> We can use an existing tracing framework to collect tracing information in a 
> standardized format for scheduling and resource management. It will improve 
> YuniKorn's observability significantly with less work. Here are our design 
> ideas: 
> [https://docs.google.com/document/d/1MKL9SfTH8Pjw6kBM0vRnyv_ctnxBHAz-iuA7Zbux60E/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
