Hi,

Thanks for writing this proposal with a detailed design. My comments are inline.
741550557 <[email protected]> wrote on Sun, Dec 8, 2019 at 11:22 PM:

> Thanks for your reply, I have carefully read these issues you mentioned,
> and these issues mentioned are very meaningful and critical. I will give
> you technical details about the issues you mentioned below.
> I find these issues are related, so I will explain them in different
> dimensions.
>
> use a different protocol to transmission trace and thread-stack:
> 1. add a boolean field in segment data, to record has thread monitored.
> and is good for filter monitored trace in UI.
> 2. add new BootService, storage Map to record relate trace-id and
> trace-stack information.

Since we already have a limit mechanism designed at the backend and agent side (according to your design), and the number would not be big (<10 most likely), we just need a list to store the trace-id(s).

> 3. listen TracingContextListener#afterFinished if the current segment has
> thread monitored, mark current trace-id don’t need to monitor anymore.
> (Cause if for-each the step 2 map, the remove operation will fail and throw
> exception).
> 4. when thread-monitor main thread running, It will for-each step 2 map
> and check is it don’t need monitor anymore, I will put data into new data
> carrier.
> 5. generate new thread-monitor gRPC protocol to send data from the data
> carrier.

The agent side design seems pretty good.

> the server receives thread-stack logic:
> 1. storage stack-stack informations and trace-id/segment-id relations on a
> different table.
> 2. check thread-monitor is need to be stop on receiving data or schedule.

Could you explain (2)? What do you mean by `stop`? I think your sampling mechanism should include the sampling duration.

> reduce CPU and memory in sniffer:
> 1. through the configuration of thread monitoring in the UI, you can
> configure the performance loss. For example, set the monitoring level: fast
> monitoring (100ms), medium speed monitoring (500ms), slow speed monitoring
> (1000ms).

The sampling period depends on how you are going to visualize it.

> 2. add new integer field on per thread-stack, if current thread-stack last
> element same as last time, don’t need storage, just increment it. I think
> it will save a lot of memory space.

I highly doubt this. It may reduce memory, but only if the code is running in a loop or facing a lock issue. If it is neither of these two, the stacks are different. Also, please consider the CPU cost of comparing the stacks. You need a performance benchmark to verify this if you want it.

> 3. create new VM args to setting thread-monitor pool size, It dependence on
> user, maybe default 3? (this can be discussed later)

I think the UI limit is enough. 3 seems good to me.

> 4. limit thread-stack-element size to 100, I think it can resolve most of
> the scenes already. It also can create a new VM args if need.
>
> multiple sampling methods can choose :(just my current thoughts, can add
> more)
> 1. base on current client SamplingService, extra a new factor holder to
> increment, and reset on schedule.

Yours may be a little more complex than the current SamplingService, right? Based on the next rule.

> 2. `first 5 traces of this endpoint in the next 5 mins`, it is a good idea. My
> understanding is that within a few minutes, each instance can send a
> specified number of traces.

The trace number and time window should be configurable; that is what I mean by more complex. In the current SamplingService, it is only n traces per 3 seconds. But here, it is a dynamic rule.

> UI settings and sniffer perception:
> 1. create a new button on the dashboard page, It can create or stop a
> thread-monitor. It can be dynamic load thread-monitor status when
> reselecting endpoint.

I think there should at least be a new level-one page, called the configuration or command page, which could set up the multiple sampling rules and visualize the existing tasks and related sampling data.

> 2.
> sniffer creates a new scheduled task to check the current service has
> need monitor endpoint each 5 seconds. (I see current sniffer has command
> functions, feel that principle is the same as the scheduler)

Seems reasonable.

> thread-monitor on the UI:(That’s just my initial thoughts, I think there
> will have a better way to show)
> 1. When switch to the trace page, I think we need to add a new switch
> button to filter thread-monitor trace.
> 2. make a new thread-monitor icon on the same segment. It means it has
> thread-stack information.

We don't have a separate thread monitor view table. How about we add an icon in the segment list, and an icon on the first span of this segment in the trace detail view? I think the latter should be the entrance to the thread view.

> 3. show on the information sidebox when the user clicks the thread-monitor
> segment(any span). create a new tab, like the log tab.

If you have a visualization idea, draw it with any tool you like that supports comments, and we could discuss it there. In my mind, we should support visualizing the thread dump stacks over the time window, and support aggregating them by choosing continuous stack snapshots in the time window.

> They're just a description of my current implementation details for
> thread-monitor if these seem to work. I can do some time planning for these
> tasks. Sorry, my English is not very well, hope you can understand. Maybe
> these seem to have some problem, any good idea or suggestion are welcome.

I very much appreciate you leading this new direction. It is a long-term task but should be interesting. :)

Good work, carry on.

> Original message
> From: Sheng [email protected]
> To: [email protected]
> Sent: Sunday, December 8, 2019, 08:31
> Subject: Re: A proposal for Skywalking(thread monitor)
>
> First of all, thanks for your proposal. Thread monitoring is super
> important for application performance. So basically, I agree with this
> proposal.
> But for tech details, I think we need more discussion in the
> following ways
> 1. Do you want to add thread status to the trace? If so, why
> don't consider this as a UI level join? Because we could know thread id in
> the trace when we create a span, right? Then we have all the thread
> dump(if), we could ask UI to query specific thread context based on
> timestamp and thread number(s).
> 2. For thread dump, I don't know whether
> you do the performance evaluation for this OP. From my experiences, `get
> all need thread monitor segment every 100 milliseconds` is a very high cost
> in your application and agent. So, you may need to be careful about doing
> this.
> 3. Endpoint related thread dump with some sampling mechanisms makes
> more sense to me. And this should be activated by UI. We should only
> provide a conditional thread dump sampling mechanism, such as `first 5
> traces of this endpoint in the next 5 mins`.
>
> Jian Tan I think DaoCloud also
> has customized this feature in your internal SkyWalking. Could you share
> what you do?
>
> Sheng Wu 吴晟
> Twitter, wusheng1108
>
> 741550557 [email protected] wrote on Sun, Dec 8, 2019 at 12:14 AM:
>
> Hello everyone, I would like to share a new
> feature with skywalking, called “thread monitor”.
>
> Background
> When our
> company used skywalking to APM earlier, we found that many traces did not
> have enough span to fill up, doubting whether there were some third-party
> frameworks that we didn't enhance or programmers API usage errors such as
> java CountDown number is 3 but there are only 2 countdowns. So we decide
> to write a new feature to monitor executing trace thread stack, then we
> can get more information on the trace, quick known what’s happening on
> that trace.
>
> Structure
> Agent(thread monitor) — gRPC protocol — OAP
> Server(Storage) — Skywalking-Rocketbot-UI
>
> More detail
> OAP Server:
> 1. Storage which traces need to monitor(i suggest storage on the endpoint,
> add new boolean field named needThreadMonitor)
> 2.
> Provide GraphQL API to
> change endpoint monitor status.
> 3. Monitor Trace parse, storage thread
> stack if the segment has any thread info.
>
> Skywalking-Rocketbot-UI:
> 1. Add a new switch button on the dashboard, It can read or modify endpoint
> status.
> 2. It will show every thread stack on click trace detail.
>
> Agent:
> 1. setup two new BootService:
>    1) find any need thread monitor
>    endpoint in current service, start on a new schedule task and works on
>    each minute.
>    2) start new schedule task to get all need thread monitor
>    segment each 100 milliseconds, and put a new thread dump task to a global
>    thread pool(fixed, count number default 3).
> 2. check endpoint need thread
> monitor on create entry/local span(TracingContext#createEntry/LocalSpan).
> If need, It will be marked and put into thread monitor map.
> 3. when
> TracingContext finishes, It will get thread has monitored, and send all
> thread stack to server.
>
> Finally, I don’t know it is a good idea to get
> more information on trace? If you have any good ideas or suggestions on
> this, please let me know.
>
> Mrpro
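
For reference, the conditional sampling rule discussed in this thread (`first 5 traces of this endpoint in the next 5 mins`, with a configurable trace number and time window) could be sketched roughly as below. This is only an illustration under stated assumptions, not SkyWalking code: the class `EndpointSamplingRule`, its methods, and the endpoint name used in the usage note are all hypothetical, and the caller is assumed to pass in the current time in milliseconds.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: allow at most maxTraces thread-monitored traces
 * for one endpoint inside a fixed time window.
 */
final class EndpointSamplingRule {
    private final String endpoint;      // endpoint this rule applies to
    private final int maxTraces;        // configurable trace number
    private final long startMillis;     // window start (inclusive)
    private final long durationMillis;  // configurable window length
    private final AtomicInteger sampled = new AtomicInteger();

    EndpointSamplingRule(String endpoint, int maxTraces, long startMillis, long durationMillis) {
        this.endpoint = endpoint;
        this.maxTraces = maxTraces;
        this.startMillis = startMillis;
        this.durationMillis = durationMillis;
    }

    /** True once the time window has passed, so the rule can be dropped. */
    boolean isExpired(long nowMillis) {
        return nowMillis >= startMillis + durationMillis;
    }

    /** Returns true if a trace on endpointName at nowMillis should be monitored. */
    boolean trySample(String endpointName, long nowMillis) {
        if (!endpoint.equals(endpointName)) {
            return false;
        }
        if (nowMillis < startMillis || isExpired(nowMillis)) {
            return false;
        }
        // Claim one of the remaining slots atomically, so concurrent
        // request threads can never exceed maxTraces.
        int current;
        do {
            current = sampled.get();
            if (current >= maxTraces) {
                return false;
            }
        } while (!sampled.compareAndSet(current, current + 1));
        return true;
    }
}
```

With something like `new EndpointSamplingRule("/order/create", 5, now, 5 * 60 * 1000L)`, the first five matching traces inside the window would be monitored and later ones skipped. Unlike a fixed "n traces per 3 seconds" counter, both the trace number and the window are per-rule parameters, which is the "dynamic rule" point made above.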
