Thanks for your reply. I have read the issues you raised carefully; they are all meaningful and critical. Below are the technical details for each of them.
Since the issues are related, I will explain them along several dimensions.


use a different protocol to transmit the trace and the thread stacks:
1. Add a boolean field to the segment data recording whether the thread was 
monitored. This also makes it easy to filter monitored traces in the UI.
2. Add a new BootService that keeps a map relating trace ids to their 
thread-stack information.
3. Listen on TracingContextListener#afterFinished: if the current segment has 
been thread-monitored, mark the trace id as no longer needing monitoring. 
(If we removed entries while for-eaching the map from step 2, the remove 
operation would fail with an exception.)
4. When the thread-monitor main thread runs, it iterates the map from step 2, 
checks which traces no longer need monitoring, and puts their data into a new 
data carrier.
5. Define a new thread-monitor gRPC protocol to send the data from that data 
carrier.
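The registry described in steps 2-4 could look roughly like the sketch below. All class and method names here are my own for illustration, not existing SkyWalking APIs; the key point is that afterFinished only marks an entry, and the monitor thread removes it later via the iterator, which avoids the exception mentioned in step 3.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: record stack snapshots per trace id, mark a trace finished
// from TracingContextListener#afterFinished, and let the monitor thread drain
// finished traces without a ConcurrentModificationException.
class ThreadStackRegistry {

    static final class Entry {
        final List<String> snapshots = new ArrayList<>();
        volatile boolean finished;
    }

    private final Map<String, Entry> byTraceId = new ConcurrentHashMap<>();

    // Called each time the sampling thread captures a stack for a trace.
    void record(String traceId, String stackSnapshot) {
        byTraceId.computeIfAbsent(traceId, id -> new Entry())
                 .snapshots.add(stackSnapshot);
    }

    // Step 3: called from afterFinished. Only mark the entry; removing it
    // here could race with the monitor thread's iteration.
    void markFinished(String traceId) {
        Entry e = byTraceId.get(traceId);
        if (e != null) {
            e.finished = true;
        }
    }

    // Step 4: the monitor thread moves finished traces into the data carrier.
    // ConcurrentHashMap's iterator supports remove() during iteration.
    Map<String, List<String>> drainFinished() {
        Map<String, List<String>> out = new HashMap<>();
        Iterator<Map.Entry<String, Entry>> it = byTraceId.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Entry> me = it.next();
            if (me.getValue().finished) {
                out.put(me.getKey(), me.getValue().snapshots);
                it.remove();
            }
        }
        return out;
    }
}
```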


the server-side logic for receiving thread stacks:
1. Store the thread-stack information and the trace-id/segment-id relations 
in a separate table.
2. Check whether thread monitoring needs to be stopped, both when data is 
received and on a schedule.
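For the separate table in point 1, a row could carry something like the fields below. These names are purely an illustration, not the real OAP storage model:

```java
// Hypothetical row layout for the thread-stack table: one row per captured
// dump, linked back to its trace and segment.
record ThreadStackRecord(
        String traceId,
        String segmentId,
        long dumpTimeMillis,   // when the stack was captured
        int repeatCount,       // consecutive identical dumps (see memory section)
        String stackFrames) {  // joined frame list, e.g. "A.run|B.call"
}
```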


reduce CPU and memory cost in the sniffer:
1. Let the user configure the acceptable performance loss through the 
thread-monitoring settings in the UI. For example, offer monitoring levels: 
fast (100 ms), medium (500 ms), and slow (1000 ms) sampling intervals.
2. Add a new integer field per thread stack: if the current stack is the same 
as the previous one, do not store it again, just increment the counter. I 
think this will save a lot of memory.
3. Add a new VM argument to set the thread-monitor pool size. It depends on 
the user; maybe default 3? (This can be discussed later.)
4. Limit the stack depth to 100 elements; I think that already covers most 
scenarios. A new VM argument can be added for this too if needed.
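Ideas 2 and 4 could be sketched as below (class name is hypothetical). Here the whole snapshot is compared; comparing only the top frame, as suggested above, would be a cheaper but less precise variant:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: consecutive identical snapshots collapse into a repeat counter
// instead of being stored again, and each snapshot is capped at MAX_DEPTH
// frames.
class StackSnapshotBuffer {

    static final int MAX_DEPTH = 100;

    static final class Snapshot {
        final List<String> frames;
        int repeat = 1;  // the new integer field per thread stack (idea 2)
        Snapshot(List<String> frames) { this.frames = frames; }
    }

    private final List<Snapshot> snapshots = new ArrayList<>();

    void add(List<String> frames) {
        if (frames.size() > MAX_DEPTH) {
            frames = frames.subList(0, MAX_DEPTH);  // idea 4: cap stack depth
        }
        Snapshot last = snapshots.isEmpty()
                ? null : snapshots.get(snapshots.size() - 1);
        if (last != null && last.frames.equals(frames)) {
            last.repeat++;  // same stack as last time: just increment
        } else {
            snapshots.add(new Snapshot(new ArrayList<>(frames)));
        }
    }

    int size() { return snapshots.size(); }
    int repeatOf(int i) { return snapshots.get(i).repeat; }
}
```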


multiple sampling methods to choose from (just my current thoughts; more can 
be added):
1. Based on the current client SamplingService, add an extra factor holder 
that is incremented per trace and reset on a schedule.
2. `first 5 traces of this endpoint in the next 5 mins` is a good idea. My 
understanding is that within a given window, each instance can send a 
specified number of traces.
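A possible shape for that windowed quota, with hypothetical names (not the real SamplingService): a per-endpoint counter admits the first N traces, and a scheduled task resets it to open the next window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of `first N traces of this endpoint in the next M mins`.
class WindowedSampler {

    private final int quotaPerWindow;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    WindowedSampler(int quotaPerWindow) {
        this.quotaPerWindow = quotaPerWindow;
    }

    // Called when a trace starts on the endpoint; true means "monitor it".
    boolean trySample(String endpoint) {
        AtomicInteger c = counters.computeIfAbsent(endpoint,
                e -> new AtomicInteger());
        return c.incrementAndGet() <= quotaPerWindow;
    }

    // Called by a scheduled task (e.g. every 5 minutes) to open a new window.
    void resetWindow() {
        counters.clear();
    }
}
```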


UI settings and how the sniffer learns about them:
1. Add a new button on the dashboard page that starts or stops a thread 
monitor, and dynamically loads the thread-monitor status when an endpoint is 
reselected.
2. The sniffer creates a scheduled task that checks every 5 seconds whether 
the current service has any endpoint that needs monitoring. (I see the 
current sniffer already has command functions; the principle feels the same 
as a scheduler.)
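The polling task in point 2 could be sketched like this, with all names hypothetical; the Supplier stands in for the real gRPC/command call to the OAP server:

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch: poll the server every 5 seconds for the endpoints of this service
// that need monitoring, and expose the latest view to span creation.
class MonitoredEndpointPoller {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile Supplier<Set<String>> fetchFromServer;
    private volatile Set<String> monitored = Set.of();

    void start(Supplier<Set<String>> fetchFromServer) {
        this.fetchFromServer = fetchFromServer;
        refresh();  // synchronous first fetch so callers see a fresh view
        scheduler.scheduleAtFixedRate(this::refresh, 5, 5, TimeUnit.SECONDS);
    }

    private void refresh() {
        monitored = fetchFromServer.get();
    }

    // Checked by the agent when a span is created on the endpoint.
    boolean needsMonitor(String endpoint) {
        return monitored.contains(endpoint);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```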


thread-monitor on the UI (just my initial thoughts; there may well be a 
better way to show it):
1. On the trace page, add a new switch button to filter traces that have 
thread-monitor data.
2. Show a new thread-monitor icon on segments that carry thread-stack 
information.
3. When the user clicks any span of a thread-monitored segment, show the 
stacks in the information sidebox in a new tab, like the log tab.


This is just a description of my current implementation plan for the thread 
monitor. If it looks workable, I can draw up a time plan for these tasks. 
Sorry, my English is not very good; I hope you can understand. These ideas 
may still have problems, so any good ideas or suggestions are welcome.




Original message
From: Sheng [email protected]
To: [email protected]
Date: Sunday, December 8, 2019, 08:31
Subject: Re: A proposal for Skywalking(thread monitor)


First of all, thanks for your proposal. Thread monitoring is super important 
for application performance. So basically, I agree with this proposal. But 
for tech details, I think we need more discussion in the following ways

1. Do you want to add thread status to the trace? If so, why don't consider 
this as a UI level join? Because we could know thread id in the trace when we 
create a span, right? Then we have all the thread dump(if), we could ask UI 
to query specific thread context based on timestamp and thread number(s).
2. For thread dump, I don't know whether you do the performance evaluation 
for this OP. From my experiences, `get all need thread monitor segment every 
100 milliseconds` is a very high cost in your application and agent. So, you 
may need to be careful about doing this.
3. Endpoint related thread dump with some sampling mechanisms makes more 
sense to me. And this should be activated by UI. We should only provide a 
conditional thread dump sampling mechanism, such as `first 5 traces of this 
endpoint in the next 5 mins`.

Jian Tan, I think DaoCloud also has customized this feature in your internal 
SkyWalking. Could you share what you do?

Sheng Wu 吴晟
Twitter, wusheng1108

741550557 [email protected] wrote on Sunday, December 8, 2019, at 12:14 AM:

 Hello everyone,

 I would like to share a new feature with skywalking, called “thread 
 monitor”.

 Background
 When our company used skywalking to APM earlier, we found that many traces 
 did not have enough span to fill up, doubting whether there were some 
 third-party frameworks that we didn't enhance or programmers API usage 
 errors such as java CountDown number is 3 but there are only 2 countdowns. 
 So we decide to write a new feature to monitor executing trace thread 
 stack, then we can get more information on the trace, quick known what’s 
 happening on that trace.

 Structure
 Agent(thread monitor) — gRPC protocol — OAP Server(Storage) — 
 Skywalking-Rocketbot-UI

 More detail
 OAP Server:
 1. Storage which traces need to monitor(i suggest storage on the endpoint, 
 add new boolean field named needThreadMonitor)
 2. Provide GraphQL API to change endpoint monitor status.
 3. Monitor Trace parse, storage thread stack if the segment has any thread 
 info.

 Skywalking-Rocketbot-UI:
 1. Add a new switch button on the dashboard, It can read or modify 
 endpoint status.
 2. It will show every thread stack on click trace detail.

 Agent:
 1. setup two new BootService:
 1) find any need thread monitor endpoint in current service, start on a 
 new schedule task and works on each minute.
 2) start new schedule task to get all need thread monitor segment each 100 
 milliseconds, and put a new thread dump task to a global thread 
 pool(fixed, count number default 3).
 2. check endpoint need thread monitor on create entry/local 
 span(TracingConext#createEntry/LocalSpan). If need, It will be marked and 
 put into thread monitor map.
 3. when TracingContext finishes, It will get thread has monitored, and 
 send all thread stack to server.

 Finally, I don’t know it is a good idea to get more information on trace? 
 If you have any good ideas or suggestions on this, please let me know.

 Mrpro
