Thank you for your reply; the issues you raised are critical and meaningful.
I will answer each of them here. Sorry, I'm not good at inline comment mode, so
I use different colors and a “ “ prefix to separate questions and answers.


 As we have already designed a limit mechanism on the backend and agent
 side (according to your design), and the number would not be big (10 most
 likely), we just need a list to store the trace-id(s).


If we only need a list to store the trace-id(s), how can I map a trace-id to
its thread? I hoped to use the map to quickly find the thread info from a
trace-id.
How can I get the thread-stack information with your approach? Could you
please elaborate?
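
To make the question concrete, here is a minimal sketch of the map I have in
mind. The class and method names (MonitoredTraceRegistry, register,
unregister, dump) are hypothetical and only for illustration; they are not
existing SkyWalking APIs. It assumes the monitored segment is registered from
the thread that executes it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry; names are illustrative, not SkyWalking API.
class MonitoredTraceRegistry {
    // trace-id -> thread that is currently executing the monitored segment
    private final Map<String, Thread> threadsByTraceId = new ConcurrentHashMap<>();

    // Called on the thread that runs the monitored segment.
    void register(String traceId) {
        threadsByTraceId.put(traceId, Thread.currentThread());
    }

    // Called when the segment finishes, so the trace is no longer monitored.
    void unregister(String traceId) {
        threadsByTraceId.remove(traceId);
    }

    // Returns the current stack of the mapped thread, or an empty array
    // when the trace-id is not registered.
    StackTraceElement[] dump(String traceId) {
        Thread thread = threadsByTraceId.get(traceId);
        return thread == null ? new StackTraceElement[0] : thread.getStackTrace();
    }

    int size() {
        return threadsByTraceId.size();
    }
}
```

With a plain list of trace-id(s), the dump step would need some other way to
find the thread, which is the part I do not see yet.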


 Could you explain (2)? What do you mean by `stop`? I think your
 sampling mechanism should include the sampling duration.


As for the communication between the sniffer and the OAP server, I hope the
sniffer only needs to fetch the thread-monitor tasks that need to be monitored
at that moment. Monitoring can be terminated by either the sniffer or the
OAP server.
If it relies only on an OAP server notification, it may be more complicated,
because the OAP server needs to record that the sniffer has received the
current command, and the sniffer is not stable. For example, if the sniffer
has shut down when the command arrives, no thread information is collected.


I think that letting the OAP server actively decide termination makes the
monitoring more controllable; of course, the client can also actively report
the end.
I think it’s necessary to provide a protection mechanism for the sniffer side,
so that monitoring can be released quickly during a business peak period or
when the probe suddenly occupies a lot of CPU / memory resources. Therefore, a
stop-monitoring function can be provided in the UI, so that the sniffer can
recover.
A sampling duration is required, but only as the default condition for
terminating a thread-monitor task.
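
A minimal sketch of these two termination conditions, assuming a hypothetical
ThreadMonitorTask class (the name and fields are illustrative, not an existing
SkyWalking type): the task ends either on an explicit stop or when the default
sampling duration has elapsed.

```java
// Hypothetical task model; names are illustrative, not SkyWalking API.
class ThreadMonitorTask {
    private final long startMillis;
    private final long maxDurationMillis;
    private volatile boolean stopped;

    ThreadMonitorTask(long startMillis, long maxDurationMillis) {
        this.startMillis = startMillis;
        this.maxDurationMillis = maxDurationMillis;
    }

    // Explicit stop, e.g. triggered from the UI or reported by the sniffer.
    void stop() {
        stopped = true;
    }

    // Active until explicitly stopped or until the default duration elapses.
    boolean isActive(long nowMillis) {
        return !stopped && nowMillis - startMillis < maxDurationMillis;
    }
}
```

The current time is passed in as a parameter so that the sniffer and the OAP
server could both evaluate the same rule.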


 The sampling period depends on how you are going to visualize it.


Yes, I agree. I hope the UI can provide a select/input so that the trace count
and time window are configurable. Of course, this is my current idea, and if
there are other plans, I will adopt them.


 Highly doubt about this, reduce the memory, maybe, only reduce if the codes
 are running the loop or facing lock issue. But if it is neither of these
 two, they are different.
 Also, please consider the CPU cost of the comparison of the stack. You need
 a performance benchmark to verify if you want this.


I didn’t understand the first sentence. In my personal experience, most cases
are either blocked on a lock (socket/local) or running in a loop. I cannot
think of any other cases.
For the second sentence, I think I can add a thread-stack-element field to
store the top-level element of the last stack information. The next time I get
the stack information, I can check whether the current top-level element is the
same as that field.
I do this mainly to keep duplicate thread-stack information from taking up
too much memory; it is a memory optimization. We can consider removing it, or
do you have a better memory-saving solution? After all, memory and CPU
resources are very valuable in the sniffer.
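
A minimal sketch of this comparison, assuming hypothetical names
(StackSnapshotBuffer, Snapshot, repeatCount) and assuming that "top-level
element" means index 0 of the array returned by Thread#getStackTrace (the most
recent frame):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer; names are illustrative, not SkyWalking API.
class StackSnapshotBuffer {
    static final class Snapshot {
        final StackTraceElement[] stack;
        int repeatCount = 1; // how many consecutive dumps shared the top frame

        Snapshot(StackTraceElement[] stack) {
            this.stack = stack;
        }
    }

    private final List<Snapshot> snapshots = new ArrayList<>();

    // Stores the dump, or only increments the counter of the previous
    // snapshot when the top frame (index 0) is unchanged.
    void add(StackTraceElement[] stack) {
        if (!snapshots.isEmpty() && stack.length > 0) {
            Snapshot last = snapshots.get(snapshots.size() - 1);
            if (last.stack.length > 0 && last.stack[0].equals(stack[0])) {
                last.repeatCount++;
                return;
            }
        }
        snapshots.add(new Snapshot(stack));
    }

    List<Snapshot> snapshots() {
        return snapshots;
    }
}
```

Comparing only one element keeps the CPU cost small, but two different call
paths that end in the same top frame would be merged, so a benchmark would
still be needed to judge the trade-off.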


 The trace number and time window should be configurable, that is what I
 mean by more complex. In the current SamplingService, only n traces per 3
 seconds. But here, it is a dynamic rule.


I expect that the trace count and time window can be configured at the UI
level, as I said above.
For SamplingService, my previous tech design was not rigorous enough, and there
were indeed problems.
Maybe we need to add a new SamplingService that builds a mapping from
endpoint-id to an AtomicInteger.
For `first 5 traces of this endpoint in the next 5 mins`, we just need to
increment it.
For sampling, maybe use another scheduled task to reset the AtomicInteger
value.
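
A minimal sketch of that mapping, assuming a hypothetical
EndpointSamplingCounter class (not the existing SamplingService) and an
integer endpoint-id:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical counter; names are illustrative, not SkyWalking API.
class EndpointSamplingCounter {
    private final Map<Integer, AtomicInteger> countsByEndpointId = new ConcurrentHashMap<>();
    private final int maxTracesPerWindow;

    EndpointSamplingCounter(int maxTracesPerWindow) {
        this.maxTracesPerWindow = maxTracesPerWindow;
    }

    // Returns true while the endpoint is still under its quota for the
    // current window; the counter stops growing once the quota is reached.
    boolean trySample(int endpointId) {
        AtomicInteger counter =
            countsByEndpointId.computeIfAbsent(endpointId, id -> new AtomicInteger());
        int previous = counter.getAndUpdate(c -> c < maxTracesPerWindow ? c + 1 : c);
        return previous < maxTracesPerWindow;
    }

    // Intended to be called by a scheduled task at the end of each window.
    void reset() {
        countsByEndpointId.values().forEach(c -> c.set(0));
    }
}
```

For `first 5 traces of this endpoint in the next 5 mins`, the quota would be 5
and reset() would run every 5 minutes, for example from a
ScheduledExecutorService.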


 I think there should at least be a new level-one page called configuration
 or command page, which could set up multiple sampling rules and visualize
 the existing tasks and related sampling data.


I think it’s necessary to add a new page for configuring thread-monitor
tasks, and the specific UI display should be designed in detail.
For example, what I expect is similar to the trace page: the left side
displays the configuration, and the right side displays the related trace
list. Clicking a trace links to the trace page and opens the sidebox.
I’m not good at this. Do you have any good plans?
I also feel that the two of us understand the configuration object
differently. I think it is more of a task than a command. I don't know which
way is better.
I just thought of a problem: some real problems are often triggered at a
specific period, such as a fixed business peak period, and we cannot guarantee
that the user will be operating the UI at that time.
So should a task mechanism be adopted to ensure that monitoring can run at a
specific period?


 We don't have a separate thread monitor view table. How about we add an
 icon to the segment list, and add an icon to the first span of this segment
 in the trace detail view?
 I think the latter one should be the entrance to the thread view.


I think it's a good idea. The link I mentioned in one of the answers above is
also a convenient entry point.
The switch button I mentioned earlier is only a data filter in the trace list
query and does not need a separate table UI.


 If you have a visualization idea, draw it with any tool you like that
 supports comments, and we can discuss it there. In my mind, we should
 support visualizing the thread dump stacks across the time window, and
 support aggregating them by choosing continuous stack snapshots in the
 time window.


I think we should find someone who is better at front-end work to join the
discussion, because this depends on how the front-end UI can display it.
BTW: I can provide the code for the OAP server and sniffer, but for the
frontend we may need to look for help in the community. I hope front-end
friends can join the discussion.




The above are my answers to all the questions, and I look forward to your
reply. As the discussion goes on, the details are becoming more and more
complete, which is good.
Everyone is welcome to join the discussion; if you have any questions or good
ideas, please let me know.


Original message
From: Sheng [email protected]
To: [email protected]
Sent: Monday, December 9, 2019, 10:50
Subject: Re: A proposal for Skywalking(thread monitor)


Hi

Thanks for writing this proposal with a detailed design. My comments are
inline.

741550557 [email protected] wrote on Sunday, December 8, 2019, at 11:22 PM:

 Thanks for your reply, I have carefully read these issues you mentioned,
 and these issues mentioned are very meaningful and critical. I will give
 you technical details about the issues you mentioned below.
 I find these issues are related, so I will explain them in different
 dimensions.

 use a different protocol to transmission trace and thread-stack:
 1. add a boolean field in segment data, to record has thread monitored.
 and is good for filter monitored trace in UI.
 2. add new BootService, storage Map to record relate trace-id and
 trace-stack information.

As we already have designed limit mechanism at backend and agent
side(according to your design), also the number would not be big(10 most
likely), we just need a list to storage the trace-id(s)

 3. listen TracingContextListener#afterFinished if the current segment has
 thread monitored, mark current trace-id don’t need to monitor anymore.
 (Cause if for-each the step 2 map, the remove operation will fail and throw
 exception).
 4. when thread-monitor main thread running, It will for-each step 2 map
 and check is it don’t need monitor anymore, I will put data into new data
 carrier.
 5. generate new thread-monitor gRPC protocol to send data from the data
 carrier.

The agent side design seems pretty good.

 the server receives thread-stack logic:
 1. storage thread-stack information and trace-id/segment-id relations on a
 different table.
 2. check thread-monitor is need to be stop on receiving data or schedule.

Could you explain the (2), what do you mean `stop`? I think if your
sampling mechanism should include the sampling duration.

 reduce CPU and memory in sniffer:
 1. through the configuration of thread monitoring in the UI, you can
 configure the performance loss. For example, set the monitoring level: fast
 monitoring (100ms), medium speed monitoring (500ms), slow speed monitoring
 (1000ms).

The sampling period depends on how you are going to visualize it.

 2. add new integer field on per thread-stack, if current thread-stack last
 element same as last time, don’t need storage, just increment it. I think
 it will save a lot of memory space.

Highly doubt about this, reduce the memory, maybe, only reduce if the codes
are running the loop or facing lock issue. But if it is neither of these
two, they are different.
Also, please consider the CPU cost of the comparison of the stack. You need
a performance benchmark to verify if you want this.

 3. create new VM args to setting thread-monitor pool size, It dependence on
 user, maybe default 3? (this can be discussed later)

I think UI limit is enough. 3 seems good to me.

 4. limit thread-stack-element size to 100, I think it can resolve most of
 the scenes already. It also can create a new VM args if need.

 multiple sampling methods can choose :(just my current thoughts, can add
 more)
 1. base on current client SamplingService, extra a new factor holder to
 increment, and reset on schedule.

Yours may be a little more complex than the current SamplingService, right?
Based on the next rule.

 2. `first 5 traces of this endpoint in the next 5 mins`, it a good idea. My
 understanding is that within a few minutes, each instance can send a
 specified number of traces.

The trace number and time window should be configurable, that is I mean
more complex. In the current SamplingService, only n traces per 3 seconds.
But here, it is a dynamic rule.

 UI settings and sniffer perception:
 1. create a new button on the dashboard page, It can create or stop a
 thread-monitor. It can be dynamic load thread-monitor status when
 reselecting endpoint.

I think at least should be a level one new page called configuration or
command page, which could set up the multiple sampling rule and visualize
the existing tasks and related sampling data.

 2. sniffer creates a new scheduled task to check the current service has
 need monitor endpoint each 5 seconds. (I see current sniffer has command
 functions, feel that principle is the same as the scheduler)

Seems reasonable.

 thread-monitor on the UI:(That’s just my initial thoughts, I think there
 will have a better way to show)
 1. When switch to the trace page, I think we need to add a new switch
 button to filter thread-monitor trace.
 2. make a new thread-monitor icon on the same segment. It means it has
 thread-stack information.

We don't have separated thread monitor view table, how about we add an icon
at the segment list, and add icon at the first span of this segment in
trace detail view?
I think the latter one should be an entrance of the thread view.

 3. show on the information sidebox when the user clicks the thread-monitor
 segment(any span). create a new tab, like the log tab.

If you have some visualization idea, drawn by any tool you like supporting
comment, we could discuss it there. In my mind, we should support visualize
the thread dump stack through the time windows, and support aggregate them
by choosing the continued stack snapshots on the time window.

 They're just a description of my current implementation details for
 thread-monitor if these seem to work. I can do some time planning for these
 tasks. Sorry, my English is not very well, hope you can understand. Maybe
 these seem to have some problem, any good idea or suggestion are welcome.

Very appreciated you to lead this new direction. It is a long term task but
should be interesting. :)
Good work, carry on.

  Original message
  From: Sheng [email protected]
  To: [email protected]
  Sent: Sunday, December 8, 2019, 08:31
  Subject: Re: A proposal for Skywalking(thread monitor)

  First of all, thanks for your proposal. Thread monitoring is super
  important for application performance. So basically, I agree with this
  proposal. But for tech details, I think we need more discussion in the
  following ways
  1. Do you want to add thread status to the trace? If so, why don't
  consider this as a UI level join? Because we could know thread id in the
  trace when we create a span, right? Then we have all the thread dump(if),
  we could ask UI to query specific thread context based on timestamp and
  thread number(s).
  2. For thread dump, I don't know whether you do the performance evaluation
  for this OP. From my experiences, `get all need thread monitor segment
  every 100 milliseconds` is a very high cost in your application and agent.
  So, you may need to be careful about doing this.
  3. Endpoint related thread dump with some sampling mechanisms makes more
  sense to me. And this should be activated by UI. We should only provide a
  conditional thread dump sampling mechanism, such as `first 5 traces of
  this endpoint in the next 5 mins`.

  Jian Tan I think DaoCloud also has customized this feature in your
  internal SkyWalking. Could you share what you do?

  Sheng Wu 吴晟
  Twitter, wusheng1108

  741550557 [email protected] wrote on Sunday, December 8, 2019, at 12:14 AM:

   Hello everyone,
   I would like to share a new feature with skywalking, called “thread
   monitor”.

   Background
   When our company used skywalking to APM earlier, we found that many
   traces did not have enough span to fill up, doubting whether there were
   some third-party frameworks that we didn't enhance or programmers API
   usage errors such as java CountDown number is 3 but there are only 2
   countdowns.
   So we decide to write a new feature to monitor executing trace thread
   stack, then we can get more information on the trace, quick known what’s
   happening on that trace.

   Structure
   Agent(thread monitor) -> gRPC protocol -> OAP Server(Storage) ->
   Skywalking-Rocketbot-UI

   More detail
   OAP Server:
   1. Storage which traces need to monitor(i suggest storage on the
   endpoint, add new boolean field named needThreadMonitor)
   2. Provide GraphQL API to change endpoint monitor status.
   3. Monitor Trace parse, storage thread stack if the segment has any
   thread info.
   Skywalking-Rocketbot-UI:
   1. Add a new switch button on the dashboard, It can read or modify
   endpoint status.
   2. It will show every thread stack on click trace detail.
   Agent:
   1. setup two new BootService:
   1) find any need thread monitor endpoint in current service, start on a
   new schedule take and works on each minute.
   2) start new schedule task to get all need thread monitor segment each
   100 milliseconds, and put a new thread dump task to a global thread
   pool(fixed, count number default 3).
   2. check endpoint need thread monitor on create entry/local
   span(TracingContext#createEntry/LocalSpan). If need, It will be marked
   and put into thread monitor map.
   3. when TracingContext finishes, It will get thread has monitored, and
   send all thread stack to server.

   Finally, I don’t know it is a good idea to get more information on
   trace? If you have any good ideas or suggestions on this, please let me
   know.

   Mrpro
