Re: A proposal for Skywalking(thread monitor)

741550557 Mon, 09 Dec 2019 05:54:42 -0800

Sorry, I found a formatting problem and I re-edited the content.

Thanks for your reply. The issues you mentioned are critical and meaningful.
Here I will answer each of them. Sorry, I'm not good at comment mode, so I
quote each of your points and answer directly below it.

> As we already have designed a limit mechanism at the backend and agent
> side (according to your design), and the number would not be big (10 most
> likely), we just need a list to store the trace-id(s).

If we only keep a list of trace-id(s), how can I map a trace to its thread? I
hope to use a map so that I can quickly find the thread info from a trace-id.
How would I get the thread-stack information your way? Could you please
elaborate?
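
To make the map idea concrete, here is a minimal sketch of what I have in
mind. The class name `MonitoredThreadRegistry` and its methods are hypothetical
and are not part of the existing agent API; only `ConcurrentHashMap` and
`Thread#getStackTrace` are standard JDK calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry for illustration only; not part of the SkyWalking agent.
class MonitoredThreadRegistry {
    // trace-id -> thread that is executing the trace
    private final Map<String, Thread> threadsByTraceId = new ConcurrentHashMap<>();

    // Called on the business thread when a monitored segment starts.
    void register(String traceId) {
        threadsByTraceId.put(traceId, Thread.currentThread());
    }

    // Called when the segment finishes.
    void unregister(String traceId) {
        threadsByTraceId.remove(traceId);
    }

    // Returns the current stack of the thread running the trace,
    // or an empty array when the trace-id is not registered.
    StackTraceElement[] dump(String traceId) {
        Thread thread = threadsByTraceId.get(traceId);
        return thread == null ? new StackTraceElement[0] : thread.getStackTrace();
    }
}
```

With a `ConcurrentHashMap`, removing an entry while the monitor thread is
iterating does not throw `ConcurrentModificationException`, so the finished
trace could be removed directly instead of only being marked.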

> Could you explain (2)? What do you mean by `stop`? I think your sampling
> mechanism should include the sampling duration.

As for the communication between the sniffer and the OAP server, I hope the
sniffer only needs to fetch the thread-monitor tasks that need to be monitored
at that time. Monitoring can be terminated by either the sniffer or the OAP
server.
If termination relied only on an OAP server notification, it would be more
complicated, because the OAP server would need to record that the sniffer has
received the current command, and the sniffer is not stable. For example, if
the sniffer shuts down while receiving the command, no thread information has
been collected at that point.

I think that having the OAP server actively calculate the termination makes
the monitoring more controllable. Of course, the client can also actively
report the end.
I think it's necessary to provide a protection mechanism for the sniffer side,
so that monitoring can be released quickly during a business peak period or
when the probe suddenly occupies a lot of CPU / memory resources. Therefore, a
stop-monitoring function can be provided in the UI, so that the sniffer can
recover.
A sampling duration is required, but only as the default termination condition
of a thread-monitor task.
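
A rough sketch of these termination rules, assuming a task object that carries
a start time and a duration. `ThreadMonitorTask` and its fields are names I
made up for illustration; nothing like this exists in the code base yet.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical task model; field and method names are illustrative only.
class ThreadMonitorTask {
    private final long startMillis;
    private final long durationMillis;
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    ThreadMonitorTask(long startMillis, long durationMillis) {
        this.startMillis = startMillis;
        this.durationMillis = durationMillis;
    }

    // Explicit stop, e.g. triggered from the UI or by the sniffer protecting itself.
    void stop() {
        stopped.set(true);
    }

    // Active only inside [start, start + duration) and only until someone stops it.
    boolean isActive(long nowMillis) {
        return !stopped.get()
            && nowMillis >= startMillis
            && nowMillis < startMillis + durationMillis;
    }
}
```

Both sides can evaluate `isActive` on their own, so the duration acts as the
default end of the task even if a stop notification never reaches the sniffer.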

> The sampling period depends on how you are going to visualize it.

Yes, I agree. I hope the UI can provide a select/input so that the trace count
and time window are configurable. Of course, this is just my current idea; if
there are other plans, I will adopt them.

> Highly doubt about this. Reduce the memory, maybe, but only if the code is
> running a loop or facing a lock issue. If it is neither of these two, the
> stacks are different.
> Also, please consider the CPU cost of comparing the stacks. You need a
> performance benchmark to verify this if you want it.

I didn't understand the first sentence. In my personal experience, most cases
are blocked on a lock (socket/local) or running a loop. I cannot think of any
other cases.
For the second sentence, I think I can add a thread-stack-element field to
store the top-level element of the last stack information. When getting the
stack information next time, I can check whether the current top-level element
is the same as that field.
I do this mainly to keep duplicate thread-stack information from taking up too
much memory; it is a way of optimizing memory space. We can consider removing
it, or do you have a better memory-saving solution? After all, memory and CPU
resources are very valuable in the sniffer.
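
The comparison I describe could look roughly like this. It is only a sketch:
`StackSnapshotBuffer` is a hypothetical name, and the extra depth check is my
own assumption on top of the top-element comparison. `Thread#getStackTrace`
puts the most recent frame at index 0, so the top-level element is `stack[0]`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot buffer; not the agent's actual data model.
class StackSnapshotBuffer {
    // One stored stack plus how many consecutive samples matched it.
    static final class Entry {
        final StackTraceElement[] stack;
        int repeatCount = 1;

        Entry(StackTraceElement[] stack) {
            this.stack = stack;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Adds a sample; if it looks like the previous one, only the counter grows.
    void add(StackTraceElement[] stack) {
        if (!entries.isEmpty()) {
            Entry last = entries.get(entries.size() - 1);
            if (looksSame(last.stack, stack)) {
                last.repeatCount++;
                return;
            }
        }
        entries.add(new Entry(stack));
    }

    // Cheap check: same depth and same top frame. Deeper frames are not compared.
    private static boolean looksSame(StackTraceElement[] a, StackTraceElement[] b) {
        if (a.length != b.length) {
            return false;
        }
        return a.length == 0 || a[0].equals(b[0]);
    }

    List<Entry> entries() {
        return entries;
    }
}
```

The check costs one `equals` call per sample, but it can merge two stacks that
differ below the top frame, so the benchmark you asked for is still needed
before deciding whether to keep it.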

> The trace number and time window should be configurable; that is what I
> mean by more complex. In the current SamplingService, it is only n traces
> per 3 seconds. But here, it is a dynamic rule.

I expect that a specific trace count and time window can be configured at the
UI level, as I said above.
For SamplingService, my previous tech design was not rigorous enough, and there
were indeed problems.
Maybe we need to extend a new SamplingService that builds a mapping from
endpoint-id to an AtomicInteger.
For `first 5 traces of this endpoint in the next 5 mins`, we just need to
increment it.
For sampling, maybe use another scheduled task to reset the AtomicInteger value.
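
A minimal sketch of that mapping, under the assumption that the endpoint-id is
available as a String key. `EndpointThreadSampler` is a hypothetical class and
is not the existing SamplingService.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sampler for illustration; the existing SamplingService is different.
class EndpointThreadSampler {
    private final int maxTracesPerWindow;
    private final Map<String, AtomicInteger> countersByEndpoint = new ConcurrentHashMap<>();

    EndpointThreadSampler(int maxTracesPerWindow) {
        this.maxTracesPerWindow = maxTracesPerWindow;
    }

    // Returns true for the first maxTracesPerWindow calls per endpoint in the current window.
    boolean trySample(String endpointId) {
        AtomicInteger counter =
            countersByEndpoint.computeIfAbsent(endpointId, id -> new AtomicInteger());
        // Increment only while below the limit, so the counter never grows past it.
        int previous = counter.getAndUpdate(c -> c < maxTracesPerWindow ? c + 1 : c);
        return previous < maxTracesPerWindow;
    }

    // Meant to be called by a scheduled task at the end of each time window.
    void reset() {
        countersByEndpoint.values().forEach(counter -> counter.set(0));
    }
}
```

The reset could be driven by a `ScheduledExecutorService`, for example
`scheduleAtFixedRate(sampler::reset, 5, 5, TimeUnit.MINUTES)`, with the count
and the window length taken from the UI configuration.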

> I think there should at least be a new level-one page called configuration
> or command page, which could set up multiple sampling rules and visualize
> the existing tasks and related sampling data.

I think it's necessary to add a new page to configure thread-monitor tasks,
and the specific UI display should be designed in detail.
For example, what I expect is similar to the trace page: the left side
displays the configuration, and the right side displays the related trace
list. When a trace is clicked, it links to the trace page and opens the
sidebox.
I'm not good at this. Do you have any good plans?
I also feel that the two of us have a different understanding of the
configuration object. I think it is more of a task than a command. I don't know
which way is better.
I just thought of a problem. Some real problems are often triggered at a
specific period, such as a fixed business peak period, and we cannot guarantee
that the user will be operating the UI at that time.
So should a task mechanism be adopted to ensure that monitoring can run at a
specific period?

> We don't have a separate thread monitor view table. How about we add an
> icon in the segment list, and add an icon at the first span of this segment
> in the trace detail view?
> I think the latter one should be an entrance to the thread view.

I think it's a good idea. The link I mentioned in one of the answers above is
also a convenient entry point.
The switch button I mentioned earlier is only a data filtering item in the
trace list query and does not need a separate table UI.

> If you have some visualization idea, draw it with any tool you like that
> supports comments, and we could discuss it there. In my mind, we should
> support visualizing the thread dump stacks across the time window, and
> support aggregating them by choosing continuous stack snapshots in the time
> window.

I think we should find a front-end developer to join the discussion, because
this depends on how the front-end UI can display it.
BTW: I can provide code for the OAP server and sniffer, but for the frontend we
may need to look for help in the community. I hope that front-end friends can
participate in the discussion.

The above is my answer to all the questions, and I look forward to your reply
at any time. As the discussion goes on, the details are becoming more and more
complete. This is good.
Everyone is welcome to join the discussion. If you have any questions or good
ideas, please let me know.

Original message
From: [email protected]
To: [email protected]
Sent: Monday, December 9, 2019, 21:42
Subject: Re: A proposal for Skywalking(thread monitor)

Original message
From: Sheng [email protected]
To: [email protected]
Sent: Monday, December 9, 2019, 10:50
Subject: Re: A proposal for Skywalking(thread monitor)

Hi

Thanks for writing this proposal with a detailed design. My comments are
inline.

741550557 [email protected] wrote on Sunday, December 8, 2019 at 11:22 PM:

> Thanks for your reply. I have carefully read the issues you mentioned, and
> they are very meaningful and critical. I will give you technical details
> about them below. I find these issues are related, so I will explain them
> in different dimensions.
>
> Use a different protocol to transmit trace and thread-stack:
> 1. Add a boolean field in the segment data to record whether it has been
>    thread monitored. It is also good for filtering monitored traces in the
>    UI.
> 2. Add a new BootService that stores a Map relating trace-id to trace-stack
>    information.

As we already have designed a limit mechanism at the backend and agent side
(according to your design), and the number would not be big (10 most likely),
we just need a list to store the trace-id(s).

> 3. Listen to TracingContextListener#afterFinished: if the current segment
>    has been thread monitored, mark the current trace-id as no longer
>    needing monitoring. (Because if we are iterating the step 2 map, the
>    remove operation will fail and throw an exception.)
> 4. When the thread-monitor main thread runs, it iterates the step 2 map
>    and, for entries that no longer need monitoring, puts the data into a
>    new data carrier.
> 5. Generate a new thread-monitor gRPC protocol to send data from the data
>    carrier.

The agent side design seems pretty good.

> The server receives thread-stack logic:
> 1. Store thread-stack information and trace-id/segment-id relations in a
>    different table.
> 2. Check whether the thread-monitor needs to be stopped, on receiving data
>    or on a schedule.

Could you explain (2)? What do you mean by `stop`? I think your sampling
mechanism should include the sampling duration.

> Reduce CPU and memory in the sniffer:
> 1. Through the thread-monitoring configuration in the UI, you can configure
>    the performance loss. For example, set the monitoring level: fast
>    monitoring (100ms), medium speed monitoring (500ms), slow speed
>    monitoring (1000ms).

The sampling period depends on how you are going to visualize it.

> 2. Add a new integer field per thread-stack. If the last element of the
>    current thread-stack is the same as last time, there is no need to store
>    it; just increment the field. I think it will save a lot of memory space.

Highly doubt about this. Reduce the memory, maybe, but only if the code is
running a loop or facing a lock issue. If it is neither of these two, the
stacks are different.
Also, please consider the CPU cost of comparing the stacks. You need a
performance benchmark to verify this if you want it.

> 3. Create a new VM arg to set the thread-monitor pool size. It depends on
>    the user, maybe default 3? (This can be discussed later.)

I think the UI limit is enough. 3 seems good to me.

> 4. Limit the thread-stack-element size to 100. I think it can already cover
>    most scenarios. A new VM arg can also be created if needed.
>
> Multiple sampling methods to choose from (just my current thoughts, more
> can be added):
> 1. Based on the current client SamplingService, add an extra factor holder
>    to increment, and reset it on a schedule.

Yours may be a little more complex than the current SamplingService, right?
Based on the next rule.

> 2. `first 5 traces of this endpoint in the next 5 mins` is a good idea. My
>    understanding is that within a few minutes, each instance can send a
>    specified number of traces.

The trace number and time window should be configurable; that is what I mean
by more complex. In the current SamplingService, it is only n traces per 3
seconds. But here, it is a dynamic rule.

> UI settings and sniffer perception:
> 1. Create a new button on the dashboard page that can create or stop a
>    thread-monitor. The thread-monitor status can be loaded dynamically when
>    reselecting the endpoint.

I think there should at least be a new level-one page called configuration or
command page, which could set up multiple sampling rules and visualize the
existing tasks and related sampling data.

> 2. The sniffer creates a new scheduled task to check every 5 seconds
>    whether the current service has endpoints that need monitoring. (I see
>    the current sniffer has command functions; I feel the principle is the
>    same as the scheduler.)

Seems reasonable.

> Thread-monitor on the UI (that's just my initial thoughts; I think there
> will be a better way to show it):
> 1. When switching to the trace page, I think we need to add a new switch
>    button to filter thread-monitor traces.
> 2. Make a new thread-monitor icon on the same segment. It means it has
>    thread-stack information.

We don't have a separate thread monitor view table. How about we add an icon
in the segment list, and add an icon at the first span of this segment in the
trace detail view? I think the latter one should be an entrance to the thread
view.

> 3. Show it in the information sidebox when the user clicks the
>    thread-monitor segment (any span). Create a new tab, like the log tab.

If you have some visualization idea, draw it with any tool you like that
supports comments, and we could discuss it there. In my mind, we should
support visualizing the thread dump stacks across the time window, and support
aggregating them by choosing continuous stack snapshots in the time window.

> This is just a description of my current implementation details for
> thread-monitor. If these seem to work, I can do some time planning for
> these tasks.
>
> Sorry, my English is not very good; I hope you can understand. Maybe some
> of this has problems; any good ideas or suggestions are welcome.

I really appreciate you leading this new direction. It is a long term task but
should be interesting. :)
Good work, carry on.

Original message
From: Sheng [email protected]
To: [email protected]
Sent: Sunday, December 8, 2019, 08:31
Subject: Re: A proposal for Skywalking(thread monitor)

First of all, thanks for your proposal. Thread monitoring is super important
for application performance. So basically, I agree with this proposal. But for
tech details, I think we need more discussion in the following ways:

1. Do you want to add thread status to the trace? If so, why not consider
   this as a UI level join? Because we could know the thread id in the trace
   when we create a span, right? Then, if we have all the thread dumps, we
   could ask the UI to query a specific thread context based on timestamp and
   thread number(s).
2. For thread dump, I don't know whether you did a performance evaluation for
   this OP. From my experience, `get all need thread monitor segment every
   100 milliseconds` is a very high cost for your application and agent. So,
   you may need to be careful about doing this.
3. Endpoint related thread dump with some sampling mechanisms makes more
   sense to me. And this should be activated by the UI. We should only
   provide a conditional thread dump sampling mechanism, such as `first 5
   traces of this endpoint in the next 5 mins`.

Jian Tan, I think DaoCloud also has customized this feature in your internal
SkyWalking. Could you share what you do?

Sheng Wu 吴晟
Twitter, wusheng1108

741550557 [email protected] wrote on Sunday, December 8, 2019 at 12:14 AM:

> Hello everyone,
>
> I would like to share a new feature for SkyWalking, called "thread
> monitor".
>
> Background
> When our company used SkyWalking for APM earlier, we found that many traces
> did not have enough spans to fill up the time, and we wondered whether
> there were some third-party frameworks that we didn't enhance, or API usage
> errors by programmers, such as a Java CountDown whose count is 3 when there
> are only 2 countdowns.
> So we decided to write a new feature to monitor the thread stack of an
> executing trace, so that we can get more information on the trace and
> quickly know what's happening in that trace.
>
> Structure
> Agent(thread monitor) -- gRPC protocol -- OAP Server(Storage) --
> Skywalking-Rocketbot-UI
>
> More detail
> OAP Server:
> 1. Store which traces need to be monitored (I suggest storing it on the
>    endpoint, adding a new boolean field named needThreadMonitor).
> 2. Provide a GraphQL API to change the endpoint monitor status.
> 3. When parsing a monitored trace, store the thread stack if the segment
>    has any thread info.
>
> Skywalking-Rocketbot-UI:
> 1. Add a new switch button on the dashboard. It can read or modify the
>    endpoint status.
> 2. It will show every thread stack when clicking the trace detail.
>
> Agent:
> 1. Set up two new BootService(s):
>    1) Find any endpoint in the current service that needs thread
>       monitoring. It starts as a new scheduled task and runs every minute.
>    2) Start a new scheduled task to get all segments that need thread
>       monitoring every 100 milliseconds, and put a new thread dump task
>       into a global thread pool (fixed, count number default 3).
> 2. Check whether the endpoint needs thread monitoring when creating an
>    entry/local span (TracingContext#createEntry/LocalSpan). If so, it will
>    be marked and put into the thread monitor map.
> 3. When the TracingContext finishes, it will get the threads that have been
>    monitored and send all thread stacks to the server.
>
> Finally, I don't know whether it is a good idea to get more information on
> the trace this way. If you have any good ideas or suggestions on this,
> please let me know.
>
> Mrpro
