Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Willy,

Thanks for your detailed mail. I will get back to you very soon on this.

Regards,
- Krishna

On Tue, Jan 31, 2017 at 12:54 AM, Willy Tarreau wrote:
> Hi Krishna,
>
> back on earth ;-)
> [...]
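The fading-window formula at the heart of Willy's explanation in this thread, measure = current + (1 - relative_position) * previous, is easy to demonstrate standalone. The sketch below is illustrative only: the names (win_ctr, win_add, win_read) and the millisecond clock parameter are invented for this example, and this is not HAProxy's actual freq_ctr code.

```c
#include <stdint.h>

/* One pseudo-sliding-window counter over a fixed period (in ms). */
#define PERIOD_MS 1000

struct win_ctr {
    uint64_t start;  /* start of the current period, in ms  */
    uint32_t curr;   /* events seen in the current period   */
    uint32_t prev;   /* events seen in the previous period  */
};

/* Rotate if <now_ms> has moved past the current period: "current"
 * overwrites "previous" and is cleared, waiting for new events. */
static void win_rotate(struct win_ctr *c, uint64_t now_ms)
{
    while (now_ms - c->start >= PERIOD_MS) {
        c->prev = c->curr;
        c->curr = 0;
        c->start += PERIOD_MS;
    }
}

static void win_add(struct win_ctr *c, uint64_t now_ms, uint32_t events)
{
    win_rotate(c, now_ms);
    c->curr += events;
}

/* measure = current + (1 - relative_position) * previous */
static double win_read(struct win_ctr *c, uint64_t now_ms)
{
    uint64_t elapsed;

    win_rotate(c, now_ms);
    elapsed = now_ms - c->start;  /* 0 <= elapsed < PERIOD_MS */
    return c->curr + c->prev * (double)(PERIOD_MS - elapsed) / PERIOD_MS;
}
```

With a 1-second period, a read half way through the current period counts the whole current bucket plus half of the previous one; a long idle gap simply rotates zeros in, so stale counts fade out on their own.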
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Krishna,

back on earth ;-)

On Tue, Jan 03, 2017 at 03:07:26PM +0530, Krishna Kumar (Engineering) wrote:
> I explored your suggestion of "hard-coded periods", and have some
> problems: code complexity seems to be very high at updates (as well
> as retrievals possibly); and I may not be able to get accurate results.
> E.g. I have data for 1, 4, 16 seconds; and at 18 seconds, a request is
> made for retrieval of the last 16 seconds (or 1, 4, 16). At this time I
> have values for the last 18 seconds, not 16 seconds. I explored using
> timers to cascade (will not work as it may run into races with the
> setters, and also adds too much overhead) vs doing this synchronously
> when the event happens. Both are complicated and have the above issue
> of not being able to get accurate information depending on when the
> request is made.

The principle is quite simple, but first requires a small explanation of
how our frequency counters manage to report such accurate values.

We're counting events that happen over a period, each represented by a
cross on the time line below:

    X  X  XX  XX         X   X X   X  X
  --|-----------------------------------|--> t

In order to measure an approximately accurate frequency without storing
too much data, we report an average over a period. The problem is, if we
only report a past average, we get slowly changing data, and in particular
it would not be usable to detect quick changes such as a high rate
happening over the last second. Thus we always keep two measures: the one
of the previous period, and the one of the current period.

    X  X  XX  XX         X   X X   X  X
  --|-----------------|-----------------|--> t
    <--- previous --->|<--- current --->
         count=6            count=5

And using this we implement a pseudo sliding window. With a true sliding
window, you would count events between two points that are always exactly
the same distance apart, which requires memorizing all of them. With this
pseudo sliding window, instead, we make use of the fact that events follow
an approximately homogeneous distribution over each period and have
approximately the same frequency over both periods. Thus the mechanism
becomes more of a "fading" window in fact: we consider one part of the
previous window that we sum with the current one. The closer the current
date is to the end of the current window, the less we count on the
previous window. Below, I've represented 5 different sliding windows with
the part of the previous window that we still consider valid marked with
stars, and the ratio of the old period at the end of the line:

    X  X  XX  XX         X   X X   X  X
  --|-----------------|-----------------|--> t
       |**************----|                   80%
           |**********--------|               60%
               |******------------|           40%
                   |**----------------|       20%
                      |-----------------|      0%

Thus the measure exactly is: current + (1 - relative_position) * previous.
Once the period is over, the "current" value overwrites the "previous"
one, and is cleared again, waiting for new events. This is done during
insertion, and possibly during value retrieval if it's noticed that the
period has switched since last time.

In fact this mechanism can be repeated over multiple periods if needed;
the principle must always be the same:

    X  X  XX  XX         X   X X   X  X
  --|-------|-------|-------|-------|-------|--> t
      old     N-3     N-2     N-1    current

The "old" period is the one on which we apply the fading effect. The
current period is the one being entirely accounted (since it's being
filled as we're looking at it), so here you could always keep a few past
values (rotating copies of "current") and keep the last one for the fade
out. In the simplified case above we just don't have N-1..N-M, we only
have "current" and "old".

Thus as you can see here, the "current" period is the only dynamic one;
it is the exact sum of events over the last period. But now what if we
wanted to implement this with multiple levels? It's very easy: the
"current" period being the sum of events, it simply is the sum of the
lower layers when you cascade multiple counters. Thus it's the lower
layer's sum of [current] + [old*ratio] + [n-1] + [n-2] + ... + [n-m].

So let's have an example with periods of 1s, 2s, 4s and 8s:

              previous          current
  [8s]  --|---------------|---------------|--> t
  [4s]                  --|-------|-------|-->
  [2s]                          --|---|---|-->
  [1s]                              --|-|-|-->

I guess you're seeing a pattern.
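The multi-period form described above, [current] + [old*ratio] + [n-1] + ... + [n-m], can be sketched with a small ring of completed period counts. Again this is a hand-written illustration with invented names (multi_ctr, NHIST), not the code from this patch or from HAProxy:

```c
#include <stdint.h>

#define PERIOD_MS 1000
#define NHIST     4   /* keep N-1..N-4; the oldest bucket fades */

/* Pseudo-sliding window over several past periods: one dynamic
 * "current" bucket plus a ring of completed period counts. */
struct multi_ctr {
    uint64_t start;        /* start of the current period (ms)       */
    uint32_t curr;         /* events in the current period           */
    uint32_t hist[NHIST];  /* hist[0] = N-1 ... hist[NHIST-1] = old  */
};

static void multi_rotate(struct multi_ctr *c, uint64_t now_ms)
{
    int i;

    while (now_ms - c->start >= PERIOD_MS) {
        for (i = NHIST - 1; i > 0; i--)
            c->hist[i] = c->hist[i - 1];  /* age every bucket */
        c->hist[0] = c->curr;             /* current becomes N-1 */
        c->curr = 0;
        c->start += PERIOD_MS;
    }
}

static void multi_add(struct multi_ctr *c, uint64_t now_ms, uint32_t events)
{
    multi_rotate(c, now_ms);
    c->curr += events;
}

/* Events over the last NHIST+1 periods (fading window):
 * current + n-1 + ... + n-(NHIST-1) + old * remaining/period. */
static double multi_read(struct multi_ctr *c, uint64_t now_ms)
{
    uint64_t elapsed;
    double sum;
    int i;

    multi_rotate(c, now_ms);
    elapsed = now_ms - c->start;
    sum = c->curr;
    for (i = 0; i < NHIST - 1; i++)
        sum += c->hist[i];  /* complete periods count fully */
    sum += c->hist[NHIST - 1] * (double)(PERIOD_MS - elapsed) / PERIOD_MS;
    return sum;
}
```

Only the oldest bucket is faded; the intermediate buckets are complete periods and are summed as-is, and rotation happens lazily on insert or read, exactly as in the two-bucket case.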
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Willy,

Sorry to bother you again, but here is a quick note in case you have
forgotten this patch/email thread.

Regards,
- Krishna

On Thu, Jan 5, 2017 at 12:53 PM, Willy Tarreau wrote:
> Hi Krishna,
> [...]
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Krishna,

On Thu, Jan 05, 2017 at 11:15:46AM +0530, Krishna Kumar (Engineering) wrote:
> Hi Willy,
>
> If required, I can try to make the "hard-coded periods" changes too, but
> want to hear your opinion, as the code gets very complicated and, IMHO,
> may not give correct results depending on when the request is made. All
> the other changes are doable.
>
> Hoping to hear from you on this topic, please let me know your opinion.

I've started to think about it but had to stop; I'm busy dealing with
some painful bugs, so it takes me more time to review code additions. I'm
intentionally keeping your mail marked unread in order to get back to it
ASAP.

Thanks,
Willy
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Willy,

If required, I can try to make the "hard-coded periods" changes too, but
want to hear your opinion, as the code gets very complicated and, IMHO,
may not give correct results depending on when the request is made. All
the other changes are doable.

Hoping to hear from you on this topic, please let me know your opinion.

Regards,
- Krishna

On Tue, Jan 3, 2017 at 3:07 PM, Krishna Kumar (Engineering)
<krishna...@flipkart.com> wrote:
> Hi Willy,
> [...]
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Willy,

Sorry for the late response as I was out during the year end, and thanks
once again for your review comments.

I explored your suggestion of "hard-coded periods", and have some
problems: code complexity seems to be very high at updates (and possibly
at retrievals too), and I may not be able to get accurate results. E.g. I
have data for 1, 4, 16 seconds, and at 18 seconds a request is made for
retrieval of the last 16 seconds (or 1, 4, 16). At this time I have
values for the last 18 seconds, not 16 seconds. I explored using timers
to cascade (this will not work, as it may run into races with the setters
and also adds too much overhead) vs doing it synchronously when the event
happens. Both are complicated, and have the above issue of not being able
to get accurate information depending on when the request is made.

To implement your suggestion of, say, histograms, the retrieval code can
calculate the 4 values (1, 4, 16, and 64 seconds) by averaging across the
correct intervals. In this case the new CLI command is not required, and
by default it prints all 4 values. Would this work in your opinion?

Ack to all your other suggestions; I will incorporate those changes and
re-send. Please let me know if this sounds reasonable.

Thanks,
- Krishna

On Thu, Dec 22, 2016 at 4:23 PM, Willy Tarreau wrote:
> Hi Krishna,
> [...]
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Thanks Willy for your detailed review, especially some design-related
points that I was not aware of. I will go through these and respond
accordingly.

Regards,
- Krishna

On Thu, Dec 22, 2016 at 4:23 PM, Willy Tarreau wrote:
> Hi Krishna,
> [...]
Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics
Hi Krishna,

On Thu, Dec 22, 2016 at 09:41:49AM +0530, Krishna Kumar (Engineering) wrote:
> We have found that the current mechanism of qtime, ctime, rtime, and
> ttime based on the last 1024 requests is not the most suitable to
> debug/visualize latency issues with servers, especially if they happen
> to last a very short time.
> [...]
> Please review, any feedback on the code/usability/extensibility is very
> much appreciated.

First, thanks for this work. I have several concerns and comments about
it, however.

The first one is that the amount of storage is overkill if the output can
only emit an average over a few periods. I mean, the purpose of stats is
to emit what we know internally. Some people might want to see histograms,
and while we have everything internally with your patch, it's not possible
to produce them.

For this reason I think we should proceed differently and always emit
these stats over a few hard-coded periods. You proved that they don't take
that much space, and I think it would probably make sense to emit them
over a small series of powers of 4 seconds: 1s, 4s, 16s, 64s. That's quite
cheap to store and easy to compute, because it's no longer necessary to
store all individual values; you can cascade them while filling a bucket.

And if you go down that route, then there's no need anymore to add yet
another setting in the configuration; it will be done by default. Another
point regarding this "stats-period" setting is that you currently do not
support the "defaults" section, while I guess almost all users will want
to have it there. That's another reason for keeping it hard-coded.

Now some general comments and guidance on the code itself:

> diff --git a/include/proto/stats.h b/include/proto/stats.h
> index ac893b8..f33ceb1 100644
> --- a/include/proto/stats.h
> +++ b/include/proto/stats.h
> @@ -95,7 +95,8 @@ int stats_fill_fe_stats(struct proxy *px, struct field
>  *stats, int len);
>  int stats_fill_li_stats(struct proxy *px, struct listener *l, int flags,
>                          struct field *stats, int len);
>  int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
> -                        struct field *stats, int len);
> +                        struct field *stats, int len,
> +                        struct stream_interface *si);

Please avoid re-introducing the stream_interface in stats; we're trying
hard to abstract it. From what I'm seeing, you only need it to know if
there's a different period. Better to pass the period directly as an
argument from the caller.

> diff --git a/include/types/counters.h b/include/types/counters.h
> index 06ce617..a33e01f 100644
> --- a/include/types/counters.h
> +++ b/include/types/counters.h
> @@ -105,6 +105,41 @@ struct be_counters {
>          } p;    /* protocol-specific stats */
>  };
>
> +/*
> + * Struct time-stat-counters:
> + *
> + * Implement last-n seconds based counters for backend/server's q/c/r/t
> + * times. Total memory consumption on x86_64 to maintain all 4 event time
> + * values per backend/server is: 16 + #seconds * 56 bytes.
> + *     for  1 second:    72 bytes
> + *     for 16 seconds:  912 bytes
> + *     for 32 seconds: 1808 bytes
> + *     for 64 seconds: 3600 bytes
> + */
> +struct __tstat_counters {

Avoid struct names using '__' as a prefix.
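Willy's suggestion of hard-coded power-of-4 periods ("you can cascade them while filling a bucket") could look roughly like this for the patch's averaged times. Everything below is an invented sketch (lat_bucket, lat_tick_1s, the snapshot fields), not the patch's code: each level keeps a running sum and count, and pushes its totals into the next wider level each time it completes, so only 4 small buckets are stored instead of one entry per second.

```c
#include <stdint.h>

#define NLEVELS 4   /* windows of 1s, 4s, 16s, 64s (powers of 4) */

struct lat_bucket {
    uint64_t sum_ms;   /* accumulated event time in the open window */
    uint32_t cnt;      /* accumulated event count                   */
    uint64_t done_sum; /* last completed window, shown to readers   */
    uint32_t done_cnt;
    uint32_t ticks;    /* completed sub-windows of the level below  */
};

struct lat_levels {
    struct lat_bucket lv[NLEVELS];
};

/* Record one event time (e.g. a response time in ms). */
static void lat_add(struct lat_levels *l, uint32_t ms)
{
    l->lv[0].sum_ms += ms;
    l->lv[0].cnt++;
}

/* Called once per second: close the 1s bucket and cascade upward.
 * A level closes only after 4 completed windows of the level below. */
static void lat_tick_1s(struct lat_levels *l)
{
    int i;

    for (i = 0; i < NLEVELS; i++) {
        struct lat_bucket *b = &l->lv[i];

        b->done_sum = b->sum_ms;   /* snapshot for readers */
        b->done_cnt = b->cnt;
        if (i + 1 < NLEVELS) {     /* feed the wider window */
            l->lv[i + 1].sum_ms += b->sum_ms;
            l->lv[i + 1].cnt    += b->cnt;
        }
        b->sum_ms = 0;
        b->cnt = 0;

        /* stop unless the parent window just became full too */
        if (i + 1 >= NLEVELS || ++l->lv[i + 1].ticks < 4)
            break;
        l->lv[i + 1].ticks = 0;
    }
}

/* Average time over the last completed window of level <i>, in ms. */
static double lat_avg(const struct lat_levels *l, int i)
{
    const struct lat_bucket *b = &l->lv[i];

    return b->done_cnt ? (double)b->done_sum / b->done_cnt : 0.0;
}
```

A reader then gets avg = done_sum / done_cnt per window. A production version would refresh the wider windows continuously using the fading-window trick described earlier in the thread, rather than only at each window boundary.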
[PATCH] MEDIUM/RFC: Implement time-based server latency metrics
We have found that the current mechanism of qtime, ctime, rtime, and
ttime, based on the last 1024 requests, is not the most suitable to debug
or visualize latency issues with servers, especially if those issues last
a very short time. For live dashboards showing server timings, we found an
additional last-'n'-seconds metric useful. The logs could also be parsed
to derive these values, but that suffers from delays at high volume, and
requires higher processing power and enabling logs.

The 'last-n' seconds metrics per server/backend can be configured as
follows in the HAProxy configuration file:

    backend backend-1
        stats-period 32
        ...

To retrieve these stats at the CLI (in addition to existing metrics), run:

    echo show stat-duration time 3 | socat /var/run/admin.sock stdio

These are also available on the GUI.

The justification for this patch is:
1. It allows capturing spikes for a server during a short period. This
   helps with dashboards that show server response times every few seconds
   (e.g. every 1 second), so as to be able to chart them across timelines.
2. It makes it possible to get an average across different time intervals;
   e.g. the configuration file may specify to save the last 32 seconds,
   but the CLI interface can request an average across any interval up to
   32 seconds. E.g. the following command prints the existing metrics
   appended by the time-based ones for the last 1 second:
       echo show stat-duration time 1 | socat /var/run/admin.sock stdio
   Running the following existing command appends the time-based metric
   values based on the time period configured in the configuration file
   per backend/server:
       echo show stat | socat /var/run/admin.sock stdio
3. It adds a per-backend option for configuring the server stats' time
   interval, with no API breakage to stats (new metrics are added at the
   end of the line).

Please review; any feedback on the code/usability/extensibility is very
much appreciated.
Thanks,
- Krishna Kumar

From 55d8966ee185031814aa97368adc262c78540705 Mon Sep 17 00:00:00 2001
From: Krishna Kumar
Date: Thu, 22 Dec 2016 09:28:02 +0530
Subject: [PATCH] Implement a last-'n' seconds based counters for
 backend/server's queue, connect, response and session times.

---
 doc/configuration.txt     |  29 ++
 include/common/cfgparse.h |   3 +
 include/proto/stats.h     |   3 +-
 include/types/applet.h    |   1 +
 include/types/counters.h  |  35 +++
 include/types/proxy.h     |   4 +
 include/types/server.h    |   1 +
 include/types/stats.h     |   7 ++
 src/cfgparse.c            |  51 ++-
 src/haproxy.c             |   6 ++
 src/hlua_fcn.c            |   2 +-
 src/proxy.c               |  39 ++++
 src/server.c              |   7 ++
 src/stats.c               | 226 +-
 src/stream.c              |   7 ++
 15 files changed, 414 insertions(+), 7 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index f654c8e..dc0ddae 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -8078,6 +8078,35 @@ stats uri
 See also : "stats auth", "stats enable", "stats realm"
 
+stats-period <nsecs>
+  Enable time based average queue, connect, response and session times for
+  this backend and all servers contained in it.
+  May be used in sections : defaults | frontend | listen | backend
+                               no    |    no    |   no   |   yes
+  Arguments :
+    <nsecs> is the maximum seconds for which to maintain the four
+            averages (queue, connect, response and session times). 'nsecs'
+            is a non-negative number, and should be a power of 2, with the
+            maximum value being 128. To retrieve the averages, use one of
+            the following methods:
+            - echo show stat | socat /var/run/admin.sock stdio
+              Shows existing statistics, appended by 4 metrics
+              corresponding to average queue, connect, response and
+              session times. These four times are averaged across the
+              last 'nsecs' seconds.
+            - echo show stat-duration time <n> | socat /var/run/admin.sock stdio
+              Shows existing statistics, appended by 4 metrics
+              corresponding to average queue, connect, response and
+              session times. These four times are averaged across the
+              last 'n' seconds.
+
+  Note : When the backend does not specify the 'stats-period' option,
+         these metrics are zeroed out.
+
+  Example :
+        # Preserve last 32 seconds of average queue, connect, response and
+        # session times for a backend and all servers in that backend:
+        backend backend-1
+            stats-period 32
+            server a.b.c.d a.b.c.d:80
+            server a.b.c.e a.b.c.e:80
 
 stick match <pattern> [table <table>] [{if | unless} <cond>]
   Define a request pattern matching condition to stick a user to a server

diff --git a/include/common/cfgparse.h b/include/common/cfgparse.h
index fd04b14..101cad9 100644
--- a/include/common/cfgparse.h
+++ b/include/common/cfgparse.h
@@ -110,6 +110,9 @@ static inline int warnifnotcap(struct proxy *proxy, int cap, const char *