[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510411#comment-17510411 ]

Joerg Hoh commented on SLING-11192:
-----------------------------------

I think that we do not need to have strong guarantees here, because it is "just statistics".

> Calculating metrics takes too long
> ----------------------------------
>
>                 Key: SLING-11192
>                 URL: https://issues.apache.org/jira/browse/SLING-11192
>             Project: Sling
>          Issue Type: Improvement
>          Components: Event
>    Affects Versions: Event 4.2.24
>            Reporter: Joerg Hoh
>            Assignee: Carsten Ziegeler
>            Priority: Major
>             Fix For: Event 4.3.2
>
>
> We use the Prometheus exporter to export Sling Metrics / Dropwizard metrics,
> and we often see messages like this:
> {noformat}
> 10.03.2022 08:50:15.333 [...] *WARN* [qtp568481508-1779]
> io.prometheus.client.dropwizard.DropwizardExports Gauge has been blacklisted
> for 30 ms due timeout: Generated from Dropwizard metric import
> (metric=sling_event.jobs.cancelled.count,
> type=org.apache.sling.event.impl.jobs.stats.GaugeSupport$2)
> {noformat}
> This means that calculating the metric took too long. We should make sure
> that the calculation is done asynchronously and only pre-computed values are
> returned.
> For at least these values the handling needs to be improved:
> * sling_event.jobs.active.count
> * sling_event.jobs.averageProcessingTime
> * sling_event.jobs.averageWaitingTime
> * sling_event.jobs.cancelled.count

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
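The issue description asks for the calculation to run asynchronously so that a scrape only returns a pre-computed value. A minimal sketch of that idea, using only the JDK, could look like the following. The class `PrecomputedGauge` and its methods are hypothetical names for illustration and are not part of Sling Event or Sling Metrics.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical helper: refreshes a value in the background, serves it without blocking.
class PrecomputedGauge implements AutoCloseable {
    private final LongSupplier expensive;   // the potentially slow calculation
    private volatile long lastValue;        // latest pre-computed result

    // Single daemon thread that periodically re-runs the calculation.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "gauge-refresh");
                t.setDaemon(true);
                return t;
            });

    PrecomputedGauge(LongSupplier expensive, long period, TimeUnit unit) {
        this.expensive = expensive;
        refresh(); // compute once up front so the first read has a value
        scheduler.scheduleWithFixedDelay(this::refresh, period, period, unit);
    }

    // Runs the calculation and publishes the result.
    void refresh() {
        lastValue = expensive.getAsLong();
    }

    // Called by the metrics exporter: a plain volatile read, never the calculation itself.
    long getValue() {
        return lastValue;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The trade-off is that a read may return a value that is up to one refresh period old, which matches the "just statistics" expectation voiced in the thread.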
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510403#comment-17510403 ]

Stefan Egli commented on SLING-11192:
-------------------------------------

Thx, agree.
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510323#comment-17510323 ]

Carsten Ziegeler commented on SLING-11192:
------------------------------------------

[~stefanegli] Thanks, yes, the contract for getNumberOfProcessedJobs is now a little bit weaker - but I don't think that this is a problem, as Statistics is not immutable and therefore calling two methods on it has no consistency guarantees.

And yes, that mutability is the underlying problem. The state might change between any two method calls. I was wondering if we should fix this - which would mean deprecating the current approach and providing a way to get an immutable state object. On the other hand, we have had this inconsistency from the beginning and it doesn't seem to be a real-world problem.
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510056#comment-17510056 ]

Stefan Egli commented on SLING-11192:
-------------------------------------

[~cziegeler], I agree, not all methods need to be synchronized. And some key ones I noticed you did leave, to which I agree. The only suspicious one standing out now is getNumberOfProcessedJobs: that one would previously have given you the state of the world at a certain, precise point of time - while now it is a combination of 3 different times. Not saying that is a problem, just wanted to hear how you see this.

{quote}the way this interface and implementation works doesn't provide consistency regardless.
{quote}
are you referring to the fact that {{Statistics}} is not immutable?
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509733#comment-17509733 ]

Carsten Ziegeler commented on SLING-11192:
------------------------------------------

[~joerghoh]/[~stefanegli]/[~rombert] It would be great if one of you could have a look at my changes. Thanks
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509713#comment-17509713 ]

Carsten Ziegeler commented on SLING-11192:
------------------------------------------

Removed the sync on read in https://github.com/apache/sling-org-apache-sling-event/commit/91fca04d1bd428a57e1b16800536a3b7fdee8e20
The sync on change/write is still in place to order updates to the statistics
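The pattern described in this comment (writers synchronized, readers not) can be sketched as follows. This is an illustration only and not the code from the linked commit: the class `JobCounters` and its method names are hypothetical, and the use of `volatile` fields is an assumption about how lock-free reads could be made visible across threads.

```java
// Hypothetical names throughout; this is not the actual StatisticsImpl code.
class JobCounters {
    // volatile: each individual read sees the most recently completed write.
    private volatile long finished;
    private volatile long failed;
    private volatile long cancelled;

    // Writers stay synchronized so concurrent updates are applied one at a time.
    synchronized void jobFinished()  { finished++; }
    synchronized void jobFailed()    { failed++; }
    synchronized void jobCancelled() { cancelled++; }

    // Readers take no lock, so a metrics scrape never waits for a writer.
    long getFinished()  { return finished; }
    long getFailed()    { return failed; }
    long getCancelled() { return cancelled; }

    // Three independent reads: under concurrent updates the sum may combine
    // values taken at slightly different moments.
    long getProcessed() { return finished + failed + cancelled; }
}
```

With this shape a derived value such as `getProcessed()` is no longer an exact point-in-time figure, which is the weaker contract discussed later in the thread.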
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508828#comment-17508828 ]

Carsten Ziegeler commented on SLING-11192:
------------------------------------------

I think we can remove all syncing - the way this interface and implementation work doesn't provide consistency regardless. So the easiest is to remove all synchronized statements
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508815#comment-17508815 ]

Carsten Ziegeler commented on SLING-11192:
------------------------------------------

We can change the implementation to avoid sync on read by using immutable state objects
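One way to read "immutable state objects" is a snapshot that is replaced as a whole on every update, so a reader gets all counters from the same moment without taking a lock. The sketch below assumes that reading; `JobSnapshot` and `SnapshotStatistics` are hypothetical names, not existing Sling classes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable value object: all fields are final and set once in the constructor.
final class JobSnapshot {
    final long finished;
    final long failed;
    final long cancelled;

    JobSnapshot(long finished, long failed, long cancelled) {
        this.finished = finished;
        this.failed = failed;
        this.cancelled = cancelled;
    }

    // Derived from fields of the same snapshot, so the sum is always consistent.
    long processed() {
        return finished + failed + cancelled;
    }
}

// Hypothetical holder: every update swaps in a new snapshot atomically.
class SnapshotStatistics {
    private final AtomicReference<JobSnapshot> current =
            new AtomicReference<>(new JobSnapshot(0, 0, 0));

    void jobFinished() {
        current.updateAndGet(s -> new JobSnapshot(s.finished + 1, s.failed, s.cancelled));
    }

    void jobFailed() {
        current.updateAndGet(s -> new JobSnapshot(s.finished, s.failed + 1, s.cancelled));
    }

    void jobCancelled() {
        current.updateAndGet(s -> new JobSnapshot(s.finished, s.failed, s.cancelled + 1));
    }

    // Lock-free read; the returned object never changes afterwards.
    JobSnapshot snapshot() {
        return current.get();
    }
}
```

A caller that needs several values takes one `snapshot()` and reads all of them from it, instead of calling several getters on a mutable object.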
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508812#comment-17508812 ]

Carsten Ziegeler commented on SLING-11192:
------------------------------------------

It's for consistency, as all of them are updated by a single method: https://github.com/apache/sling-org-apache-sling-event/blob/master/src/main/java/org/apache/sling/event/impl/jobs/stats/StatisticsImpl.java#L226
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508795#comment-17508795 ]

Joerg Hoh commented on SLING-11192:
-----------------------------------

These metrics are all defined in [GaugeSupport|https://github.com/apache/sling-org-apache-sling-event/blob/master/src/main/java/org/apache/sling/event/impl/jobs/stats/GaugeSupport.java] and delegate to the [StatisticsImpl|https://github.com/apache/sling-org-apache-sling-event/blob/master/src/main/java/org/apache/sling/event/impl/jobs/stats/StatisticsImpl.java]; many methods there are synchronized.

[~stefanegli] do you know why this is the case? I don't see a reason why so many of these read operations are synchronized. And even then almost no calculation is done; only variables are read. So there is definitely no heavy calculation.
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508792#comment-17508792 ]

Robert Munteanu commented on SLING-11192:
-----------------------------------------

[~joerghoh] - yes, that is of course a valid approach. I was only wondering whether this is a problem that occurs multiple times and some metrics are inherently slow to calculate. But for now the sample size is 1 :-)
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508790#comment-17508790 ]

Joerg Hoh commented on SLING-11192:
-----------------------------------

[~rombert] Instead of caching I would rather find out why these methods are sometimes slow, and speed that up. Caching metrics would only be my 2nd choice.
[jira] [Commented] (SLING-11192) Calculating metrics takes too long
[ https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504305#comment-17504305 ]

Robert Munteanu commented on SLING-11192:
-----------------------------------------

Maybe there is an opportunity here to add some support classes to the Sling Metrics bundle that provide cached metrics.
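A cached metric of the kind suggested here could be as small as a supplier that remembers its last result for a fixed time. The sketch below uses only the JDK; `CachingSupplier` is a hypothetical name and not an existing Sling Metrics class, and the injectable clock exists only to make the behaviour testable. Dropwizard Metrics ships a `CachedGauge` class built on a similar idea.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical support class: caches the delegate's result for ttlNanos.
class CachingSupplier<T> implements Supplier<T> {

    // Immutable pair of value and load time, published through a volatile field.
    private static final class Entry<T> {
        final T value;
        final long loadedAt;

        Entry(T value, long loadedAt) {
            this.value = value;
            this.loadedAt = loadedAt;
        }
    }

    private final Supplier<T> delegate;  // the potentially slow calculation
    private final long ttlNanos;         // how long a cached value stays valid
    private final LongSupplier clock;    // nanosecond clock, e.g. System::nanoTime
    private volatile Entry<T> entry;     // null until the first load

    CachingSupplier(Supplier<T> delegate, long ttlNanos, LongSupplier clock) {
        this.delegate = delegate;
        this.ttlNanos = ttlNanos;
        this.clock = clock;
    }

    @Override
    public T get() {
        long now = clock.getAsLong();
        Entry<T> e = entry;
        // Reload when nothing is cached yet or the cached value has expired.
        // No lock is taken: two threads racing here may both reload, which is
        // harmless for statistics.
        if (e == null || now - e.loadedAt >= ttlNanos) {
            e = new Entry<>(delegate.get(), now);
            entry = e;
        }
        return e.value;
    }
}
```

In production code the clock argument would typically be `System::nanoTime`. Note that the first read after expiry still pays the full cost of the calculation, so this alone does not guarantee a fast scrape.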