[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-22 Thread Joerg Hoh (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510411#comment-17510411
 ] 

Joerg Hoh commented on SLING-11192:
---

I think that we do not need strong guarantees here, because it is 
"just statistics".

> Calculating metrics takes too long
> --
>
> Key: SLING-11192
> URL: https://issues.apache.org/jira/browse/SLING-11192
> Project: Sling
> Issue Type: Improvement
> Components: Event
> Affects Versions: Event 4.2.24
> Reporter: Joerg Hoh
> Assignee: Carsten Ziegeler
> Priority: Major
> Fix For: Event 4.3.2
>
>
> We use the Prometheus exporter to export Sling Metrics / Dropwizard metrics, 
> and we often see messages like this:
> {noformat}
> 10.03.2022 08:50:15.333 [...] *WARN* [qtp568481508-1779] 
> io.prometheus.client.dropwizard.DropwizardExports Gauge has been blacklisted 
> for 30 ms due timeout:  Generated from Dropwizard metric import 
> (metric=sling_event.jobs.cancelled.count, 
> type=org.apache.sling.event.impl.jobs.stats.GaugeSupport$2) 
> {noformat}
> This means that calculating the metric took too long. We should make sure 
> that the calculation is done asynchronously and that only pre-computed values 
> are returned.
> The handling needs to be improved for at least these values:
> * sling_event.jobs.active.count
> * sling_event.jobs.averageProcessingTime
> * sling_event.jobs.averageWaitingTime
> * sling_event.jobs.cancelled.count
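
The proposal in the description (compute asynchronously, return only pre-computed values) could look roughly like the following. This is a minimal sketch using plain JDK types; the class `PrecomputedGauge` and its methods are hypothetical and not part of Sling or Dropwizard. A background task refreshes a `volatile` field, and the getter only reads that field.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical helper: the expensive computation never runs on the reader's thread.
final class PrecomputedGauge {
    private final LongSupplier expensive;
    // volatile so that readers see the most recently published value
    private volatile long lastValue;

    PrecomputedGauge(LongSupplier expensive) {
        this.expensive = expensive;
    }

    // Recompute and publish; intended to be called from the background task.
    void refresh() {
        lastValue = expensive.getAsLong();
    }

    // Cheap read; returns 0 until the first refresh has completed.
    long getValue() {
        return lastValue;
    }

    // Start a daemon thread that refreshes the value with a fixed delay.
    ScheduledExecutorService startRefreshing(long period, TimeUnit unit) {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "gauge-refresh");
                    t.setDaemon(true);
                    return t;
                });
        executor.scheduleWithFixedDelay(this::refresh, 0, period, unit);
        return executor;
    }
}
```

The trade-off is staleness: a reader may see a value that is up to one refresh period old, which seems acceptable for statistics.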



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-22 Thread Stefan Egli (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510403#comment-17510403
 ] 

Stefan Egli commented on SLING-11192:
-

Thx, agree.



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-22 Thread Carsten Ziegeler (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510323#comment-17510323
 ] 

Carsten Ziegeler commented on SLING-11192:
--

[~stefanegli] Thanks, yes, the contract for getNumberOfProcessedJobs is now a 
little bit weaker - but I don't think that this is a problem, as Statistics is 
not immutable and therefore calling two methods on it has no consistency 
guarantees. And yes, that mutability is the underlying problem. The state might 
change between any two method calls.
I was wondering if we should fix this - which would mean deprecating the 
current approach and providing a way to get an immutable state object. On the 
other hand, we have had this inconsistency from the beginning and it doesn't 
seem to be a real-world problem.



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-21 Thread Stefan Egli (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510056#comment-17510056
 ] 

Stefan Egli commented on SLING-11192:
-

[~cziegeler],
I agree, not all methods need to be synchronized. And I noticed you left some 
key ones synchronized, which I agree with.

The only suspicious one standing out now is getNumberOfProcessedJobs: that one 
would previously have given you the state of the world at a certain, precise 
point in time - while now it is a combination of 3 different times.

Not saying that is a problem, just wanted to hear how you see this.
{quote}the way this interface and implementation works doesn't provide 
consistency regardless.
{quote}
Are you referring to the fact that {{Statistics}} is not immutable?



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-21 Thread Carsten Ziegeler (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509733#comment-17509733
 ] 

Carsten Ziegeler commented on SLING-11192:
--

[~joerghoh]/[~stefanegli]/[~rombert] It would be great if one of you could 
have a look at my changes. Thanks.



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-21 Thread Carsten Ziegeler (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509713#comment-17509713
 ] 

Carsten Ziegeler commented on SLING-11192:
--

Removed the sync on read in 
https://github.com/apache/sling-org-apache-sling-event/commit/91fca04d1bd428a57e1b16800536a3b7fdee8e20
The sync on change/write is still in place to order updates to the statistics.
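
Roughly, the resulting pattern can be pictured as follows. This is a simplified sketch, not the code from the commit: the class name, the fields and all method names except getNumberOfProcessedJobs are assumptions made for illustration. Updates stay `synchronized` so they are applied in order, while getters read `volatile` fields without taking the lock.

```java
// Simplified sketch with hypothetical names; not the actual StatisticsImpl.
final class RelaxedStatistics {
    // volatile guarantees that each single read sees the latest written value
    private volatile long finishedJobs;
    private volatile long failedJobs;
    private volatile long cancelledJobs;

    // Writers still take the lock, so increments cannot be lost or reordered.
    synchronized void finishedJob() { finishedJobs++; }
    synchronized void failedJob() { failedJobs++; }
    synchronized void cancelledJob() { cancelledJobs++; }

    // Readers take no lock and therefore never wait for a writer.
    long getNumberOfFinishedJobs() { return finishedJobs; }
    long getNumberOfFailedJobs() { return failedJobs; }
    long getNumberOfCancelledJobs() { return cancelledJobs; }

    // Three independent reads: under concurrent updates the sum may combine
    // values taken at slightly different moments.
    long getNumberOfProcessedJobs() {
        return finishedJobs + failedJobs + cancelledJobs;
    }
}
```

A metrics scrape that calls one of these getters can then no longer be delayed by a thread holding the lock for an update.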



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-18 Thread Carsten Ziegeler (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508828#comment-17508828
 ] 

Carsten Ziegeler commented on SLING-11192:
--

I think we can remove all syncing - the way this interface and implementation 
works doesn't provide consistency regardless. So the easiest option is to 
remove all synchronized statements.



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-18 Thread Carsten Ziegeler (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508815#comment-17508815
 ] 

Carsten Ziegeler commented on SLING-11192:
--

We can change the implementation to avoid sync on read by using immutable state 
objects.
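
A minimal sketch of that idea, with hypothetical names (`JobStatsSnapshot`, `SnapshotStatistics`) that are not taken from the Sling code base: writers build a new immutable snapshot under a lock and publish it through an `AtomicReference`, so readers never block and always get a set of values that belong together.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable value object; all fields are final and set once.
final class JobStatsSnapshot {
    final long finished;
    final long failed;
    final long cancelled;

    JobStatsSnapshot(long finished, long failed, long cancelled) {
        this.finished = finished;
        this.failed = failed;
        this.cancelled = cancelled;
    }

    // Derived value computed from one consistent set of counters.
    long processed() {
        return finished + failed + cancelled;
    }
}

// Hypothetical holder: writes are serialized, reads are lock-free.
final class SnapshotStatistics {
    private final AtomicReference<JobStatsSnapshot> current =
            new AtomicReference<>(new JobStatsSnapshot(0, 0, 0));

    // Writers synchronize among themselves so that updates are ordered.
    synchronized void jobFinished() {
        JobStatsSnapshot s = current.get();
        current.set(new JobStatsSnapshot(s.finished + 1, s.failed, s.cancelled));
    }

    synchronized void jobCancelled() {
        JobStatsSnapshot s = current.get();
        current.set(new JobStatsSnapshot(s.finished, s.failed, s.cancelled + 1));
    }

    // Readers take no lock; the returned object never changes afterwards.
    JobStatsSnapshot snapshot() {
        return current.get();
    }
}
```

A caller that needs several values at once would fetch one snapshot and read all of them from it, instead of calling several getters on a mutable object.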



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-18 Thread Carsten Ziegeler (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508812#comment-17508812
 ] 

Carsten Ziegeler commented on SLING-11192:
--

It's for consistency, as all of them are updated by a single method: 
https://github.com/apache/sling-org-apache-sling-event/blob/master/src/main/java/org/apache/sling/event/impl/jobs/stats/StatisticsImpl.java#L226



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-18 Thread Joerg Hoh (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508795#comment-17508795
 ] 

Joerg Hoh commented on SLING-11192:
---

These metrics are all defined in 
[GaugeSupport|https://github.com/apache/sling-org-apache-sling-event/blob/master/src/main/java/org/apache/sling/event/impl/jobs/stats/GaugeSupport.java]
 and delegate to 
[StatisticsImpl|https://github.com/apache/sling-org-apache-sling-event/blob/master/src/main/java/org/apache/sling/event/impl/jobs/stats/StatisticsImpl.java];
 many methods there are synchronized.

[~stefanegli] do you know why this is the case? I don't see a reason why so 
many of these read operations are synchronized. 


But even then almost no calculation is done; only variables are read. So there 
is definitely no heavy calculation.



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-18 Thread Robert Munteanu (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508792#comment-17508792
 ] 

Robert Munteanu commented on SLING-11192:
-

[~joerghoh] - yes, that is of course a valid approach. I was only wondering 
whether this is a problem that occurs multiple times and some metrics are 
inherently slow to calculate. But for now the sample size is 1 :-)



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-18 Thread Joerg Hoh (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508790#comment-17508790
 ] 

Joerg Hoh commented on SLING-11192:
---

[~rombert] Instead of caching, I would rather find out why these methods 
are sometimes slow, and speed that up. Caching metrics would only be my 2nd 
choice.



[jira] [Commented] (SLING-11192) Calculating metrics takes too long

2022-03-10 Thread Robert Munteanu (Jira)


[ 
https://issues.apache.org/jira/browse/SLING-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504305#comment-17504305
 ] 

Robert Munteanu commented on SLING-11192:
-

Maybe there is an opportunity here to add some support classes to the Sling 
Metrics bundle that provide cached metrics.
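
One possible shape for such a support class, sketched with plain JDK types. The name `CachingSupplier` and its constructor parameters are hypothetical; nothing like this is claimed to exist in the Sling Metrics bundle. It wraps an expensive value source and recomputes the value at most once per time-to-live.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper: caches the delegate's value for a fixed time-to-live.
final class CachingSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private final LongSupplier clock; // injectable so that tests can control time
    private T value;
    private long loadedAt;
    private boolean loaded;

    CachingSupplier(Supplier<T> delegate, long ttl, TimeUnit unit, LongSupplier nanoClock) {
        this.delegate = delegate;
        this.ttlNanos = unit.toNanos(ttl);
        this.clock = nanoClock;
    }

    @Override
    public synchronized T get() {
        long now = clock.getAsLong();
        // Recompute on first use or once the cached value has expired.
        if (!loaded || now - loadedAt >= ttlNanos) {
            value = delegate.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

In production code the clock argument would typically be `System::nanoTime`. Note that the caller who hits an expired entry still pays for the recomputation, so this only reduces how often the slow path is taken. Dropwizard Metrics itself ships a `CachedGauge` class with similar semantics, which might be a better starting point than a new implementation.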
