Hi Giedrius, thanks for looking at this!

Before I submit the fix PRs (plural, see below) I need some feedback on the
intended implementation. I don't see a simple fix which could be contained
in Pushgateway only. It seems that some changes to Prometheus code
(prometheus/client_golang to be specific) are required in addition to
modifying Pushgateway.

The current logic for the duplicates check is in client_golang in the
Gather() method:
https://github.com/prometheus/client_golang/blob/aea1a5996a9d8119592baea7310810c65dc598f5/prometheus/registry.go#L424
Unfortunately, this API can only take a whole set of metrics and answer
whether it's consistent. It does so by calculating hashes for the whole set,
which in the case of Pushgateway leads to quadratic complexity. Pushgateway
keeps a dynamic set of metrics and needs to track its consistency as the set
changes, which cannot be done efficiently with the current Gather() API.
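
To make the quadratic growth concrete, here is a minimal Go sketch that
counts hash computations when every push re-hashes the whole stored set.
It is a toy model under stated assumptions, not the client_golang code:
hashMetric and pushAllRehash are hypothetical names, each push adds exactly
one metric, and FNV-1a is only a stand-in hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashMetric is a hypothetical stand-in for the per-metric hash computed
// during a consistency check. It is NOT the client_golang implementation.
func hashMetric(name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return h.Sum64()
}

// pushAllRehash simulates n pushes of one metric each. After every push the
// whole stored set is re-hashed, as a whole-set consistency check would do.
// It returns the total number of hash computations performed.
func pushAllRehash(n int) int {
	stored := make([]string, 0, n)
	hashCalls := 0
	for i := 0; i < n; i++ {
		stored = append(stored, fmt.Sprintf("metric_%d", i))
		// Rebuild the set of hashes from scratch for the whole store.
		seen := make(map[uint64]struct{}, len(stored))
		for _, m := range stored {
			seen[hashMetric(m)] = struct{}{}
			hashCalls++
		}
	}
	return hashCalls
}

func main() {
	for _, n := range []int{100, 200, 400} {
		fmt.Printf("pushes=%d hash computations=%d\n", n, pushAllRehash(n))
	}
}
```

In this model n single-metric pushes cost n(n+1)/2 hash computations, so
doubling the number of pushes roughly quadruples the total work.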

So my plan is to:

  * factor out the consistency logic from client_golang's Gather() so that
it also works for a dynamically changing set of metrics
    * basically expose a data structure which keeps the set of metrics with
their hashes
  * use this logic in Pushgateway
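
A rough sketch of what such a data structure could look like is below. All
names here (metricSet, Add, Remove, hashMetric) are hypothetical and do not
exist in client_golang today; the hash is a simplified stand-in over the
metric name and sorted labels. It ignores hash collisions and the other
checks Gather() performs, so treat it as an illustration of the idea only.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashMetric hashes a metric name plus its labels in sorted key order, so
// the result does not depend on map iteration order. Hypothetical helper.
func hashMetric(name string, labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	h.Write([]byte(name))
	for _, k := range keys {
		h.Write([]byte{0xff}) // separator between fields
		h.Write([]byte(k))
		h.Write([]byte{0xff})
		h.Write([]byte(labels[k]))
	}
	return h.Sum64()
}

// metricSet keeps the hashes of all metrics currently in the set, so a
// change only needs to hash the metric being added or removed.
type metricSet struct {
	seen map[uint64]struct{}
}

func newMetricSet() *metricSet {
	return &metricSet{seen: make(map[uint64]struct{})}
}

// Add registers a metric and reports an error if an identical name/label
// combination is already present. Cost: one hash, one map lookup.
func (s *metricSet) Add(name string, labels map[string]string) error {
	h := hashMetric(name, labels)
	if _, dup := s.seen[h]; dup {
		return fmt.Errorf("duplicate metric %q with labels %v", name, labels)
	}
	s.seen[h] = struct{}{}
	return nil
}

// Remove forgets a metric, e.g. when a pushed group is deleted or replaced.
func (s *metricSet) Remove(name string, labels map[string]string) {
	delete(s.seen, hashMetric(name, labels))
}

func main() {
	s := newMetricSet()
	labels := map[string]string{"job": "batch", "instance": "a"}
	fmt.Println(s.Add("jobs_done_total", labels)) // first add succeeds
	fmt.Println(s.Add("jobs_done_total", labels)) // second add is rejected
	s.Remove("jobs_done_total", labels)
	fmt.Println(s.Add("jobs_done_total", labels)) // succeeds after removal
}
```

With a structure like this, each push would hash only the metrics it
touches instead of everything already stored.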

The alternative is to do a Pushgateway-only fix but this would require
duplicating the logic for consistency checks, so it's probably worse than
the fix above.

Does that sound sensible?

--
Rafał

On Mon, Jan 6, 2025 at 1:08 PM Giedrius Statkevičius <[email protected]>
wrote:

> Hello,
>
> I'm not "a Prometheus dev" but this is something that I am interested in.
> Could you open up a PR with the benchmark and the fix? I'll help out with
> reviewing.
>
> Thanks,
> Giedrius
>
> On Saturday, 4 January 2025 at 09:52:01 UTC+2 Rafał Dowgird wrote:
>
>> Dear and Esteemed Prometheus developers,
>>
>> I'd like to discuss with you a performance problem with Pushgateway,
>> namely that the complexity of adding n metrics might get quadratic
>> (O(n^2)). Details follow.
>>
>> We have a mixed push/scrape system where Pushgateway handles some of the
>> metrics which come from batch jobs. While migrating some jobs to
>> Pushgateway we hit a performance bottleneck. We worked around this by
>> sharding Pushgateway. Still, the sharded setup is more complex and the
>> amount of data wasn't that big, so we investigated the Pushgateway side of
>> things.
>>
>> It seems that the root of the problem is that every push operation causes
>> recalculation of hashes for all metrics already existing in the database.
>> This is how the consistency check logic works at present.
>>
>> I have created a simple benchmark to isolate/demonstrate the problem:
>> https://github.com/dowgird/pushgateway/commit/e0629ecb999c2f22cf098c87c78fc71cd0414733
>>
>> The output demonstrates that subsequent pushes of metrics get linearly
>> slower:
>>
>> I: 100 elapsed:220.379138ms diff:220.379138ms
>> I: 200 elapsed:505.576881ms diff:285.197743ms
>> I: 300 elapsed:841.153205ms diff:335.576324ms
>> .
>> .
>> .
>> I: 2700 elapsed:21.806380441s diff:1.391117119s
>> I: 2800 elapsed:23.229272852s diff:1.422892411s
>> I: 2900 elapsed:24.674250223s diff:1.444977371s
>>
>> A possible fix doesn't look very complicated algorithmically (memoizing
>> the hashes should work). Code-wise it's a bit more complex, which is part
>> of why I'm writing this message. I can contribute the fix, but this would
>> require some discussion of the client API.
>>
>> The other part is that I understand from the documentation and discussions
>> in GitHub issues that Pushgateway is not meant to be high performance. That
>> said, I still think it would be beneficial to remove this particular
>> performance bottleneck - there seem to be other people hitting it (
>> https://github.com/prometheus/pushgateway/issues/643 might be caused by
>> this).
>>
>> Would you be open to accepting a fix for this issue?
>>
>> --
>> Rafał
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/prometheus-developers/b53b5c14-a4ea-4c21-8a53-aeb1cb0a6036n%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-developers/b53b5c14-a4ea-4c21-8a53-aeb1cb0a6036n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

