Hi Giedrius, thanks for looking at this! Before I submit the fix PRs (plural, see below) I need some feedback on the intended implementation. I don't see a simple fix which could be contained in Pushgateway only. It seems that some changes to Prometheus code (prometheus/client_golang to be specific) are required in addition to modifying Pushgateway.

The current logic for the duplicates check is in client_golang in the Gather() method:
https://github.com/prometheus/client_golang/blob/aea1a5996a9d8119592baea7310810c65dc598f5/prometheus/registry.go#L424

Unfortunately this API can only take a whole set of metrics and answer if it's consistent. It does so by calculating hashes for the whole set, which in case of Pushgateway leads to quadratic complexity. Pushgateway keeps a dynamic set of metrics and needs to keep track of its consistency and you cannot do it efficiently using the current Gather() API.

So my plan is to:

* factor out the consistency logic from client_golang's Gather() so that it also works for a dynamically changing set of metrics
* basically expose a data structure which keeps the set of metrics with their hashes
* use this logic in Pushgateway

The alternative is to do a Pushgateway-only fix but this would require duplicating the logic for consistency checks, so it's probably worse than the fix above.

Does that sound sensible?

--
Rafał

On Mon, Jan 6, 2025 at 1:08 PM Giedrius Statkevičius <[email protected]> wrote:

> Hello,
>
> I'm not "a Prometheus dev" but this is something that I am interested in.
> Could you open up a PR with the benchmark and the fix? I'll help out with
> reviewing.
>
> Thanks,
> Giedrius
>
> On Saturday, 4 January 2025 at 09:52:01 UTC+2 Rafał Dowgird wrote:
>
>> Dear and Esteemed Prometheus developers,
>>
>> I'd like to discuss with you a performance problem with Pushgateway,
>> namely that the complexity of adding n metrics might get quadratic
>> (O(n^2)). Details follow.
>>
>> We have a mixed push/scrape system where Pushgateway handles some of the
>> metrics which come from batch jobs. While migrating some jobs to
>> Pushgateway we hit a performance bottleneck. We worked around this by
>> sharding Pushgateway. Still the sharded setup is more complex and the
>> amount of data wasn't that big, so we investigated the Pushgateway side of
>> things.
>>
>> It seems that the root of the problem is that every push operation causes
>> recalculation of hashes for all metrics already existing in the database.
>> This is how the consistency check logic works at present.
>>
>> I have created a simple benchmark to isolate/demonstrate the problem:
>> https://github.com/dowgird/pushgateway/commit/e0629ecb999c2f22cf098c87c78fc71cd0414733
>>
>> The output demonstrates that subsequent pushes of metrics get linearly
>> slower:
>>
>> I: 100 elapsed:220.379138ms diff:220.379138ms
>> I: 200 elapsed:505.576881ms diff:285.197743ms
>> I: 300 elapsed:841.153205ms diff:335.576324ms
>> .
>> .
>> .
>> I: 2700 elapsed:21.806380441s diff:1.391117119s
>> I: 2800 elapsed:23.229272852s diff:1.422892411s
>> I: 2900 elapsed:24.674250223s diff:1.444977371s
>>
>> A possible fix doesn't look very complicated algorithmically (memoizing
>> the hashes should work). Code-wise it's a bit more complex, which is a part
>> of why I'm writing this message. I can contribute the fix but this would
>> require some discussion of client API.
>>
>> The other part is that I understand from documentation and communications
>> on github issues that Pushgateway is not meant to be high performance. That
>> said, I still think it would be beneficial to remove this particular
>> performance bottleneck - there seem to be other people hitting it
>> (https://github.com/prometheus/pushgateway/issues/643 might be caused by
>> this).
>>
>> Would you be open to accepting a fix for this issue?
>>
>> --
>> Rafał
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/prometheus-developers/b53b5c14-a4ea-4c21-8a53-aeb1cb0a6036n%40googlegroups.com
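
For illustration only, here is a rough sketch of the idea from the plan in the reply above: a data structure that keeps the set of metrics together with their hashes, so that a push only hashes the incoming metrics instead of rehashing everything already stored. All names here (Metric, HashedSet, hashMetric) are hypothetical and are not part of client_golang or Pushgateway. The sketch only covers the duplicate check and ignores hash collisions and the other consistency rules that Gather() enforces.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Metric is a simplified, hypothetical stand-in for a pushed metric:
// a name plus a set of labels.
type Metric struct {
	Name   string
	Labels map[string]string
}

// hashMetric hashes the name and the sorted label pairs, so the result
// does not depend on map iteration order.
func hashMetric(m Metric) uint64 {
	keys := make([]string, 0, len(m.Labels))
	for k := range m.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(m.Name))
	for _, k := range keys {
		h.Write([]byte{0xff}) // separator between fields
		h.Write([]byte(k))
		h.Write([]byte{0xff})
		h.Write([]byte(m.Labels[k]))
	}
	return h.Sum64()
}

// HashedSet remembers the hash of every metric it currently holds.
type HashedSet struct {
	seen map[uint64]struct{}
}

// NewHashedSet returns an empty set.
func NewHashedSet() *HashedSet {
	return &HashedSet{seen: make(map[uint64]struct{})}
}

// Add hashes only the metrics in batch. If any of them duplicates a stored
// metric or another metric in the same batch, the whole batch is rejected
// and the set is left unchanged.
func (s *HashedSet) Add(batch []Metric) error {
	pending := make(map[uint64]struct{}, len(batch))
	for _, m := range batch {
		h := hashMetric(m)
		if _, dup := s.seen[h]; dup {
			return fmt.Errorf("metric %q duplicates a stored metric", m.Name)
		}
		if _, dup := pending[h]; dup {
			return fmt.Errorf("metric %q appears twice in one push", m.Name)
		}
		pending[h] = struct{}{}
	}
	for h := range pending {
		s.seen[h] = struct{}{}
	}
	return nil
}

// Remove forgets the given metrics, again hashing only those metrics.
func (s *HashedSet) Remove(batch []Metric) {
	for _, m := range batch {
		delete(s.seen, hashMetric(m))
	}
}

// Len reports how many metrics are currently tracked.
func (s *HashedSet) Len() int { return len(s.seen) }

func main() {
	s := NewHashedSet()
	a := Metric{Name: "batch_duration_seconds", Labels: map[string]string{"job": "a"}}
	b := Metric{Name: "batch_duration_seconds", Labels: map[string]string{"job": "b"}}

	fmt.Println("first push:", s.Add([]Metric{a, b}))
	fmt.Println("repeat push:", s.Add([]Metric{a}))
	fmt.Println("tracked metrics:", s.Len())
}
```

With a structure like this, the cost of a push depends on the size of the push rather than on the number of metrics already stored, which is what would remove the quadratic behaviour seen in the benchmark.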

