Re: pg_stat_statements locking

2022-09-13 Thread Julien Rouhaud
On Tue, Sep 13, 2022 at 10:38:13AM +0500, Andrey Borodin wrote:
> 
> And the other way is refactoring towards a partitioned hashtable, namely
> dshash. But I don't see how partitioned locking can save us from a locking
> disaster. The problem is caused by reads of the whole pgss view colliding
> with reset() or GC.

If you store the query texts in DSM, you won't have a query text file to
maintain and the GC problem will disappear.

> Both of these operations deal with every partition - they will
> conflict anyway, with the same result. A time-consuming read of each partition
> will prevent an exclusive lock by reset(), and a queued exclusive lock will
> prevent any reads from the hashtable.

Conflicts would still be possible, just less likely and shorter-lived, as the
whole dshash is never locked globally, just one partition at a time (except
when the dshash is resized, but those locks aren't held for long and resizing
isn't frequent).

But the biggest improvements should come from reusing the pgstats
infrastructure.  I've only glanced at it so I don't know much about it, but
it has a per-backend hashtable to cache some information and avoid too many
accesses to the shared hash table, and a mechanism to accumulate entries and do
batch updates.
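The accumulate-and-flush idea can be sketched as follows (a toy single-process
model; the names are illustrative and not the pgstats API):

```python
import threading

shared_lock = threading.Lock()
shared_counts = {}          # stands in for the shared hash table
pending = {}                # per-backend local cache of pending updates

def record(queryid):
    # Cheap local update: no shared lock taken on the hot path.
    pending[queryid] = pending.get(queryid, 0) + 1

def flush():
    # One lock acquisition applies many accumulated updates at once.
    with shared_lock:
        for qid, n in pending.items():
            shared_counts[qid] = shared_counts.get(qid, 0) + n
    pending.clear()

for _ in range(1000):
    record("SELECT 1")
record("SELECT 2")
flush()                     # one shared-lock round trip for 1001 updates
print(shared_counts)
```

Batching like this shrinks the window during which the shared structure is
locked, which is exactly what hurts pgss under contention.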




Re: pg_stat_statements locking

2022-09-12 Thread Andrey Borodin



> On 12 Sep 2022, at 23:01, Tom Lane  wrote:
> 
> Andrey Borodin  writes:
>>> On 12 Sep 2022, at 18:18, Julien Rouhaud  wrote:
>>> That being
>>> said I don't know if adding a timeout would be too expensive for the lwlock
>>> infrastructure.
> 
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

Thanks for clarifying this, Tom. I agree that spreading timeout-based 
algorithms is not a good thing. And when you have a hammer, everything looks 
like a nail, so it would be tempting to use timeouts here and there.


> On 12 Sep 2022, at 23:08, Julien Rouhaud  wrote:
> 
> That's what I was thinking, so it looks like a show-stopper for the proposed
> patch.

So, the only option to make things configurable is a switch between 
waiting and waitless locks.

And the other way is refactoring towards a partitioned hashtable, namely 
dshash. But I don't see how partitioned locking can save us from a locking 
disaster. The problem is caused by reads of the whole pgss view colliding with 
reset() or GC. Both of these operations deal with every partition - they will 
conflict anyway, with the same result. A time-consuming read of each partition 
will prevent an exclusive lock by reset(), and a queued exclusive lock will 
prevent any reads from the hashtable.


Best regards, Andrey Borodin.



Re: pg_stat_statements locking

2022-09-12 Thread Julien Rouhaud
On Mon, Sep 12, 2022 at 02:01:23PM -0400, Tom Lane wrote:
> Andrey Borodin  writes:
> >> On 12 Sep 2022, at 18:18, Julien Rouhaud  wrote:
> >> That being
> >> said I don't know if adding a timeout would be too expensive for the lwlock
> >> infrastructure.
> 
> I want to object fiercely to loading down LWLock with anything like
> timeouts.  It's supposed to be "lightweight".  If we get away from
> that we're just going to find ourselves needing another lighter-weight
> lock mechanism.

That's what I was thinking, so it looks like a show-stopper for the proposed
patch.




Re: pg_stat_statements locking

2022-09-12 Thread Tom Lane
Andrey Borodin  writes:
>> On 12 Sep 2022, at 18:18, Julien Rouhaud  wrote:
>> That being
>> said I don't know if adding a timeout would be too expensive for the lwlock
>> infrastructure.

I want to object fiercely to loading down LWLock with anything like
timeouts.  It's supposed to be "lightweight".  If we get away from
that we're just going to find ourselves needing another lighter-weight
lock mechanism.

regards, tom lane




Re: pg_stat_statements locking

2022-09-12 Thread Andrey Borodin



> On 12 Sep 2022, at 18:18, Julien Rouhaud  wrote:
> 
> That being
> said I don't know if adding a timeout would be too expensive for the lwlock
> infrastructure.

The implementation itself is straightforward, but we need to add 3 
implementations of waiting for a semaphore with a timeout.
POSIX has sem_timedwait(). Windows' WaitForMultipleObjectsEx() has a timeout 
arg. SysV has semtimedop().
With those we could add something like LWLockAcquireWithTimeout().
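For illustration, here is a Python analogue of the proposed call
(LWLockAcquireWithTimeout() is only a suggested name above, not an existing
PostgreSQL API; threading.Lock.acquire stands in for the semaphore wait):

```python
import threading

pgss_lock = threading.Lock()

def store_stats_with_timeout(update, timeout_s=0.01):
    """Sketch of the proposed semantics: bound the delay that stats
    collection may add to a query; on timeout, drop the update."""
    if pgss_lock.acquire(timeout=timeout_s):
        try:
            update()
        finally:
            pgss_lock.release()
        return True
    return False     # stats skipped, but the query is not blocked

applied = store_stats_with_timeout(lambda: None)
print(applied)       # True: the lock was free
```

With timeout_s=0 this degenerates to the conditional-acquire patch discussed
earlier in the thread; a nonzero value lets the operator trade a bounded delay
for fewer dropped entries.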

Does adding all this stuff sound like a good tradeoff for a lock-safe 
pg_stat_statements? If so, I'll start implementing it.

Best regards, Andrey Borodin.



Re: pg_stat_statements locking

2022-09-12 Thread Julien Rouhaud
On Mon, Sep 12, 2022 at 05:32:55PM +0500, Andrey Borodin wrote:
>
> > On 12 Sep 2022, at 13:40, Julien Rouhaud  wrote:
> >
> > I'm not a fan of that patch as it now silently ignores entries if the lwlock
> > can't be acquired *immediately*, without any way to avoid that if your
> > configuration and/or workload doesn't lead to this problem, or any way to 
> > know
> > that entries were ignored.
>
> Practically, a workload with this configuration is uncommon. At least I could
> not find any reports of such locking.  But theoretically, all the
> prerequisites of a disaster are very common (a variety of queries + some QPS
> of the pg_stat_statements view + small work_mem + an occasional reset() or GC).

Simply needing to evict entries regularly is already famous for being really
expensive.  See for instance [1].

> Maybe it's only a problem of programs that use pgss. pgwatch calls pgss
> on every DB in the cluster; that's how a once-a-minute check becomes several
> queries per second.

Ah, I wasn't sure if that's what you meant in your original message.  Calling
pg_stat_statements *for every database* doesn't sound like a good idea.

Also, ideally you shouldn't need to retrieve the query text every time.  There's
now pg_stat_statements_info.dealloc, so between that and the number of rows
returned you can easily know if there are new query texts that you haven't seen
yet, and cache those on the application side rather than retrieving them again
and again.
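The suggested client-side caching could look roughly like this (a hypothetical
sketch: the actual fetch of texts is stubbed out, and need_texts() is a made-up
helper, not part of any tool's API):

```python
# Only refetch query texts when pg_stat_statements_info.dealloc moved
# or previously unseen queryids appear in the view.
text_cache = {}           # queryid -> query text
last_dealloc = -1

def need_texts(dealloc, queryids):
    """Return the queryids whose text must be fetched this round."""
    global last_dealloc
    evicted = dealloc != last_dealloc
    last_dealloc = dealloc
    unseen = [q for q in queryids if q not in text_cache]
    if evicted:
        # Entries were deallocated: drop cached texts that vanished.
        for q in list(text_cache):
            if q not in queryids:
                del text_cache[q]
    return unseen

# First poll: everything is new and must be fetched.
for q in need_texts(0, [1, 2, 3]):
    text_cache[q] = f"text of {q}"   # stand-in for the real SQL fetch
print(need_texts(0, [1, 2, 3]))      # []  second poll transfers no texts
```

The payoff is that steady-state polling stops paying the cost of shipping
(and holding locks while reading) the full query text set every minute.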

> Personally, I'd prefer if I could configure a timeout to acquire the lock.
> That timeout would denote the maximum delay that pgss can incur on the query.
> But we would need to extend the LWLock API for this.

Yeah, that's what I meant by "immediately" in my previous message.  That being
said I don't know if adding a timeout would be too expensive for the lwlock
infrastructure.

> > On 12 Sep 2022, at 13:40, Julien Rouhaud  wrote:
> >
> > I think that the better long term approach is to move pg_stat_statements and
> > the query texts to dynamic shared memory.
>
> BTW we don't even need dynamic memory. We just need shared memory,
> probably pre-allocated.  I agree that pgss must reside in main memory only,
> never on disk.

We still need dynamic shared memory to get rid of the query text file, which is
a big problem on its own.  For the main hash table, relying on dshash could
allow increasing the maximum number of entries without a restart, which could
be cool if you're in a situation where you have a finite number of entries
that's higher than pg_stat_statements.max (like after creating a new role or
something).
>
> But we will still have the possibility of long lock conflicts preventing
> queries from completing. And the ability to configure a timeout for the pgss
> hooks would be useful anyway.

I didn't look thoroughly at the new pgstats infrastructure, but from what I saw
it should be able to alleviate most of the current problems.

[1] https://twitter.com/AndresFreundTec/status/1105585237772263424




Re: pg_stat_statements locking

2022-09-12 Thread Andrey Borodin



> On 12 Sep 2022, at 13:40, Julien Rouhaud  wrote:
> 
> I'm not a fan of that patch as it now silently ignores entries if the lwlock
> can't be acquired *immediately*, without any way to avoid that if your
> configuration and/or workload doesn't lead to this problem, or any way to know
> that entries were ignored.

Practically, a workload with this configuration is uncommon. At least I could 
not find any reports of such locking.
But theoretically, all the prerequisites of a disaster are very common (a 
variety of queries + some QPS of the pg_stat_statements view + small work_mem + 
an occasional reset() or GC).

Maybe it's only a problem of programs that use pgss. pgwatch calls pgss on 
every DB in the cluster; that's how a once-a-minute check becomes several 
queries per second.

Personally, I'd prefer if I could configure a timeout to acquire the lock. That 
timeout would denote the maximum delay that pgss can incur on the query. But we 
would need to extend the LWLock API for this.



> On 12 Sep 2022, at 13:40, Julien Rouhaud  wrote:
> 
> I think that the better long term approach is to move pg_stat_statements and
> the query texts to dynamic shared memory. 

BTW we don't even need dynamic memory. We just need shared memory, probably 
pre-allocated.
I agree that pgss must reside in main memory only, never on disk.

But we will still have the possibility of long lock conflicts preventing 
queries from completing. And the ability to configure a timeout for the pgss 
hooks would be useful anyway.


Best regards, Andrey Borodin.



Re: pg_stat_statements locking

2022-09-12 Thread Julien Rouhaud
Hi,

On Mon, Sep 12, 2022 at 11:52:28AM +0500, Andrey Borodin wrote:
>
> === How to fix ===
> pgss uses an LWLock to protect the hashtable.
> When the hashtable is reset or a new query is inserted in pgss_store(), an
> exclusive lock is used.
> When stats are updated for a known query or pg_stat_statements is read, a
> shared lock is used.
>
> I propose to swap the shared lock for stats updates to a conditional shared
> lock. It cannot be taken during reset() - and that's totally fine. It cannot
> be taken while a new query is being entered - and I think it's OK to spill
> some stats in this case. Patch attached.
>
> This patch prevents the disaster incurred by the locking described here.

I'm not a fan of that patch as it now silently ignores entries if the lwlock
can't be acquired *immediately*, without any way to avoid that if your
configuration and/or workload doesn't lead to this problem, or any way to know
that entries were ignored.
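For reference, the skip-on-contention behavior under discussion can be modeled
with a non-blocking acquire (a Python analogue only; the actual patch operates
on PostgreSQL LWLocks in C):

```python
import threading

pgss_lock = threading.Lock()
dropped = 0

def store_stats_conditional(update):
    """Sketch of the patch's semantics: try the lock without blocking;
    if somebody holds it (e.g. a long reset() or view read), skip the
    stats update entirely."""
    global dropped
    if pgss_lock.acquire(blocking=False):
        try:
            update()
        finally:
            pgss_lock.release()
        return True
    dropped += 1     # note: the real patch exposes no such counter
    return False

print(store_stats_conditional(lambda: None))   # True: lock was free
pgss_lock.acquire()                            # simulate a concurrent reset()
print(store_stats_conditional(lambda: None))   # False: update dropped
pgss_lock.release()
```

The `dropped` counter here makes the objection concrete: without something
like it, entries vanish with no way to observe or bound the loss.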

> === Other possible mitigation ===
> The incident would not occur if the tuplestore did not convert into an
> on-disk representation. But I don't see how to reliably protect against this.

I'm not sure that's true.  If you need an exclusive lwlock it means that new
entries are added.  If that keeps happening it means that you will eventually
need to defragment the query text file, and this operation will certainly hold
an exclusive lwlock for a very long time.

I think that the better long term approach is to move pg_stat_statements and
the query texts to dynamic shared memory.  This should also help in this
scenario as dshash is partitioned, so you don't have a single lwlock for the
whole hash table.  And as discussed recently, see [1], we should make the stat
collector extensible to reuse it in extensions like pg_stat_statements to
benefit from all the other optimizations.

[1] 
https://www.postgresql.org/message-id/20220818195124.c7ipzf6c5v7vx...@awork3.anarazel.de