Re: Limiting memory allocation

2022-05-20 Thread Oleksii Kliukin
Hi,

> On 18. May 2022, at 17:11, Alvaro Herrera  wrote:
> 
> On 2022-May-18, Jan Wieck wrote:
> 
>> Maybe I'm missing something, but what is it that you would actually consider
>> a solution? Knowing your current memory consumption doesn't make the need
>> for allocating some right now go away. What do you envision the response of
>> PostgreSQL to be if we had that information about resource pressure?
> 
> 
> What they (Timescale) do is have an LD_PRELOAD library that checks the
> status of memory pressure and returns NULL from malloc().  This then
> leads to a clean abort of transactions and all is well.  There's nothing
> that Postgres needs to do differently than today.

Correct. The library we have reads a limit supplied in an environment variable
and stores per-process and total memory usage values in shared memory counters,
updated after each call to malloc/free/calloc/realloc by the process making the
call.  When updating totals, a process picks one of N counters to update
atomically with the difference between its old and new memory usage, avoiding
congested ones; those are summed to determine current allocations for all
processes and to compare against the limit.
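
In skeleton form, the accounting looks roughly like this (the names, such as
alloc_shards and account_usage, are invented for illustration; the real
library is more elaborate, in particular about picking a shard):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define N_SHARDS 16

/* in the real library these counters live in shared memory */
static _Atomic int64_t alloc_shards[N_SHARDS];
static int64_t my_usage;        /* private to each process */

/* called from the malloc/free/calloc/realloc wrappers */
static void
account_usage(int64_t new_usage)
{
    /* pick a shard; the real library also avoids congested ones */
    int shard = (int) (getpid() % N_SHARDS);

    atomic_fetch_add(&alloc_shards[shard], new_usage - my_usage);
    my_usage = new_usage;
}

/* sum the shards to compare the current total against the limit */
static bool
over_limit(int64_t limit)
{
    int64_t total = 0;

    for (int i = 0; i < N_SHARDS; i++)
        total += atomic_load(&alloc_shards[i]);
    return total > limit;
}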

> 
> I suppose that what they would like is a way to inquire into the memory
> pressure status at MemoryContextAlloc() time and return NULL if it is
> too high.

If we call user code just before malloc (and, presumably, free and realloc), the
code would have to do just as much work as when it is called from the
malloc/free/realloc wrappers inside a preloaded library. Furthermore, I don’t
see why the user would want to customize that logic: a single Linux-specific
implementation would solve the problem for everyone.
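
For illustration only, the Linux-specific part could be as simple as reading
the cgroup v2 accounting file. The sketch below assumes the relevant cgroup
sits at the root of the hierarchy; a real implementation would resolve the
process's cgroup path from /proc/self/cgroup:

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool
cgroup_memory_exceeds(uint64_t limit)
{
    /* cgroup v2 exposes the group's current usage here, in bytes */
    FILE       *f = fopen("/sys/fs/cgroup/memory.current", "r");
    uint64_t    usage = 0;

    if (f == NULL)
        return false;           /* no cgroup v2 visible: report no pressure */
    if (fscanf(f, "%" SCNu64, &usage) != 1)
        usage = 0;
    fclose(f);
    return usage > limit;
}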


>  How exactly this would work is unclear to me; maybe one
> process keeps an eye on it in an OS-specific manner,

We don’t need to do anything for non-Linux systems, as cgroups and the OOM
killer don’t exist there.


> and if it does get
> near the maximum, set a bit in shared memory that other processes can
> examine when MemoryContextAlloc is called.  It doesn't have to be
> exactly accurate; an approximation is probably okay.

What would be the purpose of setting a bit in shared memory when the maximum
is about to be reached?

What would be useful is a way for Postgres to count the amount of memory
allocated by each backend.  This could be valuable for reporting per-backend
memory usage to the user, as well as for enforcing a limit on the total amount
of memory allocated by the backends.
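
As a rough sketch of that idea (all names here are invented, not an actual
PostgreSQL API): the allocation path bumps a per-backend counter and fails
the request once a configured limit is exceeded:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Invented names; a real patch would publish the counter in shared
 * memory so that other backends and monitoring views can read it.
 */
static uint64_t my_backend_bytes;
static uint64_t backend_limit_bytes = 512 * 1024 * 1024;

static void *
counted_alloc(size_t size)
{
    if (backend_limit_bytes != 0 &&
        my_backend_bytes + size > backend_limit_bytes)
        return NULL;            /* the caller turns this into an ERROR */

    my_backend_bytes += size;
    return malloc(size);
}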

—
Oleksii Kliukin



Issues with building cpp extensions on PostgreSQL 10+

2020-03-20 Thread Oleksii Kliukin
Hello,

I’ve recently tried to build an extension that includes C++ files and also
passes them to the linker to make a shared library. I’ve discovered a few
issues:

- in v10, CFLAGS_SL is not appended to CXXFLAGS in Makefile.shlib,
resulting in .cpp files being compiled without -fPIC and in errors when
creating the shared library out of them. In v11 and above, CFLAGS_SL is
prepended to PG_CXXFLAGS, but there are no PG_CXXFLAGS on v10, and the
Makefile does nothing to add them to CXXFLAGS. A patch is attached.

- not just on v10: when building .bc files from .cpp, no CXXFLAGS are
passed; as a result, when a source requires non-standard flags (e.g.,
-std=c++11), building the .bc files fails.

The rules in Makefile.global(.in) look like:

ifndef COMPILE.c.bc
# -Wno-ignored-attributes added so gnu_printf doesn't trigger
# warnings, when the main binary is compiled with C.
COMPILE.c.bc = $(CLANG) -Wno-ignored-attributes $(BITCODE_CFLAGS) $(CPPFLAGS) -flto=thin -emit-llvm -c
endif

ifndef COMPILE.cxx.bc
COMPILE.cxx.bc = $(CLANG) -xc++ -Wno-ignored-attributes $(BITCODE_CXXFLAGS) $(CPPFLAGS) -flto=thin -emit-llvm -c
endif

%.bc : %.c
$(COMPILE.c.bc) -o $@ $<

%.bc : %.cpp
$(COMPILE.cxx.bc) -o $@ $<

However, there seems to be no way to override BITCODE_CXXFLAGS to include
any specific C++ compilation flags that are also required to build the object
files from .cpp. The same applies to .bc files derived from .c files with
BITCODE_CFLAGS.

I am wondering if we could define something like PG_BITCODE_CXXFLAGS and
PG_BITCODE_CFLAGS in pgxs.mk to be able to override those. If this sounds
like the right strategy, I’ll prepare a patch.

Cheers,
Oleksii “Alex” Kliukin



cxxflags_shared_libraries_pg10.diff
Description: Binary data


pg_upgrade and subscriptions

2019-11-02 Thread Oleksii Kliukin
Hello,

I came across a surprising behavior when upgrading our PostgreSQL 10 DBs that
also serve as a destination for the logical replication of some reference 
tables.

pg_upgrade turns off all subscriptions on the cluster and doesn't turn them
back on. Specifically, it recreates them with connect = false, as discussed in
the thread starting at
https://www.postgresql.org/message-id/e4fbfad5-c6ac-fd50-6777-18c84b34e...@2ndquadrant.com

Unfortunately, I can't find any mention of this in the pg_upgrade docs, so
I am at least willing to add one if we can't resolve this in a more automated
way (see below).

Since we can determine which subscriptions were active on the old cluster
immediately before the upgrade, we could collect those and emit a script at
the end of pg_upgrade to turn them back on, similar to the one pg_upgrade
produces for analyze.

There are two options when re-enabling the subscriptions: either continue from
the current position (possibly discarding all changes that happened while the
database was offline), or truncate the destination tables and copy the data
again. The first one corresponds to setting copy_data=false when doing refresh
publication. The second one is a combination of a prior truncate + refresh
publication with copy_data=true, and doesn't seem to be an action that can be
performed in a single transaction. Would it make sense to add a copy_truncate
flag, false by default, that would instruct the logical replication worker to
truncate the destination table immediately before resyncing it from the
origin?

Regards,
Oleksii





Re: [HACKERS] Help required to debug pg_repack breaking logical replication

2019-08-31 Thread Oleksii Kliukin
Hello,

Petr Jelinek  wrote:

> On 08/10/17 15:21, Craig Ringer wrote:
>> On 8 October 2017 at 02:37, Daniele Varrazzo  
>> wrote:
>>> Hello,
>>> 
>>> we have received reports of, and I have experienced a couple of times,
>>> pg_repack breaking logical replication.
>>> 
>>> - https://github.com/reorg/pg_repack/issues/135
>>> - https://github.com/2ndQuadrant/pglogical/issues/113
>> 
>> Yeah, I was going to say I've seen reports of this with pglogical, but
>> I see you've linked to them.
>> 
>> I haven't had a chance to look into it though, and haven't had a
>> suitable reproducible test case.
>> 
>>> In the above issue #113, Petr Jelinek commented:
>>> 
 From quick look at pg_repack, the way it does table rewrite is almost 
 guaranteed
 to break logical decoding unless there is zero unconsumed changes for a 
 given table
 as it does not build the necessary mappings info for logical decoding that 
 standard
 heap rewrite in postgres does.
>>> 
>>> unfortunately he didn't follow up to further details requests.
>> 
>> At a guess he's referring to src/backend/access/heap/rewriteheap.c .
>> 
>> I'd explain better if I understood what was going on myself, but I
>> haven't really understood the logical decoding parts of that code.
>> 
>>> - Is Petr diagnosis right and freezing of logical replication is to be
>>> blamed to missing mapping?
>>> - Can you suggest a test to reproduce the issue reliably?
>>> - What are mapped relations anyway?
>> 
>> I can't immediately give you the answers you seek, but start by
>> studying src/backend/access/heap/rewriteheap.c . Notably
>> logical_end_heap_rewrite, logical_rewrite_heap_tuple,
>> logical_begin_heap_rewrite.
> 
> Yes that's exactly it. When table is rewritten we need to create mapping
> for every tuple that was created or removed (ie, insert, update or
> delete operation happened on it) since the oldest replication slot xmin
> for logical decoding to continue to work on that table after the
> rewrite. And pg_repack doesn't create that mapping.

Excuse me for posting to an almost 2-year-old thread; I haven’t found any more
recent discussions. I use pg_repack quite extensively, and while I am not
running it together with logical replication yet, there is a pressing need
to do so. I came across this discussion, and after reading it I feel the
answers to the question of whether it is safe to use pg_repack and logical
decoding together lack the details necessary to build a test case that
reproduces the problem.

Looking at the source code in rewriteheap.c, the functions mentioned above
are no-ops if state->rs_logical_rewrite is set to false, and the flag is only
set to true for system catalogs (which pg_repack doesn’t rewrite anyway) and
user catalog tables (which are uncommon). Does it imply that the scope of the
breakage is limited to only those two types of tables, or am I missing some
other part of the code Petr had in mind in the original comment stating that
pg_repack is almost guaranteed to break logical decoding?
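
To make that concrete, here is a toy model of the gating as I read it, with
invented types and names; it is a reading aid, not the actual rewriteheap.c
code:

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool    rs_logical_rewrite;     /* mirrors RewriteState's flag */
} RewriteStateData;

/* stands in for RelationIsAccessibleInLogicalDecoding() */
static bool
needs_logical_mappings(bool wal_level_logical, bool is_catalog_table)
{
    return wal_level_logical && is_catalog_table;
}

static void
rewrite_tuple(RewriteStateData *state)
{
    if (!state->rs_logical_rewrite)
        return;                     /* no-op for ordinary user tables */
    printf("mapping written for this tuple\n");
}

int
main(void)
{
    RewriteStateData state;

    /* ordinary user table: no mappings are ever produced */
    state.rs_logical_rewrite = needs_logical_mappings(true, false);
    rewrite_tuple(&state);

    /* (user) catalog table under logical wal_level: mappings produced */
    state.rs_logical_rewrite = needs_logical_mappings(true, true);
    rewrite_tuple(&state);
    return 0;
}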

Cheers,
Oleksii



Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

2019-06-19 Thread Oleksii Kliukin
Alvaro Herrera  wrote:

> On 2019-Jun-18, Oleksii Kliukin wrote:
> 
>> Sorry, I was confused, as I was looking only at
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=de87a084c0a5ac927017cd0834b33a932651cfc9
>> 
>> without taking your subsequent commit that silences compiler warnings at
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3da73d6839dc47f1f47ca57974bf28e5abd9b572
>> into consideration. With that commit, the danger is indeed in resetting the
>> skip mechanism on each jump and potentially causing deadlocks.
> 
> Yeah, I understand the confusion.
> 
> Anyway, as bugs go, this one seems pretty benign.  It would result in an
> unexplained deadlock, very rarely, and only for people who use a very
> strange locking pattern that includes (row-level) lock upgrades.  I
> think it also requires aborted savepoints too, though I don't rule out
> the possibility that there might be a way to reproduce this without
> that.
> 
> I pushed the patch again just now, with the new permutation.

Thank you very much for working on it and committing the fix!

Cheers,
Oleksii



Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

2019-06-18 Thread Oleksii Kliukin
Alvaro Herrera  wrote:

> On 2019-Jun-16, Oleksii Kliukin wrote:
> 
>> Alvaro Herrera  wrote:
>> 
>>> On 2019-Jun-14, Alvaro Herrera wrote:
>>> 
>>>> I think there are worse problems here.  I tried the attached isolation
>>>> spec.  Note that the only difference in the two permutations is that s0
>>>> finishes earlier in one than the other; yet the first one works fine and
>>>> the second one hangs until killed by the 180s timeout.  (s3 isn't
>>>> released for a reason I'm not sure I understand.)
>>> 
>>> Actually, those behaviors both seem correct to me now that I look
>>> closer.  So this was a false alarm.  In the code before de87a084c0, the
>>> first permutation deadlocks, and the second permutation hangs.  The only
>>> behavior change is that the first one no longer deadlocks, which is the
>>> desired change.
>>> 
>>> I'm still trying to form a case to exercise the case of skip_tuple_lock
>>> having the wrong lifetime.
>> 
>> Hm… I think it was an oversight on my part not to give skip_tuple_lock the
>> same lifetime as have_tuple_lock or first_time (also initializing it to
>> false at the same time). Even if it might not break anything in an obvious
>> way now, a backward jump to the l3 label will leave skip_tuple_lock
>> uninitialized, making it very dangerous for any future code that relies
>> on its value.
> 
> But that's not the danger ... with the current coding, it's initialized
> to false every time through that block; that means the tuple lock will
> never be skipped if we jump back to l3.  So the danger is that the first
> iteration sets the variable, then jumps back; second iteration
> initializes the variable again, so instead of skipping the lock, it
> takes it, causing a spurious deadlock.

Sorry, I was confused, as I was looking only at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=de87a084c0a5ac927017cd0834b33a932651cfc9

without taking your subsequent commit that silences compiler warnings at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3da73d6839dc47f1f47ca57974bf28e5abd9b572
into consideration. With that commit, the danger is indeed in resetting the
skip mechanism on each jump and potentially causing deadlocks.

Cheers,
Oleksii



Re: pgsql: Avoid spurious deadlocks when upgrading a tuple lock

2019-06-15 Thread Oleksii Kliukin
Alvaro Herrera  wrote:

> On 2019-Jun-14, Alvaro Herrera wrote:
> 
>> I think there are worse problems here.  I tried the attached isolation
>> spec.  Note that the only difference in the two permutations is that s0
>> finishes earlier in one than the other; yet the first one works fine and
>> the second one hangs until killed by the 180s timeout.  (s3 isn't
>> released for a reason I'm not sure I understand.)
> 
> Actually, those behaviors both seem correct to me now that I look
> closer.  So this was a false alarm.  In the code before de87a084c0, the
> first permutation deadlocks, and the second permutation hangs.  The only
> behavior change is that the first one no longer deadlocks, which is the
> desired change.
> 
> I'm still trying to form a case to exercise the case of skip_tuple_lock
> having the wrong lifetime.

Hm… I think it was an oversight on my part not to give skip_tuple_lock the
same lifetime as have_tuple_lock or first_time (also initializing it to
false at the same time). Even if it might not break anything in an obvious
way now, a backward jump to the l3 label will leave skip_tuple_lock
uninitialized, making it very dangerous for any future code that relies
on its value.
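
A minimal standalone sketch of the lifetime problem, with invented names
standing in for the heapam.c logic:

#include <stdbool.h>

/* invented stand-in for "this pass detects a lock upgrade" */
static bool
upgrade_detected(int pass)
{
    return pass == 0;
}

static void
lock_tuple_sketch(void)
{
    bool    have_tuple_lock = false;    /* declared once, survives jumps */
    int     pass = 0;

l3:
    {
        bool    skip_tuple_lock = false;    /* re-initialized on every jump */

        if (upgrade_detected(pass))
        {
            skip_tuple_lock = true;     /* remembered only until... */
            pass++;
            goto l3;                    /* ...this jump resets it */
        }

        /*
         * Second pass: skip_tuple_lock is false again, so the tuple lock
         * would be taken after all, risking the spurious deadlock.
         */
        (void) skip_tuple_lock;
        (void) have_tuple_lock;
    }
}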

> The fact that both permutations behave differently, even though the
> only difference is where s0 commits relative to the s3_share step, is an
> artifact of our unusual tuple locking implementation.

Cheers,
Oleksii



Re: upgrades in row-level locks can deadlock

2019-06-14 Thread Oleksii Kliukin
Alvaro Herrera  wrote:

> On 2019-Jun-13, Alvaro Herrera wrote:
> 
>> On 2019-Jun-13, Oleksii Kliukin wrote:
>> 
>>> Makes sense. For the symmetry I have included those that perform lock
>>> upgrades in one session and those that don’t, while the other sessions
>>> acquire locks, do updates or deletes. For those that don’t upgrade locks the
>>> test checks that the locks are acquired in the correct order.
>> 
>> Thanks for the updated patch!  I'm about to push to branches 9.6-master.
>> It applies semi-cleanly (only pgindent-maturity whitespace conflicts).
> 
> Done, thanks for the report and patch!
> 
> I tried hard to find a scenario that this patch breaks, but couldn't
> find anything.

Thank you very much for reviewing and committing it!

Cheers,
Oleksii



Re: upgrades in row-level locks can deadlock

2019-06-13 Thread Oleksii Kliukin
Hello,

Alvaro Herrera  wrote:

> On 2019-Jun-12, Oleksii Kliukin wrote:
> 
>> Thank you! I can make it even simpler; s1 just acquires a for-share lock, s3
>> gets a for-update one, and s2 takes a for-share lock first and then tries to
>> acquire a for-update one; once s1 finishes, s3 deadlocks.
> 
> Cool.  I think it would be worthwhile to include a number of reasonable
> permutations instead of just one, and make sure they all work correctly.
> I don't think we need to include all possible permutations, just a few.
> I think we need at least enough permutations to cover the two places of
> the code that are modified by the patch, for both values of
> have_current_xid (so there should be four permutations, I think).

Makes sense. For the symmetry I have included those that perform lock
upgrades in one session and those that don’t, while the other sessions
acquire locks, do updates or deletes. For those that don’t upgrade locks the
test checks that the locks are acquired in the correct order.

> Please don't simplify the table name to just "t" -- the reason I used
> another name is that we want these tests to be able to run concurrently
> at some point; ref.
> https://postgr.es/m/20180124231006.z7spaz5gkzbdvob5@alvherre.pgsql

Alright, thanks.

> 
>>> But semantically, I wonder if your transactions are correct.  If you
>>> intend to modify the row in s1 and s2, shouldn't you be acquiring FOR NO
>>> KEY UPDATE lock instead?  I don't see how can s1 and s2 coexist
>>> peacefully.  Also, can your Y transaction use FOR NO KEY UPDATE instead
>>> .. unless you intend to delete the tuple in that transaction?
>> 
>> It is correct. I wanted to make sure jobs that acquire for key share lock
>> can run concurrently most of the time; they only execute one update at the
>> end of the job, and it is just to update the last run timestamp.
> 
> I see.  Under READ COMMITTED it works okay, I suppose.
> 
>>> I'm mulling over your proposed fix.  I don't much like the idea that
>>> DoesMultiXactIdConflict() returns that new boolean -- seems pretty
>>> ad-hoc -- but I don't see any way to do better than that ...  (If we get
>>> down to details, DoesMultiXactIdConflict needn't initialize that
>>> boolean: better let the callers do it.)
>> 
>> I am also not happy about the new parameter to DoesMultiXactIdConflict, but
>> calling a separate function to check whether the current transaction is
>> among the multixact members would mean doing the job of
>> DoesMultiXactIdConflict twice. I am open to suggestions on how to make it
>> nicer.
> 
> Yeah, I didn't find anything better either.  We could make things more
> complex by resolving the multixact once and then extracting the two
> separate bits of information that we need from it ... but it ends up
> being uglier and messier for no real gain.  So let's go with your
> original idea.

OK, v4 is attached. I have addressed your suggestion for the isolation
tests and added a paragraph to README.tuplock explaining why we skip
LockTuple to avoid a deadlock in the session that upgrades its lock.

Cheers,
Oleksii



deadlock_on_row_lock_upgrades_v4.diff
Description: Binary data


Re: upgrades in row-level locks can deadlock

2019-06-12 Thread Oleksii Kliukin
Oleksii Kliukin  wrote:

> Hi,
> 
> Alvaro Herrera  wrote:
> 
>> On 2019-May-22, Oleksii Kliukin wrote:
>> 
>>> X1: select id from job where name = 'a' for key share;
>>> Y: select id from job where name = 'a' for update; -- starts waiting for X1
>>> X2: select id from job where name = 'a' for key share;
>>> X1: update job set name = 'b' where id = 1;
>>> X2: update job set name = 'c' where id = 1; -- starts waiting for X1
>>> X1: rollback;
>>> 
>>> At this point, Y is terminated by the deadlock detector:
>> 
>> Yeah, this seems undesirable in general terms.  Here's a quick
>> isolationtester spec that reproduces the problem:
>> 
>> setup {
>>  drop table if exists tlu_job;
>>  create table tlu_job (id integer primary key, name text);
>>  insert into tlu_job values (1, 'a');
>> }
>> 
>> teardown {
>>  drop table tlu_job;
>> }
>> 
>> session "s1"
>> setup{ begin; }
>> step "s1_keyshare"   { select id from tlu_job where name = 'a' for key 
>> share; }
>> step "s1_update" { update tlu_job set name = 'b' where id = 1; }
>> step "s1_rollback"   { rollback; }
>> 
>> session "s2"
>> setup{ begin; }
>> step "s2_keyshare"   { select id from tlu_job where name = 'a' for key 
>> share; }
>> step "s2_update" { update tlu_job set name = 'c' where id = 1; }
>> step "s2_commit" { commit; }
>> 
>> session "s3"
>> setup{ begin; }
>> step "s3_forupd" { select id from tlu_job where name = 'a' for update; }
>> teardown { commit; }
> 
> Thank you! I can make it even simpler; s1 just acquires a for-share lock, s3
> gets a for-update one, and s2 takes a for-share lock first and then tries to
> acquire a for-update one; once s1 finishes, s3 deadlocks.
> 
>> But semantically, I wonder if your transactions are correct.  If you
>> intend to modify the row in s1 and s2, shouldn't you be acquiring FOR NO
>> KEY UPDATE lock instead?  I don't see how can s1 and s2 coexist
>> peacefully.  Also, can your Y transaction use FOR NO KEY UPDATE instead
>> .. unless you intend to delete the tuple in that transaction?
> 
> It is correct. I wanted to make sure jobs that acquire for key share lock
> can run concurrently most of the time; they only execute one update at the
> end of the job, and it is just to update the last run timestamp.
> 
>> I'm mulling over your proposed fix.  I don't much like the idea that
>> DoesMultiXactIdConflict() returns that new boolean -- seems pretty
>> ad-hoc -- but I don't see any way to do better than that ...  (If we get
>> down to details, DoesMultiXactIdConflict needn't initialize that
>> boolean: better let the callers do it.)
> 
> I am also not happy about the new parameter to DoesMultiXactIdConflict, but
> calling a separate function to check whether the current transaction is
> among the multixact members would mean doing the job of
> DoesMultiXactIdConflict twice. I am open to suggestions on how to make it
> nicer.
> 
> Attached is a slightly modified patch that avoids initializing
> has_current_xid inside DoesMultiXactIdConflict and should apply cleanly to
> the current master.

And here is the v3 that also includes the isolation test I described above
(quoting my previous message in full as I accidentally sent it off-list,
sorry about that).

Cheers,
Oleksii


deadlock_on_row_lock_upgrades_v3.diff
Description: Binary data


upgrades in row-level locks can deadlock

2019-05-22 Thread Oleksii Kliukin

Hello,

I have recently observed a deadlock on one of our production servers related
to locking only a single row in a job table. There were two functions involved
in the deadlock: the first one acquires a “for key share” lock on the row that
represents the job it works on and subsequently updates it with the job’s end
time (we need multiple jobs to be operating on a single row concurrently;
that’s why there is a “for key share” lock). The other function starts by
acquiring the “for update” lock on the job row and then performs actions that
should not be run in parallel with other jobs.

The deadlock can be easily reproduced with the following statements. The
queries run against a table job (id integer primary key, name text) with a
single row of (1,'a').

X1: select id from job where name = 'a' for key share;
Y: select id from job where name = 'a' for update; -- starts waiting for X1
X2: select id from job where name = 'a' for key share;
X1: update job set name = 'b' where id = 1;
X2: update job set name = 'c' where id = 1; -- starts waiting for X1
X1: rollback;

At this point, Y is terminated by the deadlock detector:

"deadlock detected",
Process 53937 waits for ShareLock on transaction 488; blocked by process 53953.
Process 53953 waits for ExclusiveLock on tuple (0,1) of relation 16386 of database 12931; blocked by process 53937.
Process 53937: select id from job where name = 'a' for update;
Process 53953: update job set name = 'c' where id = 1;",

The deadlock is between X2 and Y. Y waits for X2 to finish, as X2 holds a "key
share" lock, incompatible with "for update" that Y attempts to acquire. On the
other hand, X2 needs to acquire the row lock to perform the update, and that
is a two-phase process: first, get the tuple lock and then wait for
conflicting transactions to finish, releasing the tuple lock afterward. X2
tries to acquire the tuple lock, but it is owned by Y. PostgreSQL detects the
deadlock and terminates Y.

Such a deadlock only occurs when three or more sessions locking the same row
are present and the lock is upgraded in at least one session. With only two
sessions the upgrade does not go through the lock manager, as there are no
conflicts with locks stored in the tuple. 

That gave me an idea on how to change PostgreSQL to avoid deadlocking under
the condition above. When detecting the lock upgrade from the multixact, we
can avoid acquiring the tuple lock; however, we should still wait for the
multixact members that hold conflicting locks, to avoid acquiring incompatible
ones.
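
In pseudocode, the proposed control flow would look roughly like this (the
helper names are invented; the actual change lives in heap_lock_tuple and
friends):

#include <stdbool.h>

/* invented helpers standing in for the real heapam.c logic */
static bool current_xact_is_multixact_member(void) { return true; }
static void acquire_tuple_lock(void) { }
static void wait_for_conflicting_multixact_members(void) { }

static void
lock_upgrade_sketch(void)
{
    /* upgrading: we already hold a weaker row lock inside the multixact */
    bool    skip_tuple_lock = current_xact_is_multixact_member();

    /*
     * Skip the heavyweight tuple lock on upgrade (taking it is what
     * creates the deadlock), but still wait for multixact members that
     * hold conflicting locks before acquiring the stronger row lock.
     */
    if (!skip_tuple_lock)
        acquire_tuple_lock();
    wait_for_conflicting_multixact_members();
}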

The patch is attached. I had to tweak heap_update and heap_delete alongside
heap_lock_tuple, as they acquire row locks as well. I am not very happy with
overloading DoesMultiXactIdConflict with another job, checking whether the
current transaction ID is among the multixact members; perhaps it is worth
having a separate function for this. We can figure this out if we agree this is
the problem that needs to be solved and on the solution. The other possible
objection is related to the statement from README.tuplock that we need to go
through the lock manager to avoid starving waiting exclusive-lockers. Since
this patch omits the tuple lock only when the lock upgrade happens, it does
limit the starvation condition to the cases when the lock compatible with the
one the waiting process asks for is acquired first and then gets upgraded to
the incompatible one. Since under present conditions the same operation will
likely deadlock and cancel the exclusive waiter altogether, I don't see this
as a strong objection.

Cheers,
Oleksii



deadlock_on_row_lock_upgrades_v1.diff
Description: Binary data




Re: Per-tablespace autovacuum settings

2019-05-06 Thread Oleksii Kliukin
Robert Haas  wrote:

> On Thu, Apr 25, 2019 at 12:36 PM Oleksii Kliukin  wrote:
>> - Fallbacks to autovacuum parameters in another scope. Right now in the
>> absence of the per-table and per-tablespace autovacuum parameters the code
>> uses the ones from the global scope. However, if only some of the reloptions
>> are set on a per-table level (i.e. none of the autovacuum related ones), we
>> assume defaults for the rest of reloptions without consulting the lower
>> level (i.e .per-tablespace options). This is so because we don’t have the
>> mechanism to tell whether the option is set to its default value (some of
>> them use -1 to request the fallback to the outer level, but for some it’s
>> not possible, i.e. autovacuum_enabled is just a boolean value).
> 
> That sounds like it's probably not acceptable?

Yes, I think it would be inconsistent. However, it looks like all the
options from AutoVacOpts other than autovacuum_enabled are set to -1 by
default. This can be used to tell whether the option is set to its default
value. For autovacuum_enabled we don’t care much: it’s true by default and
it’s a safe choice (even if the global autovacuum is off, enabling the
per-table or per-tablespace one is a no-op).
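
A sketch of the fallback rule this enables (a hypothetical helper for
illustration, not code from the patch):

/*
 * Hypothetical fallback helper, for illustration only: -1 means "not set
 * at this scope, consult the next one".
 */
static int
effective_autovac_option(int per_table, int per_tablespace, int guc_value)
{
    if (per_table >= 0)
        return per_table;
    if (per_tablespace >= 0)
        return per_tablespace;
    return guc_value;
}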

I will update the patch. 

Cheers,
Oleksii



Re: Per-tablespace autovacuum settings

2019-04-25 Thread Oleksii Kliukin
Hello,

Oleksii Kliukin  wrote:

> Tom Lane  wrote:
> 
> 
>> I don't know how to make this better, but I wish we'd take a step
>> back and think about it rather than just accreting more and more
>> complexity.
> 
> I am willing to do the refactoring if necessary; is there any particular
> place in the code that is indicative of the issue?

I’ve managed to return to that and here’s the first iteration of the patch
to add autovacuum parameters to tablespaces. I tried to make it as simple as
possible and didn’t make any decisions I found questionable, opting to
discuss them here instead. Some of them are probably linked to the kind of
issues mentioned by Tom upthread.

Things worth mentioning are:

- Fallbacks to autovacuum parameters in another scope. Right now in the
absence of the per-table and per-tablespace autovacuum parameters the code
uses the ones from the global scope. However, if only some of the reloptions
are set on a per-table level (i.e. none of the autovacuum related ones), we
assume defaults for the rest of reloptions without consulting the lower
level (i.e., per-tablespace options). This is so because we don’t have the
mechanism to tell whether the option is set to its default value (some of
them use -1 to request the fallback to the outer level, but for some it’s
not possible, i.e. autovacuum_enabled is just a boolean value).

- There are no separate per-tablespace settings for TOAST tables. I couldn't
find a strong case for setting all TOAST autovacuum options in a tablespace
to the same value that is distinct from the corresponding settings for the
regular tables. The difficulty of implementing TOAST options lies in the
fact that we strip the namespace part from the option name before storing it
in reloptions. Changing that would break compatibility with previous
versions and require another step for pg_upgrade, I don’t think it is worth
the trouble. We could also come up with a separate set of tablespace options,
e.g. prefixed with “toast_”, but that seems an ugly solution to a problem
that doesn’t seem real. As a result, if per-tablespace autovacuum options
are set and there are no table-specific TOAST options, the TOAST table will
inherit autovacuum options from the tablespace, rather than taking them from
the regular table it is attached to.

- There are a few relatively recently introduced options
(vacuum_index_cleanup, vacuum_truncate and
vacuum_cleanup_index_scale_factor) that I haven’t incorporated into the
per-tablespace options, as they are not part of autovacuum options and I see
no use for setting them on a tablespace level. This can be changed easily if
people think otherwise.

The patch is attached. It has a few tests and no documentation; I will add
both once we agree on how the end result should look.


Kind regards,
Oleksii Kliukin


v1-per-tablespace-autovacuum-parameters.patch
Description: Binary data





Re: Prepared transaction releasing locks before deregistering its GID

2019-02-22 Thread Oleksii Kliukin
Hi,

Michael Paquier  wrote:

> On Wed, Feb 20, 2019 at 11:50:42AM +0100, Oleksii Kliukin wrote:
>> RecoverPreparedTransactions calls ProcessRecords with
>> twophase_recover_callbacks right after releasing the TwoPhaseStateLock, so I
>> think lock_held should be false here (matching the similar call of
>> TwoPhaseGetDummyProc at lock_twophase_recover).
> 
> Indeed.  That would be a bad idea.  We don't actually stress this
> routine in 009_twophase.pl or the assertion I added would have blown
> up immediately.  So I propose to close the gap, and the updated patch
> attached adds dedicated tests causing the problem you spotted to be
> triggered (soft and hard restarts with 2PC transactions including
> DDLs).  The previous version of the patch and those tests cause the
> assertion to blow up, failing the tests.

Thank you for updating the patch, and sorry for the delay; it looks good to
me, and the tests pass on my system.

>> Since the patch touches TwoPhaseGetDummyBackendId, let’s fix the comment
>> header to mention it instead of TwoPhaseGetDummyProc
> 
> Not sure I understand the issue you are pointing out here.  The patch
> adds an extra argument to both TwoPhaseGetDummyProc() and
> TwoPhaseGetDummyBackendId(), and the headers of both functions
> document the new argument.

Just this:

@@ -844,17 +851,18 @@ TwoPhaseGetGXact(TransactionId xid)
 }

 /*
- * TwoPhaseGetDummyProc
+ * TwoPhaseGetDummyBackendId
  *  Get the dummy backend ID for prepared transaction specified by XID

> 
> One extra trick I have been using for testing this patch is the
> following:
> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -816,13 +816,6 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
> 
>Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
> 
> -   /*
> -* During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
> -* repeatedly for the same XID.  We can save work with a simple cache.
> -*/
> -   if (xid == cached_xid)
> -   return cached_gxact;
> 
> This has the advantage to check more aggressively for lock conflicts,
> causing the tests to deadlock if the flag is not correctly set in the
> worst case scenario.

Nice, thank you. I ran all my tests with it and found no further issues.

Regards,
Oleksii Kliukin




Re: Prepared transaction releasing locks before deregistering its GID

2019-02-20 Thread Oleksii Kliukin
Michael Paquier  wrote:

> 
> Attached is an updated patch.  Thanks for the feedback. 

@@ -1755,7 +1755,7 @@ void
 multixact_twophase_recover(TransactionId xid, uint16 info,
   void *recdata, uint32 len)
 {
-   BackendId   dummyBackendId = TwoPhaseGetDummyBackendId(xid);
+   BackendId   dummyBackendId = TwoPhaseGetDummyBackendId(xid, true);
MultiXactId oldestMember;
 
RecoverPreparedTransactions calls ProcessRecords with
twophase_recover_callbacks right after releasing the TwoPhaseStateLock, so I
think lock_held should be false here (matching the similar call of
TwoPhaseGetDummyProc at lock_twophase_recover).

Since the patch touches TwoPhaseGetDummyBackendId, let’s fix the comment
header to mention it instead of TwoPhaseGetDummyProc.

> Now, it seems
> to me that the potential ABI breakage argument (which can be solved by
> introducing an extra routine, which means more conflicts to handle
> when back-patching 2PC stuff), and the time it took to find the issue
> are pointing out that we should patch only HEAD.

Sounds reasonable.

Regards,
Oleksii Kliukin




Re: Prepared transaction releasing locks before deregistering its GID

2019-02-19 Thread Oleksii Kliukin
Hi,

Oleksii Kliukin  wrote:
> 
> The approach looks good to me. Surprisingly, I saw no stalled backends
> because of the double acquisition of the lock at TwoPhaseGetGXact once I
> put a simple TwoPhaseStateLock acquisition right before the
> "gxact->valid = false” line; I will test your patch and post the outcome.

I gave it a spin on the same VM host that was shown to reproduce the issue
consistently and observed neither 'identifier already in use' errors nor any
locking issues over a few dozen runs, so it looks good to me.

That was HEAD, but since FinishPreparedTransaction seems to be identical
there and on the back branches it should work for PG 10 and 11 as well.

Regards,
Oleksii Kliukin




Prepared transaction releasing locks before deregistering its GID

2019-02-18 Thread Oleksii Kliukin
Hello,

We have an app that copies some metrics between predefined tables on the
source and destination clusters, truncating the table at the source
afterward.

The app locks both the source and the destination table at the beginning and
then, once the copy concludes, prepares a transaction on each of the affected
hosts and finally commits it. The naming schema for the prepared transaction
involves the source and the destination table; while it doesn’t guarantee
uniqueness if multiple copy processes run on the same host, the possibility
of that happening should be prevented by the locks held on the tables
involved.

Or so we thought. From time to time, our cron jobs reported a transaction
identifier already in use. After digging into this, my colleague Ildar Musin
has produced a simple pgbench script to reproduce the issue:


$ cat custom.sql
begin;
lock abc in exclusive mode;
insert into abc values ((random() * 1000)::int);

prepare transaction 'test transaction';
commit prepared 'test transaction';


pgbench launched with:

pgbench -T 20 -c 10 -f custom.sql postgres

(after creating the table abc) produces a number of complaints:

ERROR: transaction identifier "test transaction" is already in use. 

I could easily reproduce that on Linux, but not Mac OS, with both Postgres
10.7 and 11.2.

That looks like a race condition to me. What happens is that another
transaction with a name identical to the running one's can start and proceed
to the prepare phase while the original one commits, failing at the end
instead of waiting for the original one to finish.

By looking at the source code of FinishPreparedTransaction() I can see the
RemoveGXact() call, which removes the prepared transaction from
TwoPhaseState->prepXacts. It is placed at the very end of the function,
after the post-commit callbacks that clear out the locks held by the
transaction. Those callbacks are not guarded by the TwoPhaseStateLock,
resulting in a possibility for a concurrent session to proceed with
MarkAsPreparing after acquiring the locks released by them.
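
Here is a toy model of that window, with invented names; releasing the locks
(step 2) before deregistering the GID (step 3) is what lets a concurrent
PREPARE TRANSACTION slip in:

#include <stdio.h>
#include <string.h>

/* toy state standing in for TwoPhaseState and the lock manager */
static char registered_gid[64] = "test transaction";
static int  table_locked = 1;

static void
finish_prepared_transaction(void)
{
    /* 1. transaction committed (elided) */
    table_locked = 0;           /* 2. post-commit callbacks drop the locks */

    /*
     * <-- window: another session can now lock the table and PREPARE
     *     under the same GID, which is still registered at this point
     */
    registered_gid[0] = '\0';   /* 3. RemoveGXact() deregisters the GID */
}

static void
mark_as_preparing(const char *gid)
{
    if (strcmp(registered_gid, gid) == 0)
        printf("ERROR: transaction identifier \"%s\" is already in use\n",
               gid);
}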

I couldn’t find any documentation on the expected outcome in cases like
this, so I assume it might not be a bug, but an undocumented behavior.

Should I go ahead and produce a patch to put a note in the description of
COMMIT PREPARED, or is there any interest in changing the behavior to avoid
such conflicts altogether (perhaps by holding the lock throughout the
cleanup phase)?

Regards,
Oleksii Kliukin




Re: Per-tablespace autovacuum settings

2019-02-14 Thread Oleksii Kliukin
Tom Lane  wrote:

> Oleksii Kliukin  writes:
>> Is there any interest in making autovacuum parameters available on a
>> tablespace level in order to apply those to all vacuumable objects in the
>> tablespace?
> 
> I understand what you want to accomplish, and it doesn't seem
> unreasonable.  But I just want to point out that the situation with
> vacuuming parameters is on the edge of unintelligible already; adding
> another scope might push it over the edge.  In particular there's no
> principled way to decide whether an autovacuum parameter set at an outer
> scope should override a plain-vacuum parameter set at a narrower scope.

My naive understanding is that vacuum and autovacuum should decide
independently which scope applies, coming from the most specific (per-table
for autovacuum, per-DB for vacuum) to the broader scopes, ending with
configuration parameters at the outermost scope. Both *_cost_limit and
*_cost_delay should be taken from the current vacuum scope only if effective
autovacuum settings yield -1.

> And it's really questionable which of database-wide and tablespace-wide
> should be seen as a narrower scope in the first place.

AFAIK we don’t allow setting autovacuum options per-database, nor do I
suggest enabling plain vacuum to be configured per-tablespace; as a result,
we won’t be deciding between databases and tablespaces, unless we want to do
cross-lookups from autovacuum to the outer scope of plain vacuum options
before considering autovacuum’s own outer scope, and I don’t see any reason
to do that.

> I don't know how to make this better, but I wish we'd take a step
> back and think about it rather than just accreting more and more
> complexity.

I am willing to do the refactoring if necessary; is there any particular
place in the code that is indicative of the issue?

Regards,
Oleksii Kliukin




Re: Per-tablespace autovacuum settings

2019-02-14 Thread Oleksii Kliukin
Andres Freund  wrote:

> Hi,
> 
> On 2019-02-14 17:56:17 +0100, Oleksii Kliukin wrote:
>> Is there any interest in making autovacuum parameters available on a
>> tablespace level in order to apply those to all vacuumable objects in the
>> tablespace?
>> 
>> We have a set of tables running on ZFS, where autovacuum does almost no good
>> to us (except for preventing anti-wraparound) due to the nature of ZFS (FS
>> fragmentation caused by copy-on-write leads to sequential scans doing random
>> access) and the fact that our tables there are append-only. Initially, the
>> team in charge of the application just disabled autovacuum globally, but
>> that led to huge system catalog bloat.
>> 
>> At present, we have to re-enable autovacuum globally and then disable it
>> per-table using table storage parameters, but that is inelegant and requires
>> doing it once for existing tables and modifying the script that periodically
>> creates new ones (the whole system is a Postgres-based replacement of an
>> ElasticSearch cluster and we have to create new partitions regularly).
> 
> Won't that a) lead to periodic massive anti-wraparound sessions? b)
> prevent any use of index only scans?

The wraparound is hardly an issue there, as the data is transient and only
exists for 14 days (I think the entire date-based partition is dropped;
that’s how we ended up with pg_class catalog bloat). The index-only scan can
be an issue, although, IIRC, there is some manual vacuum that runs from time
to time, perhaps following your advice below.

> ISTM you'd be better off running vacuum rarely, with large
> thresholds. That way it'd do most of the writes in one pass, hopefully
> leading to less fragementation, and it'd set the visibilitymap bits to
> prevent further need to touch those. By doing it only rarely, vacuum
> should process pages sequentially, reducing the fragmentation.
> 
> 
>> Grouping tables by tablespaces for the purpose of autovacuum configuration
>> seems natural, as tablespaces are often placed on separate filesystems/devices
>> that may require changing how often autovacuum runs, making it more or less
>> aggressive depending on the I/O performance or require disabling it
>> altogether as in my example above. Furthermore, given that we allow
>> cost-based options per-tablespace the infrastructure is already there and
>> the task is mostly to teach autovacuum to look at tablespaces in addition to
>> the relation storage options (in case of a conflict, relation options should
>> always take priority).
> 
> While I don't buy the reasoning above, I think this'd be useful for
> other cases.

Even if we don’t want to disable autovacuum completely, we might want to
make it much less frequent by increasing the thresholds or costs/delays to
reduce the I/O strain for a particular tablespace.

Regards,
Oleksii Kliukin




Per-tablespace autovacuum settings

2019-02-14 Thread Oleksii Kliukin
Hello,

Is there any interest in making autovacuum parameters available on a
tablespace level in order to apply those to all vacuumable objects in the
tablespace?

We have a set of tables running on ZFS, where autovacuum does almost no good
to us (except for preventing anti-wraparound) due to the nature of ZFS (FS
fragmentation caused by copy-on-write leads to sequential scans doing random
access) and the fact that our tables there are append-only. Initially, the
team in charge of the application just disabled autovacuum globally, but
that led to huge system catalog bloat.

At present, we have to re-enable autovacuum globally and then disable it
per-table using table storage parameters, but that is inelegant and requires
doing it once for existing tables and modifying the script that periodically
creates new ones (the whole system is a Postgres-based replacement of an
ElasticSearch cluster and we have to create new partitions regularly).

Grouping tables by tablespaces for the purpose of autovacuum configuration
seems natural, as tablespaces are often placed on separate filesystems/devices
that may require changing how often autovacuum runs, making it more or less
aggressive depending on the I/O performance or require disabling it
altogether as in my example above. Furthermore, given that we allow
cost-based options per-tablespace the infrastructure is already there and
the task is mostly to teach autovacuum to look at tablespaces in addition to
the relation storage options (in case of a conflict, relation options should
always take priority).


Regards,
Oleksii Kliukin




Re: Connection slots reserved for replication

2019-02-12 Thread Oleksii Kliukin



> On 12. Feb 2019, at 05:12, Michael Paquier  wrote:
> 
> On Mon, Feb 11, 2019 at 08:57:41PM -0700, Kevin Hale Boyes wrote:
>> I think there's a small problem with the commit.
>> The position of xlrec.max_wal_senders (line 117) should be below
>> max_worker_processes.
> 
> Fixed, thanks!

Wonderful, thank you very much for taking it to commit, and to everyone
involved for getting it through!

Cheers,
Oleksii


Re: Connection slots reserved for replication

2019-02-07 Thread Oleksii Kliukin


> On 7. Feb 2019, at 07:51, Michael Paquier  wrote:
> 
> On Sat, Feb 02, 2019 at 10:35:20AM +0900, Michael Paquier wrote:
>> Oh, I have just noticed this patch when doing my vacuum homework.  I
>> have just added my name as committer, just wait a bit until the CF is
>> closed before I begin looking at it..
> 
> So, I have looked at this thread and the latest patch, and the thing
> looks in good shape.  Nice job by the authors and the reviewers.  It
> is a bit of a pain for users to hope that max_connections would be
> able to handle replication connections when needed, which can cause
> backups to not happen.  Using a separate GUC while we have
> max_wal_senders here to do the job is also not a good idea, so the
> approach of the patch is sound.  And on top of that, dependencies
> between GUCs get reduced.

Thank you for picking it up!

> 
> I have spotted one error though: the patch does not check if changing
> max_wal_senders overflows MAX_BACKEND, and we need to reflect a proper
> calculation into check_autovacuum_max_workers,
> check_max_worker_processes and check_maxconnections.

Good catch, thank you. 

I’ve noticed you returned the 'see max_connections' comment there. As noticed
previously in this thread by Kyotaro Horiguchi, it’s not clear what exactly we
are supposed to see there; perhaps it makes sense to elaborate (besides, the
comment on max_wal_senders at guc.c:382 hints that max_wal_senders depends on
max_connections, which is not true anymore).

> 
> One thing is that the slot position calculation gets a bit more
> complicated with the new slot set for WAL senders, still the code is
> straight-forward to follow so that's not really an issue in my
> opinion.  The potential backward-incompatible issue issue of creating
> a standby with lower max_wal_senders value than the primary is also
> something we can live with as that's simple enough to address, and I'd
> like to think that most users prefer inheriting the parameter from the
> primary to ease failover flows.

+1

> 
> It seems to me that it would be a good idea to have an
> autovacuum-worker-related message in the context of InitProcess() for
> a failed backend creation, but we can deal with that later if
> necessary.

Hm… I am wondering how the autovacuum workers can run out of slots there?

> 
> With all that reviewed, I finish with the attached that I am
> comfortable enough to commit.  XLOG_PAGE_MAGIC is bumped as well, as a
> reminder because xl_parameter_change gets an upgrade, and I am most
> likely going to forget it.
> 
> Please let me know if you have comments.  I am still planning to do an
> extra pass on it.

Apart from the comment-related notes above, your changes look good to me.
Thank you.

Cheers,
Oleksii


Re: Connection slots reserved for replication

2019-01-31 Thread Oleksii Kliukin



> On 31. Jan 2019, at 11:50, Petr Jelinek  wrote:
> 
> On 31/01/2019 11:25, Oleksii Kliukin wrote:
>> 
>> 
>>> On 25. Jan 2019, at 18:37, Oleksii Kliukin <al...@hintbits.com> wrote:
>>> 
>>> Hello,
>>> 
>>>> On 24. Jan 2019, at 13:47, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>>> 
>>> Thank you for the review. I took the liberty of addressing it in v9.
>> 
>> 
>> So, given there is no further feedback or suggestions, can we set it to
>> RFC?
> 
> I vote yes.

Thanks, set it to RFC.

Cheers,
Oleksii


Re: Connection slots reserved for replication

2019-01-31 Thread Oleksii Kliukin


> On 25. Jan 2019, at 18:37, Oleksii Kliukin  wrote:
> 
> Hello,
> 
>> On 24. Jan 2019, at 13:47, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> 
> Thank you for the review. I took the liberty of addressing it in v9.


So, given there is no further feedback or suggestions, can we set it to RFC?

Cheers,
Oleksii

Re: pg_basebackup, walreceiver and wal_sender_timeout

2019-01-29 Thread Oleksii Kliukin


> On 29. Jan 2019, at 10:45, Magnus Hagander  wrote:
> 
> On Tue, Jan 29, 2019 at 6:19 AM Michael Paquier  wrote:
> On Mon, Jan 28, 2019 at 02:00:59PM +0100, Alex Kliukin wrote:
> > While reading the doc page for the pg_basebackup, I've been confused
> > by the fact that it says WAL files will be written to .tarballs
> > (either base.tar or pg_wal.tar) when pg_basebackup is instructed to
> > stream WALs alongside the backup itself. I think it makes sense to
> > elaborate that it only happens when tar format is specified (doc
> > patch is attached).
> 
> Agreed.  The current wording can be confusing depending on the format,
> and your suggestion looks right.  Any opinions from others?
> 
> Agreed, definitely confusing.
> 
> Since you also agreed on it, I went ahead and pushed (with backpatch).

Thanks a lot (and to Michael as well for looking into it)!

Cheers,
Oleksii

Re: Connection slots reserved for replication

2019-01-25 Thread Oleksii Kliukin
Hello,

> On 24. Jan 2019, at 13:47, Kyotaro HORIGUCHI 
>  wrote:

Thank you for the review. I took the liberty of addressing it in v9.

> 
> Documentation looks fine for me. By the way, a comment for the
> caller-site of CheckRequreParameterValues() in xlog.c looks
> somewhat stale.
> 
>> /* Check to see if any changes to max_connections give problems */
>> CheckRequiredParameterValues();
> 
> may be better something like this?:
> 
>> /* Check to see if any parameter change gives a problem on replication */

I changed it to "Check if any parameter change gives a problem on recovery”, as 
I think it is independent of the [streaming] replication, but I still don’t 
like the wording, as it just duplicate the comment at the definition of 
CheckRequiredParameterValues. Maybe remove the comment altogether?

> 
> 
> In postinit.c:
>>   /*
>>* The last few connection slots are reserved for superusers.
>>*/
>>   if ((!am_superuser && !am_walsender) &&
>>   ReservedBackends > 0 &&
> 
> This is forgetting about explaining walsenders.
> 
>> The last few connection slots are reserved for
>> superusers. Replication connections don't share the same slot
>> pool.
> 
> Or something?

I changed it to

+* The last few connection slots are reserved for superusers.
+* Replication connections are drawn from a separate pool and
+* not limited by max_connections or superuser_reserved_connections.


> 
> And the parentheses enclosing "!am_superuser..walsender" seems no
> longer needed.
> 
> 
> In guc.c:
> - /* see max_connections and max_wal_senders */
> + /* see max_connections */
> 
> I don't understand for what reason we should see max_connections
> just above. (Or even max_wal_senders) Do we need this comment?
> (There're some other instances, but they won't be for this patch.)

I don’t understand what is it pointing to as well, so I’ve removed it.

> 
> 
> In pg_controldata.c:
> + printf(_("max_wal_senders setting: %d\n"),
> +ControlFile->max_wal_senders);
>   printf(_("max_worker_processes setting: %d\n"),
>  ControlFile->max_worker_processes);
>   printf(_("max_prepared_xacts setting:   %d\n"),
> 
> The added item seems to want some additional spaces.

Good catch, fixed.

Attached is v9. I also bumped up the PG_CONTROL_VERSION to 1200 per the prior 
comment by Robert. 

Cheers,
Oleksii



replication_reserved_connections_v9.patch
Description: Binary data


Re: Connection slots reserved for replication

2019-01-15 Thread Oleksii Kliukin
On Wed, Jan 2, 2019, at 12:36, Alexander Kukushkin wrote:
> Hi,
> 
> On Wed, 2 Jan 2019 at 12:17, Oleksii Kliukin  wrote:
> 
> > Thank you. Attached is the new version (calling it v8) with the following 
> > changes to the documentation:
> 
> Thank you for jumping on it. Your changes look good to me.
> 
> 
> I was also thinking about changing the value in PG_CONTROL_VERSION,
> because we added the new field into the control file, but decided to
> leave this change to committer.

Sounds reasonable; I am not sure what the 'official policy' is on this.

As there have been no further reviews or issues found for almost 2 weeks since
the last one, should we move it to RFC?

Cheers,
Oleksii



Re: Connection slots reserved for replication

2019-01-02 Thread Oleksii Kliukin


On Wed, Jan 2, 2019, at 11:02, Petr Jelinek wrote:

> The patch generally looks good, but the documentation of max_wal_senders
> needs updating. In config.sgml we say:
> 
> > WAL sender processes count towards the total number
> > of connections, so this parameter's value must be less than
> >  minus
> > .
> 
> This is now misleading.

Thank you. Attached is the new version (calling it v8) with the following changes to 
the documentation:

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e94b305add..6634ce8cdc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -697,8 +697,7 @@ include_dir 'conf.d'
 

 The default value is three connections. The value must be less
-than max_connections minus
-.
+than max_connections.
 This parameter can only be set at server start.

   
@@ -3453,24 +3452,25 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" 
"%p"'  # Windows



-Specifies the maximum number of concurrent connections from
-standby servers or streaming base backup clients (i.e., the
-maximum number of simultaneously running WAL sender
-processes). The default is 10. The value 0 means replication is
-disabled. WAL sender processes count towards the total number
-of connections, so this parameter's value must be less than
- minus
-.
-Abrupt streaming client disconnection might leave an orphaned
-connection slot behind until
-a timeout is reached, so this parameter should be set slightly
-higher than the maximum number of expected clients so disconnected
-clients can immediately reconnect.  This parameter can only
-be set at server start.
+Specifies the maximum number of concurrent connections from standby
+servers or streaming base backup clients (i.e., the maximum number of
+simultaneously running WAL sender processes). The default is 10. The
+value 0 means replication is disabled.  Abrupt streaming client
+disconnection might leave an orphaned connection slot behind until a
+timeout is reached, so this parameter should be set slightly higher
+than the maximum number of expected clients so disconnected clients
+can immediately reconnect.  This parameter can only be set at server
+start.
 Also, wal_level must be set to
 replica or higher to allow connections from standby
 servers.

+
+   
+ When running a standby server, you must set this parameter to the
+ same or higher value than on the master server. Otherwise, queries
+ will not be allowed in the standby server.
+

   
 
Cheers,
Oleksii

Re: Connection slots reserved for replication

2019-01-02 Thread Oleksii Kliukin
On Mon, Dec 17, 2018, at 14:10, Alexander Kukushkin wrote:
> Hi,
> 
> On Thu, 6 Dec 2018 at 00:55, Petr Jelinek  
> wrote:
> > > Does excluding WAL senders from the max_connections limit and including 
> > > max_wal_senders in MaxBackends also imply that we need to add 
> > > max_wal_senders to the list at xlog.c: CheckRequiredParameterValues, 
> > > requiring its value on the replica to be not lower than the one on the 
> > > primary?
> > >
> >
> > I think it does, we need the proc slots for walsenders on the standby
> > same way we do for normal backends.
> 
> You are absolutely right. Attaching the new version of the patch.

Thank you. I've checked that the replica correctly complains when its value of
max_wal_senders is lower than the one on the primary with v6.

As stated in my previous comment, I think we should retain the specific error
message on exceeding max_wal_senders, instead of showing the generic "too many
clients already".  Attached is the patch that fixes this small thing. I've also
rebased it against the master and took the liberty of naming it v7.  It makes
me wonder why we don't apply the same level of detail to the regular
out-of-connections message and show the actual value of max_connections in the
error text.

The code diff to the previous version is rather small:


diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d1a8113cb6..df073673f5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2273,8 +2273,8 @@ InitWalSenderSlot(void)
Assert(MyWalSnd == NULL);
 
/*
-* Find a free walsender slot and reserve it. If this fails, we must be
-* out of WalSnd structures.
+* Find a free walsender slot and reserve it. This must not fail due
+* to the prior check for free walsenders at InitProcess.
 */
for (i = 0; i < max_wal_senders; i++)
{
@@ -2310,13 +2310,7 @@ InitWalSenderSlot(void)
break;
}
}
-   if (MyWalSnd == NULL)
-   ereport(FATAL,
-   (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
-errmsg("number of requested standby connections "
-   "exceeds max_wal_senders (currently %d)",
-   max_wal_senders)));
-
+   Assert(MyWalSnd != NULL);
/* Arrange to clean up at walsender exit */
on_shmem_exit(WalSndKill, 0);
 }

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 33387fb71b..fe33fc42e3 100644
@@ -341,6 +353,12 @@ InitProcess(void)
 * in the autovacuum case?
 */
SpinLockRelease(ProcStructLock);
+   if (am_walsender)
+   ereport(FATAL,
+   (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
+errmsg("number of requested standby 
connections "
+   "exceeds 
max_wal_senders (currently %d)",
+   max_wal_senders)));
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 errmsg("sorry, too many clients already")));


Cheers,
Oleksii

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d8fd195da0..36ea513273 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2172,6 +2172,11 @@ LOG:  database system is ready to accept read only connections
  max_locks_per_transaction
 

+   
+
+ max_wal_senders
+
+   

 
  max_worker_processes
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 8d9d40664b..18665895cb 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -717,13 +717,13 @@ psql: could not connect to server: No such file or directory

 SEMMNI
 Maximum number of semaphore identifiers (i.e., sets)
-at least ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16) plus room for other applications
+at least ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 5) / 16) plus room for other applications

 

 SEMMNS
 Maximum number of semaphores system-wide
-ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16) * 17 plus room for other applications
+ceil((max_connections + autovacuum_max_workers + max_wal_senders + max_worker_processes + 5) / 16) * 17 plus room for other applications

 

@@ -780,13 +780,13 @@ psql: could not connect to server: No such file or directory

Re: \gexec \watch

2018-12-06 Thread Oleksii Kliukin


Hi Álvaro,

> On 6. Dec 2018, at 13:56, Alvaro Herrera  wrote:
> 
> To Oleksii's question, I think if you want to repeat the first query
> over and over, you'd use something like this:
> 
> select format('select now() as execution_time, %L as generation_time', now()) 
> as query \gset
> :query \watch

Nice one, although it only works if the original query outputs a single row 
(because of \gset).
I do agree it’s not that useful to reuse the original query instead of 
executing it anew each time.
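
For example, with a multi-row result psql refuses to store anything (a quick
illustration; the exact message wording may vary across versions):

    select 1 as x union all select 2 as x \gset
    more than one row returned for \gset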

Cheers,
Oleksii


Re: \gexec \watch

2018-12-06 Thread Oleksii Kliukin



> On 6. Dec 2018, at 09:01, Alvaro Herrera  wrote:
> 
> On 2018-Dec-06, David Fetter wrote:
> 
>> There's a bit of a philosophical issue here, or a mathematical one,
>> whichever way you want to put it.  Does it actually make sense to have
>> the behavior of one "semicolon" spill onto another?
> 
> Honestly, I don't see the mathematicality in this.  It either works, or
> it doesn't -- and from my POV right now it doesn't.  Are you saying we
> need a \gexecwatch for this to work?

I’ve been trying to do something similar with periodic execution of \gexec
(changing the tablespace of all tables in the given one and retrying, since
some of them could only get a lock on subsequent attempts) and generally
reverted to a bash loop outside of psql, but having it built-in would be great.

Perhaps a numeric argument to \gexec, e.g. \gexec 5 to re-execute the output of
a query every 5 seconds?
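
For my use case that would look something like this (hypothetical syntax, with
a made-up tablespace name; nothing like this exists in psql today):

    select format('alter table %I set tablespace new_ts', relname)
      from pg_class
     where relkind = 'r' \gexec 5

Every 5 seconds the command would run again, eventually catching the tables
that failed to get a lock earlier.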

The other question is whether such a command would execute the original query
every time \watch is invoked. Consider, e.g. the following one:

select format('select now() as execution_time, %L as generation_time', now()) 
\gexec
execution_time  | 2018-12-06 12:15:24.136086+01
generation_time | 2018-12-06 12:15:24.13577+01

If we make \gexec + \watch combination re-execute only the output of the 
original query (without the query itself), then the generation time column will 
stay constant through all \watch invocations.
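
In that case a second tick would print something like (illustrative timestamps;
note that only execution_time advances):

execution_time  | 2018-12-06 12:15:29.136086+01
generation_time | 2018-12-06 12:15:24.13577+01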

Cheers,
Oleksii


Re: Connection slots reserved for replication

2018-12-05 Thread Oleksii Kliukin


> On 30. Nov 2018, at 13:58, Alexander Kukushkin  wrote:
> 
> attaching the new version of the patch.
> Now it simply reserves max_wal_senders slots in the ProcGlobal, what
> guarantees that only walsender process could use them.

With this patch it looks like InitProcess will trigger the generic 'too many
clients' error before the more specific error message in InitWalSenderSlot
when the number of walsenders exceeds max_wal_senders.

Does excluding WAL senders from the max_connections limit and including 
max_wal_senders in MaxBackends also imply that we need to add max_wal_senders 
to the list at xlog.c: CheckRequiredParameterValues, requiring its value on the 
replica to be not lower than the one on the primary? 
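
If so, I guess the check would mirror what is already done there for
max_connections and friends, along these lines (just a sketch; it assumes a
new max_wal_senders field in the control file):

	RecoveryRequiresIntParameter("max_wal_senders",
								 max_wal_senders,
								 ControlFile->max_wal_senders);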

Cheers,
Oleksii


Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-18 Thread Oleksii Kliukin



> On 18. Sep 2018, at 03:18, Thomas Munro  wrote:
> 
> On Tue, Sep 18, 2018 at 1:15 AM Chris Travers  
> wrote:
>> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin  wrote:
>>> With the patch applied, the posix_fallocate loop terminated right away
>>> (because of the QueryCancelPending flag set to true) and the backend went
>>> through the cleanup, showing an ERROR of cancelling due to the conflict
>>> with recovery.
>>> Without the patch, it looped indefinitely in dsm_impl_posix_resize, while
>>> the startup process was looping forever, trying to send SIGUSR1.
> 
> Thanks for testing!
> 
>>> One thing I’m wondering is whether we could do the same by just blocking 
>>> SIGUSR1
>>> for the duration of posix_fallocate?
>> 
>> If we were to do that, I would say we should mask all signals we can mask 
>> during the call.
>> 
>> I don't have a problem going down that road instead if people think it is 
>> better.
> 
> We discussed that when adding posix_fallocate() and decided that
> retrying is better:
> 
> https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de
> 
> Here is a patch that I propose to commit and back-patch to 9.4.  I
> just wrote a suitable commit message, edited the comments lightly and
> fixed some whitespace.

Thanks!

Apart from the fact that the reviewer's name is “Murat Kabilov” and
not “Murak Kabilov”, the back-patch looks good to me.

Cheers,
Oleksii Kliukin


Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-17 Thread Oleksii Kliukin



> On 7. Sep 2018, at 17:57, Chris Travers  wrote:
> 
> Hi;
> 
> Attached is the patch we are fully testing at Adjust.  There are a few 
> non-obvious aspects of the code around where the patch hits. I have run 
> make check on Linux and MacOS, and make check-world on Linux (check-world 
> fails on MacOS on all versions and all branches due to ecpg failures).  
> Automatic tests are difficult because it is due to a rare race condition
> which is difficult (or possibly impossible) to create automatically.  Our 
> current approach to testing is to force the issue using a debugger.  Any 
> ideas on how to reproduce it automatically are appreciated, but even on our 
> heavily loaded systems this seems to be a very small portion of queries that 
> hit this case (meaning the issue happens about once a week for us).


I did some manual testing on it, putting breakpoints in the
ResolveRecoveryConflictWithLock in the startup process on a streaming replica
(configured with a very low max_standby_streaming_delay, i.e. 100ms) and to the
posix_fallocate call on the normal backend on the same replica. At this point I
also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nostop)
and resumed execution on both the backend and the startup process.
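
For reference, the gdb setup was roughly the following (two sessions, one
attached to the startup process and one to the backend):

    # in the startup process:
    (gdb) break ResolveRecoveryConflictWithLock
    # in the backend, near the posix_fallocate call:
    (gdb) break dsm_impl_posix_resize
    # in both sessions:
    (gdb) handle SIGUSR1 nostop noprint pass
    (gdb) continue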

Then I simulated a conflict by creating a rather big (3GB) table on the master,
doing some updates on it and then running an aggregate on the replica backend
(i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I
set the breakpoint. The aggregate and force_parallel_mode ensured that
the query was executed as a parallel one, leading to allocation of the DSM.

Once the backend reached the posix_fallocate call and was waiting on a
breakpoint, I called 'vacuum full test' on the master, which led to a conflict
on the replica running 'select from test' (in a vast majority of cases),
triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup
process, since the startup process tried to cancel the conflicting backend. At
that point I continued execution of the startup process (which would loop in
CancelVirtualTransaction, sending SIGUSR1 to the backend while the backend
waited to be resumed from the breakpoint). Right after that, I changed the size
parameter on the backend to something that would make posix_fallocate run for a
bit longer, typically ten to a hundred MB. Once the backend process was resumed,
it started receiving SIGUSR1 from the startup process, resulting in
posix_fallocate exiting with EINTR.

With the patch applied, the posix_fallocate loop terminated right away (because
of the QueryCancelPending flag set to true) and the backend went through the
cleanup, showing an ERROR of cancelling due to the conflict with recovery.
Without the patch, it looped indefinitely in dsm_impl_posix_resize, while
the startup process was looping forever, trying to send SIGUSR1.
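
For reference, the patched loop boils down to something like this (a sketch of
the approach, not the exact hunk; rc, fd and size as in dsm_impl_posix_resize):

	/* Retry on EINTR, but give up as soon as an interrupt is pending. */
	do
	{
		rc = posix_fallocate(fd, 0, size);
	} while (rc == EINTR && !(ProcDiePending || QueryCancelPending));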

One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
for the duration of posix_fallocate?
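
That would look roughly like the following (a sketch; error checking of the
sigprocmask calls omitted, and a SIGUSR1 arriving in between is simply
delivered once the old mask is restored):

	sigset_t	block_set,
				save_set;

	sigemptyset(&block_set);
	sigaddset(&block_set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &block_set, &save_set);
	rc = posix_fallocate(fd, 0, size);
	sigprocmask(SIG_SETMASK, &save_set, NULL);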

Cheers,
Oleksii Kliukin