On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila <[email protected]> wrote:
> On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <[email protected]> wrote:
> > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <[email protected]> wrote:
> > > but
> > > here's a small thing: I managed to reach an LWLock self-deadlock in
> > > the undo worker launcher:
> > >
> >
> > I could see the problem, will fix in next version.
>
> Fixed both of these problems in the patch just posted by me [1].

I reran the script that found that problem, so I could play with the
linger logic. It creates N databases, and then it creates tables in
random databases (because I'm testing with the orphaned table cleanup
patch) and commits or rolls back at (say) 100 tx/sec.

While it's doing that, you can look at the pg_stat_undo_logs view to
see the discard and insert pointers whizzing along nicely, but if you
look at the process table with htop or similar you can see that it's
forking undo apply workers at 100/sec (the pid keeps changing)
whenever there is more than one database involved. With a single
database it lingers as I was expecting (and then creates problems when
you want to drop the database).

What I was expecting to see is that if you configure the test to
generate undo work in 2, 3 or 4 dbs, and you have max_undo_workers set
to 4, then you should finish up with 4 undo apply workers hanging
around to service the work calmly, without any new forking happening.
If you generate undo work in more than 4 databases, I was expecting to
see the undo workers exiting and being forked so that a total of 4
workers (at any time) can work their way around the more-than-4
databases, but not switching as fast as they can, so that we don't
waste all our energy on forking and setup (how fast exactly they
should switch, I don't know; that's what I wanted to see).

A more advanced thing to worry about, not yet tested, is how well
they'll handle asymmetrical work distributions (not enough workers,
but some databases producing a lot and some a little undo work).
Script attached.
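
In case it helps anyone reproduce this, here is a rough monitoring
loop in the same style as the attached script (illustrative only: it
just dumps pg_stat_undo_logs once per second, since the exact column
set depends on which version of the undo patches is applied):

import time

import psycopg2 as pg

# open an autocommit control connection, then poll the view forever
conn = pg.connect("dbname=postgres")
conn.set_isolation_level(pg.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cursor = conn.cursor()
while True:
    cursor.execute("select * from pg_stat_undo_logs")
    for row in cursor.fetchall():
        print row
    print
    time.sleep(1)
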
--
Thomas Munro
https://enterprisedb.com
# Install: python-psycopg2 (Debian/Ubuntu)
# Run with python2 ./test_undo_worker_local_balancing.py
import psycopg2 as pg
import random
import time

def make_conn(dbname):
    return pg.connect("dbname=" + dbname)


def run_test(dbs, txs_per_sec, commit_ratio, runtime):
    # first, create a control connection that we'll use to create databases
    conn = make_conn("postgres")
    conn.set_isolation_level(pg.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cursor = conn.cursor()

    # recreate all the databases
    for n in range(dbs):
        cursor.execute("drop database if exists db%d" % n)
        cursor.execute("create database db%d" % n)

    # next, open a separate session to each database
    conns = [make_conn("db%d" % n) for n in range(dbs)]
    cursors = [conn.cursor() for conn in conns]

    # set up interesting GUCs in each session
    for cursor in cursors:
        cursor.execute("set rollback_overflow_size = '0kB'")
        cursor.connection.commit()

    # now do random work at the requested rate until our time runs out
    start = time.time()
    finish_at = start + runtime
    txs = 0
    table_number = 0
    now = start
    while now < finish_at:
        # choose a random session, and run a transaction
        cursor = random.choice(cursors)
        cursor.execute("create table t%d ()" % table_number)
        # decide whether to commit or roll back
        if random.uniform(0.0, 1.0) < commit_ratio:
            cursor.connection.commit()
        else:
            cursor.connection.rollback()
        table_number += 1
        # wait until it's time to start the next transaction
        txs += 1
        next_tx_at = (txs / txs_per_sec) + start
        if next_tx_at < now:
            print "can't run transactions fast enough, not sleeping"
        else:
            time.sleep(next_tx_at - now)
        now = time.time()


if __name__ == "__main__":
    dbs = 4
    txs_per_sec = 1  # 100.0
    commit_ratio = 0.0  # 0.0 = always roll back, 1.0 = always commit
    runtime = 120  # how long to run for, in seconds
    run_test(dbs, txs_per_sec, commit_ratio, runtime)
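
# For the more-than-4-databases case described above, the same script
# could be pointed at more databases than max_undo_workers, something
# like this (an untested variant, parameter values purely illustrative):
#
#     dbs = 8              # more databases than max_undo_workers (4)
#     txs_per_sec = 100.0
#     commit_ratio = 0.0   # always roll back, so every tx produces undo work
#     runtime = 120
#     run_test(dbs, txs_per_sec, commit_ratio, runtime)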