[GENERAL] Possible multiprocess lock/unlock-loop problem in Postgresql 9.2

Yngve N. Pettersen Sat, 04 Jan 2014 03:16:09 -0800

Hello all,

I am running a Postgresql 9.2 system (IIRC v9.2.4; I am about to upgradeto 9.2.6) on a system with 32-cores, 256GB RAM, 64GB shared RAM forpostgresql. The applications I am running are written using Django(currently v1.5)

For a while I have been observing what may be special kind of deadlock, orperhaps more precisely a lock-unlock loop among a group of postgresqlprocesses performing similar, possibly overlapping queries.

I am wondering if this is a known issue? If so, are there any workaroundsaside from the ones I am currently using myself? I did not see somethinglike this listed in the lock section of the Bug Todo list.

I have so far observed this in two specific query scenarios, and theproblem seem to trigger at 10 or more processes performing similar queries(with 8 processes I do not observe the problem). When this happens, 10 ofthe processes (the rest will be in waiting state) will run at 100% CPU(one full core) without apparently being able to complete the operation.IIRC I have seen this go on for hours (even for queries that shouldcomplete in fractions of a second) before I have killed the postgresprocesses (they are so busy they do not detect that the connection hasbeen closed by killing the remote process).

In one of the cases, an UPDATE operation, the problem only appear tohappen during the first such updates, when all relevant entries are in thesame state, not later, so I have been able to avoid the problem by makingsure that initially there are no more than 8 active processes working onthe database. IIRC even just killing the processes and restarting the jobwithout any changes to process counts seem to avoid the problem, but thatdoes not avoid the initial startup problem. The queries are (supposed to)start in a random staggered sequence several seconds apart just to avoidtoo many operations at the same time.

The second case is a multi-table SELECT, and again running 10 or moreprocesses seem to trigger the problem, and I have had to reduce the numberof parallel queries to 8 to avoid the issues. In this case it might bepossible to randomize the sequence of queries, but that still couldaccidentally trigger a similar 10+ parallel query case, and all of thequeries will always access at least one of the tables. I am investigatingother alternatives for this case.

At present I have not had time to create a testcase for this, but I cangive some details of the UPDATE scenario.

The relevant fields in the table (a job queue list) are the unique"id"-field and a "state" field with 3 values (Idle, Started, Finished).

The operation have two steps, first a retrieval of a group id's forcurrently idle entries, then an update of the records with those IDs tostate started, assuming they are still idle (to avoid multiple allocationsof the same job), returning a list of updated item ids.


The update query looks like this:

UPDATE queue SET state = E'S' WHERE state = E'I' AND id IN (<list ofintegers>) RETURNING id;

There is a BEGIN/COMMIT wrap around the operation, including the SELECTquery.

My guess is that the active processes get into a lock/unlock loopregarding the "state" field because the list of ids overlap, and for somereason, instead of completing the operation according to some kind ofpriority order.

I have no real guess about what is causing the SELECT operations to loopin that fashion, since there is no write operations that could cause thekind of apparent loop I see in the UPDATE operation. A wild guess could bethat the SELECT processes are passing around the operation for creating acommon table join for linking records from two (or more tables), or justretrieval of records from a single table, since the operation overlaptables and sub selections from those tables. If this guess have somebearing on what is going on, it may also be something similar that isgoing on in the UPDATE case, e.g. an attempt to collect all the affectedrecords, and the job just gets passed on to the next process due to somelogic.


Any ideas about what is going on?

--
Using Opera's mail client: http://www.opera.com/mail/


--
Sent via pgsql-general mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[GENERAL] Possible multiprocess lock/unlock-loop problem in Postgresql 9.2

Reply via email to