Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

Gunther Sat, 23 Feb 2019 13:08:32 -0800

Hi,

I am using an SQL queue for distributing work to massively parallelworkers. Workers come in servers with 12 parallel threads. One of thoseworker sets handles 7 transactions per second. If I add a second one,for 24 parallel workers, it scales to 14 /s. Even a third, for 36parallel workers, I can add to reach 21 /s. If I try a fourth set, 48workers, I end up in trouble. But that isn't even so much my problemrather than the fact that in short time, the performance willdeteriorate, and it looks like that is because the queue indexdeteriorates and needs a REINDEX.


The queue table is essentially this:

CREATE TABLE Queue (
  jobId bigint,
  action text,
  pending boolean,
  result text
);

the dequeue operation is essentially this:

BEGIN

SELECT jobId, action
  FROM Queue
  WHERE pending
  FOR UPDATE SKIP LOCKED

which is a wonderful concept with the SKIP LOCKED.

Then I perform the action and finally:

UPDATE Queue
   SET pending = false,
       result = ?
 WHERE jobId = ?

COMMIT

I thought to keep my index tight, I would define it like this:

CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;

so that only pending jobs are in that index.

When a job is done, follow up work is often inserted into the Queue aspending, thus adding to that index.


Below is the performance chart.

The blue line at the bottom is the db server.

You can see the orange line is the first worker server with 12 threads.It settled into a steady state of 7/s ran with 90% CPU for some 30 min,and then the database CPU% started climbing and I tried to rebuild theindexes on the queue, got stuck there, exclusive lock, no jobs wereprocessing, but the exclusive lock was never obtained for too long. So Ishut down the worker server. Database quiet I could resolve the messedup indexes and restarted again. Soon I added a second worker server(green line) starting around 19:15. Once settled in they were pulling14/s together. but you can see in just 15 min, the db server CPU %climbed again to over 40% and the performance of the workers dropped,their load falling to 30%. Now at around 19:30 I stopped them all,REINDEXed the queue table and then started 3 workers serverssimultaneously. They settled in to 21/s but in just 10 min again thedeterioration happened. Again I stopped them all, REINDEXed, and nowstarted 4 worker servers (48 threads). This time 5 min was not enough tosee them ever settling into a decent 28/s transaction rate, but I guessthey might have reached that for a minute or two, only for the indexdeteriorating again. I did another stop now started only 2 servers andagain, soon the index deteriorated again.


Clearly that index is deteriorating quickly, in about 10,000 transactions.

BTW: when I said 7/s, it is in reality about 4 times as manytransactions, because of the follow up jobs that also get added on thisqueue. So 10,0000 transactions may be 30 or 40 k transactions before theindex deteriorates.

Do you have any suggestion how I can avoid that index deteriorationproblem smoothly?

I figured I might just pause all workers briefly to schedule the REINDEXQueue command, but the problem with this is that while the transactionvolume is large, some jobs may take minutes to process, and in that casewe need to wait minutes to quiet the database with then 47 workerssitting as idle capacity waiting for the 48th to finish so that theindex can be rebuilt!

Of course I tried to resolve the issue with vacuumdb --analyze (just incase if the autovacuum doesn't act in time) and that doesn't help.Vacuumdb --full --analyze would probably help but can't work because itrequired an exclusive table lock.


I tried to just create a new index of the same

CREATE UNIQUE INDEX Queue_idx2_pending ON Queue(jobId) WHERE pending;
DROP INDEX Queue_idx_pending;
ANALYZE Queue;

but with that I got completely stuck with two indexes where I could notremove either of them for those locking issues. And REINDEX will give mea deadlock error rightout.


I am looking for a way to manage that index so that it does not deteriorate.

May be if I was not defining it with

... WHERE pending;

then it would only grow, but never shrink. May be that helps somehow? Idoubt it though. Adding to an index also causes deterioration, and mostof the rows would be irrelevant because they would be past work. Itwould be nicer if there was another smooth way.


regards,
-Gunther

Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

Reply via email to