Hi all,

First of all, many thanks to all involved in web2py; it's been working out 
very well for me :)

I've been using the scheduler a lot, and I think I've found a source of 
database deadlocks.  As far as I can tell, it happens when the system is 
deleting what it thinks are dead workers (because their heartbeats have 
timed out) and electing a new ticker, and two or more worker processes try 
to do this at the same time.  Each worker updates its own heartbeat and 
then runs several update/delete operations across many scheduler_worker 
(and scheduler_task) rows without intervening db.commits, and those 
multi-row operations collide with each other.  These can range from simple:

2015-11-16 11:02:33 GMT PID=5244 trans=72287062 ERROR:  40P01: deadlock detected
2015-11-16 11:02:33 GMT PID=5244 trans=72287062 DETAIL:  Process 5244 waits for ShareLock on transaction 72287313; blocked by process 6236.
Process 6236 waits for ShareLock on transaction 72287062; blocked by process 5244.
Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));

to spectacular:

2015-11-16 11:02:16 GMT PID=6772 trans=72287377 ERROR:  40P01: deadlock detected
2015-11-16 11:02:16 GMT PID=6772 trans=72287377 DETAIL:  Process 6772 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 564.
Process 564 waits for ShareLock on transaction 72287313; blocked by process 6236.
Process 6236 waits for AccessExclusiveLock on tuple (0,19) of relation 16908 of database 16384; blocked by process 6804.
Process 6804 waits for ShareLock on transaction 72287062; blocked by process 5244.
Process 5244 waits for ShareLock on transaction 72287388; blocked by process 728.
Process 728 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 6772.
Process 6772: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:01') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:01') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
Process 564: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:45') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:45') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
Process 6804: UPDATE scheduler_worker SET status='ACTIVE',last_heartbeat='2015-11-16 11:00:36',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0, "queue": 0, "empty_runs": 11683, "sleep": 1.0, "distribution": null, "total": 0}' WHERE (scheduler_worker.worker_name = 'UKVMAAS#4280');
Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
Process 728: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:03') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:03') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
(from PostgreSQL 9.4.4 logs with debug logging enabled; web2py 2.12.3 on Windows 10)
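
For what it's worth, each of the logged statements looks like one worker at 
a different point in roughly the following sequence, written out in DAL 
terms.  This is my own reconstruction from the logged SQL, not the actual 
gluon/scheduler.py code, and names like active_cutoff, me, now and stats 
are just placeholders; the point is that it all appears to run as a single 
transaction:

    sw, st = db.scheduler_worker, db.scheduler_task

    # heartbeat-timeout condition used by the UPDATE/DELETE statements above
    dead = (((sw.last_heartbeat < active_cutoff) & (sw.status == 'ACTIVE')) |
            ((sw.last_heartbeat < inactive_cutoff) & (sw.status != 'ACTIVE')))

    # update my own heartbeat (the scheduler_worker UPDATE from process 6804)
    db(sw.worker_name == me).update(status='ACTIVE', last_heartbeat=now,
                                    worker_stats=stats)

    # requeue RUNNING tasks assigned to apparently dead workers
    # (the scheduler_task UPDATEs from processes 6772, 564 and 728)
    db(st.assigned_worker_name.belongs(db(dead)._select(sw.worker_name)) &
       (st.status == 'RUNNING')).update(status='QUEUED', assigned_worker_name='')

    # delete the dead workers (the DELETE from process 6236)
    db(dead).delete()

    # ticker election: clear is_ticker on every other worker
    # (the UPDATE from process 5244)
    db(sw.worker_name != me).update(is_ticker=False)

    db.commit()  # one commit at the end, so every row lock taken above is
                 # held until the whole sequence has finished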

This seems to happen more often than you'd hope, because the earlier 
database operations tend to synchronise multiple workers in time if they're 
already waiting on a lock.  The worst case I've found is setting 
PostgreSQL's deadlock_timeout longer than the scheduler's heartbeat 
timeout: when the database eventually detects the deadlock and aborts a 
transaction, several workers are released at the same moment, and they can 
get stuck in a loop where they immediately recreate the same problem.

If this analysis makes sense, would it be possible to split send_heartbeat 
into several smaller transactions without introducing other problems?
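
For example, something along these lines (just a sketch of what I mean, 
reusing the placeholder names from the reconstruction above, and definitely 
not a tested patch), so that the row locks from each multi-row statement 
are released before the next one starts:

    db(sw.worker_name == me).update(status='ACTIVE', last_heartbeat=now,
                                    worker_stats=stats)
    db.commit()  # heartbeat committed on its own

    db(st.assigned_worker_name.belongs(db(dead)._select(sw.worker_name)) &
       (st.status == 'RUNNING')).update(status='QUEUED', assigned_worker_name='')
    db.commit()  # release the scheduler_task row locks before touching workers

    db(dead).delete()
    db.commit()

    db(sw.worker_name != me).update(is_ticker=False)
    db.commit()

I appreciate there may be consistency reasons for keeping some of these 
steps in one transaction (e.g. a brief window where tasks are already 
requeued but their old worker rows still exist), which is why I'm asking 
here rather than just patching my copy.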

Many thanks,

Andy S.

PS My heartbeats time out because this is a VM that occasionally gets 
starved of resource, so not web2py's fault :)
