Ask Solem <a...@opera.com> added the comment:

> Unfortunately, if you've lost a worker, you are no
> longer guaranteed that cache will eventually be empty.
> In particular, you may have lost a task, which could
> result in an ApplyResult waiting forever for a _set call.

> More generally, my chief assumption that went into this
> is that the unexpected death of a worker process is
> unrecoverable. It would be nice to have a better workaround
> than just aborting everything, but I couldn't see a way
> to do that.

It would be a problem if the process simply disappeared,
but in this case you have the ability to put a result on the queue,
so the ApplyResult doesn't have to wait forever.

For processes disappearing (if that can happen at all), we could solve
that by storing the jobs a process has accepted (started working on),
so if a worker process is lost, we can mark those jobs as failed too.
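
Something like this could work (just a sketch; the names here are
illustrative, not the real Pool attributes, and the (success, value)
pairs mirror what ApplyResult._set already receives):

    import errno

    # Illustrative bookkeeping only: remember which tasks each worker
    # pid has accepted, so that if the pid dies we can fail exactly
    # those tasks instead of aborting everything.
    class WorkerTracker:
        def __init__(self):
            self._accepted = {}      # pid -> set of (job, i) pairs

        def on_accept(self, pid, job, i):
            # Worker `pid` reports it started working on task (job, i).
            self._accepted.setdefault(pid, set()).add((job, i))

        def on_result(self, pid, job, i):
            # Task completed (or errored) normally; stop tracking it.
            self._accepted.get(pid, set()).discard((job, i))

        def reap(self, pid, cache):
            # Worker `pid` was lost: fail all of its accepted tasks so
            # the matching ApplyResult objects stop waiting.
            for job, i in self._accepted.pop(pid, set()):
                result = cache.get(job)
                if result is not None:
                    result._set(i, (False, OSError(
                        errno.EPIPE, "worker process lost")))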

> I could be wrong, but that's not what my experiments
> were indicating. In particular, if an unpickleable error occurs,
> then a task has been lost, which means that the relevant map,
> apply, etc. will wait forever for completion of the lost task.

It's lost now, but not if we handle the error...
For a single map operation this behavior may make sense, but what about
someone running the pool as a long-running service for users to submit map
operations to? Errors in this context are expected to happen, even
unpickleable errors.
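
For instance, the worker could guard the put() of the result, replacing an
unpicklable value with a picklable error instead of losing the task (a
sketch only; safe_put and the substituted error type are illustrative):

    def safe_put(outqueue, job, i, success, value):
        # Try to send the real result; pickling happens inside put().
        try:
            outqueue.put((job, i, (success, value)))
        except Exception as exc:
            # Catch Exception, not just PicklingError, since pickling
            # can also raise e.g. TypeError.  Send a picklable error in
            # the value's place so the ApplyResult is completed rather
            # than left waiting forever.
            err = RuntimeError("unpicklable result for job %r: %s"
                               % (job, exc))
            outqueue.put((job, i, (False, err)))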

I guess the worker handler acting as a supervisor is a side effect,
since it was made for the maxtasksperchild feature, but for me it's a
welcome one. With the supervisor in place, multiprocessing.pool is already
fairly stable for this use case, and there's not much left to do to make it
solid (Celery has been running for months without issue, unless there's a
pickling error...)

> That does sound useful. Although, how can you determine the
> job (and the value of i) if it's an unpickleable error?
> It would be nice to be able to retrieve job/i without having
> to unpickle the rest.

I was actually already working on this last week, and I managed
to do it in a way that works well enough (at least for me):
http://github.com/ask/celery/commit/eaa4d5ddc06b000576a21264f11e6004b418bda1#diff-1
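
The idea, roughly (not the exact commit; the function names below are
illustrative), is to pickle the (job, i) header separately from the
payload, so the header can always be recovered even when the payload
can't be unpickled:

    import pickle

    def dump_result(job, i, success, value):
        # Header and payload are pickled independently.
        header = pickle.dumps((job, i))
        payload = pickle.dumps((success, value))
        return header, payload

    def load_result(header, payload):
        job, i = pickle.loads(header)   # always a plain, safe tuple
        try:
            success, value = pickle.loads(payload)
        except Exception as exc:
            # The payload is broken, but job/i are known, so the right
            # ApplyResult can be failed instead of hanging forever.
            success, value = False, exc
        return job, i, success, value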
