[issue17969] multiprocessing crash on exit

2013-05-13 Thread Kristján Valur Jónsson

New submission from Kristján Valur Jónsson:

We have observed this crash with some frequency when running our compilation 
scripts using multiprocessing.Pool()

By analysing the crashes, this is what is happening:
1) The Pool has a "daemon" thread managing the pool.
2) the worker is asleep, waiting for the GIL
3) The main thread exits.  The system starts its shutdown. During 
PyInterpreterState_Clear, it has cleared among other things the sys dict.  
During this, it clears an old traceback.  The traceback contains a 
multiprocessing.connection object.
4) The connection object is cleared.  It it contains this code:
Py_BEGIN_ALLOW_THREADS
CLOSE(self->handle);
Py_END_ALLOW_THREADS
5) The sleeping daemon thread is woken up and starts prancing around.  Upon 
calling sys.exc_clear() it crashes, since the tstate->interp->sysdict == NULL.


I have a workaround in place in our codebase:


static void
connection_dealloc(ConnectionObject* self)
{
if (self->weakreflist != NULL)
PyObject_ClearWeakRefs((PyObject*)self);

if (self->handle != INVALID_HANDLE_VALUE) {
/* CCP Change.  Cannot release threads here, because this
 * deallocation may be running during process shutdown, and
 * releaseing a daemon thread will cause a crash
Py_BEGIN_ALLOW_THREADS
CLOSE(self->handle);
Py_END_ALLOW_THREADS
 */
CLOSE(self->handle);
}
PyObject_Del(self);
}


In general, deallocators should have no side effects, I think.  Releaseing the 
GIL is certainly a side effect.

I realize that process shutdown is a delicate matter.  One delicate thing is 
that we cannot allow worker threads to run anymore.  I see no general mechanism 
for ensuring this, but surely at least not releasing the GIL for deallocators 
is a first step?

--
messages: 189123
nosy: kristjan.jonsson
priority: normal
severity: normal
status: open
title: multiprocessing crash on exit
type: crash
versions: Python 2.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Richard Oudkerk

Changes by Richard Oudkerk :


--
nosy: +sbt

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Richard Oudkerk

Richard Oudkerk added the comment:

> In general, deallocators should have no side effects, I think.  
> Releaseing the GIL is certainly a side effect.

Notice that socket and file objects also release the GIL when being 
deallocated.  At least for sockets close() can block (e.g. if you you use the 
SO_LINGER option).

I am not sure whether we can ignore the possibility for connection objects.

> I realize that process shutdown is a delicate matter.  One delicate 
> thing is that we cannot allow worker threads to run anymore.  I see no 
> general mechanism for ensuring this, but surely at least not releasing 
> the GIL for deallocators is a first step?

I agree that shutdown matters are delicate, particularly when daemon threads 
are involved.  In fact I'm starting to agree with Antoine that daemon threads 
are evil and should be avoided wherever possible.

P.S. I think in Python 3.x this thread switching is stopped (by setting 
_Py_Finalizing to something non-NULL) before PyInterpreterState_Clear() is run.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Kristján Valur Jónsson

Kristján Valur Jónsson added the comment:

I think that socket.close() is the exception rather than the rule here.  
What kind of handle is this?  It can't be a socket, since that would require 
closesocket.

Also, even though an IO call _can_ block, that doesn't mean that we _must_ 
release the gil for the duration.

I´m not very familiar with multiprocessing, I'm mainly trying to enhance 
robustness with our build tools here.  Would an alternative fix, making the 
worker thread a non-daemon, be hard to do?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Charles-François Natali

Charles-François Natali added the comment:

> Also, even though an IO call _can_ block, that doesn't mean that
> we _must_ release the gil for the duration.

Yes, otherwise some people will complain when the whole interpreter is stuck 
while a socket/NFS file handle/whatever is shutdown.

This problem isn't specific to multiprocessing: for example, if a daemon 
thread's thread state is cleared as part of the shutdown procedure, and one of 
the TLS object releases the GIL (e.g. a database connection), you'll have the 
same kind of problem.

AFAICT, it's "solved" in Python 3 because, as Richard said, you can't acquire 
the GIL once the interpreter is shutting down.
Daemon threads are *really* tricky, since they can wake-up while the 
interpreter is shutting down.
So they should be avoided as much as possible.
We can also try to think some more about a way to avoid/limit this type of 
issue, but it's not trivial...

--
nosy: +neologix, pitrou

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Yes, I think it's too delicate to change in 2.7 right now. As Charles-François 
said, daemon threads should be much less crashy in 3.3.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Kristján Valur Jónsson

Kristján Valur Jónsson added the comment:

Well, knowing that they crash less in 3.3 doesn't really fix the problem now, 
does it?

We can consider two options then:
1) A multiprocessing specific fix.  Removing this handle close gil release 
(which is superfluous, since these calls aren't blocking in any real sense) 
will certainly remove _this_ instance of the crash.  An alternative is to make 
this worker thread non-daemon.  That shouldn't be too hard and shouldn't have 
any other side effects.

2) A general daemon thread fix.  For example, removing the GIL at the start of 
the shutdown process will make it impossible to release it.  We can do this by 
setting interpreter_lock to NULL.

I don't see the point of having 2.7 in bug fix mode if we can't fix bugs.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Le lundi 13 mai 2013 à 16:14 +, Kristján Valur Jónsson a écrit :
> I don't see the point of having 2.7 in bug fix mode if we can't fix
> bugs.

Delicate bug fixes may entail regressions, and we've had enough of them
in 2.7.4. You've already patched your own Python anyway.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-13 Thread Richard Oudkerk

Richard Oudkerk added the comment:

> We can consider two options then:
> 1) A multiprocessing specific fix.  Removing this handle close gil 
> release (which is superfluous, since these calls aren't blocking in any 
> real sense) will certainly remove _this_ instance of the crash.  An 
> alternative is to make this worker thread non-daemon.  That shouldn't 
> be too hard and shouldn't have any other side effects.

Have you tried doing

p = Pool()
try:
...
finally:
p.close()   # or perhaps p.terminate()
p.join()

Actually, making the worker thread non-daemonic is not so simple.  This is 
because there is no way to interrupt a background thread which is blocked doing 
IO (unless you use multiplexing which is not possible with non-overlapped 
pipes).

One can try to unblock such background threads by reading/writing messages 
from/to the result/task pipe, but it not straightforward.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-14 Thread Kristján Valur Jónsson

Kristján Valur Jónsson added the comment:

Catching regressions is what we have the regression tests for.  If it is not in 
caught by the regression tests, then it is not a regression, by definition.
Bug fix mode is for fixing bugs, IMHO.
Yes, I have patched my local version.  The reason I am here having this 
discussion is that I want to contribute to help others that are using 2.7 for 
multiprocessing.  Others will have the same problem, and probably have, already.

Anyway, I cannot easily reproduce the problem, it happens regularly in a live 
production environment.  My patch is a conjecture based patch.

But I actually had a different thought about how to handle this.
The particular manifestation of this error occurs because an exception state is 
being cleared from the system dict.  The dict contains a frame and there is 
where the connection object is being held.
The problem can be avoided, by clearing the exception right at the start of the 
PyInterpreterState_Clear(), thus flushing out most side effects right at the 
start.

I'll prepare a patch for you to review.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-14 Thread Richard Oudkerk

Richard Oudkerk added the comment:

Kristjan, could you confirm whether joining the pool explicitly before shutdown 
(in the way I suggested earlier) fixes the problem.  I think it should -- at 
shutdown you won't switch to a thread if it has already been joined.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-14 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> Catching regressions is what we have the regression tests for.  If it
> is not in caught by the regression tests, then it is not a
> regression, by definition.

Call it what you want :-) The bottom line is that we'll release a
2.7.5 soon because of breakage introduced in 2.7.4. Whether or not
it's a "regression" according to some platonic definition isn't
very important here.

> The problem can be avoided, by clearing the exception right at the
> start of the PyInterpreterState_Clear(), thus flushing out most side
> effects right at the start.
> 
> I'll prepare a patch for you to review.

I think this kind of stuff is too tricky to risk breaking 2.7
once again with it.
If 3.x has the same issue, then a patch is helpful, otherwise not IMHO.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-14 Thread Charles-François Natali

Charles-François Natali added the comment:

> Catching regressions is what we have the regression tests for.  If it is not 
> in caught by the regression tests, then it is not a regression, by definition.

You do realize this sentence doesn't make sense, do you?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-14 Thread Kristján Valur Jónsson

Kristján Valur Jónsson added the comment:

Richard, I'll review implement your change.  It is a bit tricky to test this, 
since I can only tell after a few days if the particular (rare) problem has 
been fixed.  The crash is a rare one but consistently happens with some 
probability we use multiprocessing to perform certain batch processing.

About regressions:  The term means that problems, previously fixed, become 
broken again.  All fixes should reasonably have appropriate tests in the 
regression test suite.

I don't know what particular "regressions" you have been battling since 2.7.4.  
I hope they are all testable by now.
I only know that there is a exit problem with multiprocessing that 
statistically happens and is problemematic.  It caused use to stop using 
multiprocessing and parallel jobs until I could diagnose the problem.  I have 
suggested several fixes to this particular problem.

I have two favorites, which would also help in the general case:
1) calling sys.exc_clear() at the beginning of the python finalize part, to 
throw out any lingering traceback and frame objects.  This will cause side 
effects such as gil-twiddling to happen early, and without consequence.  
sys.exc_clear() is a safe api and used throughout the code, an extra call just 
prior to exit should be a very safe operation.
2) Turn off multi-threading at the start of python exit, by setting 
interpreter_lock to NULL.  Again, this is a no-brainer, since the NULL value is 
already well defined and works well.  It will cause all subsequent GIL releases 
to be no-ops.

I personally favor the second one.  It makes no sense to allow other threads to 
run after finalization has started and forcing them to stay dead is only 
prudent.

2.7 is not frozen, and it is in bug fix mode.  If a solid bug fix to an actual 
problem is proposed, we should embrace it, and deal with any eventual fallout 
appropriate as the price of doing business.  2.7 is a product that is alive and 
well, used by millions of people and thousands of businesses and will remain so 
for quite some time.  I know that most of the core devs have moved on to 
greener pastures, but I for one would like to stay loyal to this workhorse of a 
product and continue to make necessary improvements to it.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue17969] multiprocessing crash on exit

2013-05-14 Thread Kristján Valur Jónsson

Kristján Valur Jónsson added the comment:

Richard, reading the multiprocessing.py code, it appears that your suggestion 
of closign and joining the Pool() object should also do the trick.  Doing that 
will certainly also fix this particular case.  I'll implement that in our local 
application code.

I'm still not happy that 2.7 has a potential exit crash if a daemon thread is 
left hanging :(

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com