Re: [Python-Dev] deja-vu .. python locking
Greg Ewing wrote:
> Martin Devera wrote:
>> Greg, what change do you have in mind regarding that 3-instruction addition to refcounting?
> I don't have any change in mind. If even an atomic inc is too expensive, it seems there's no hope for us.

Just out of curiosity, would it be a big problem to remove refcounting and live with garbage collection only? I'm not sure whether some Python code depends on exact refcount behaviour (I guess it should not). Probably not for the mainstream, but maybe as a compile-time option, as part of a free-threading solution only for those who need it.

Even if you can do a fast atomic inc/dec, it forces the cacheline holding the refcounter to ping-pong between the caches of the referencing CPUs (for read-only class dicts, for example), so you can probably never get good SMP scalability. Consider a main memory latency of 100ns on an 8-way 2GHz SMP system where parallel computation on the same Python class is going on on all CPUs. If each CPU references the class, say, every 6400 instructions (quite realistic), then at least one CPU will block on that inc/dec each time, so you lose one CPU's worth of work to overhead...

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] deja-vu .. python locking
Martin Devera wrote:
> Just out of curiosity, would it be a big problem to remove refcounting and live with garbage collection only? I'm not sure whether some Python code depends on exact refcount behaviour (I guess it should not).

Now, this gives a true deja-vu. Python applications often rely on reference counting (in particular, that releasing a file object will immediately close the file), despite the language reference saying that this is not a Python feature, just one of the implementation. In addition, implementing a tracing garbage collector would either be tedious or have additional consequences on the semantics: with a conservative GC, some objects may never get collected; with a precise GC, you have to declare GC roots at the C level. Things get more complicated if the GC is also compacting. See the current thread on the py3k list.

Regards, Martin
Re: [Python-Dev] deja-vu .. python locking
Martin Devera [EMAIL PROTECTED] wrote:
[snip]
> Even if you can do a fast atomic inc/dec, it forces the cacheline holding the refcounter to ping-pong between the caches of the referencing CPUs (for read-only class dicts, for example), so you can probably never get good SMP scalability.

That's ok. Why? Because according to Guido, the GIL isn't going away: http://mail.python.org/pipermail/python-3000/2006-April/001072.html ... so ruminations about refcounting, GC, etc., at least with regard to removing the GIL towards some sort of free-threading Python, are likely to go nowhere. Unless, that is, someone is able to translate the codebase to such methods, show that it is not (significantly) more difficult to write extensions for, show only a mild to moderate slowdown on single processors, and prove an actual speedup on multiple processors. But even then it will be a difficult sell, as it could require radical rewrites of the hundreds or thousands of CPython extensions currently being developed and maintained.

- Josiah
Re: [Python-Dev] deja-vu .. python locking
Greg Ewing wrote:
> Martin Devera wrote:
>> As I've written in the Big reader lock paragraph of the original proposal, these objects could be handled by not blocking in the read path
> But as was just pointed out, because of refcounting, there's really no such thing as read-only access to an object. What *looks* like read-only access at the Python level involves refcount updates just from the act of touching the object.

Yes, I was thinking about an atomic inc/dec (locked inc/dec on x86) as used in G. Stein's patch. I have to admit that I haven't measured its performance; I was hoping for a decent one. But from http://www.linuxjournal.com/article/6993 it seems that an atomic inc is rather expensive too (75ns on a 1.8GHz P4) :-(

Greg, what change do you have in mind regarding that 3-instruction addition to refcounting?

thanks, Martin
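The locked inc/dec being discussed can be sketched with C11 atomics (a minimal illustration of the technique, not Greg Stein's actual patch; the `obj_t` struct and function names are hypothetical). On x86 the fetch-add compiles down to a `lock`-prefixed add, which is exactly the instruction whose cost and cacheline traffic the thread is debating:

```c
#include <stdatomic.h>

/* Hypothetical object header with an atomic refcount. */
typedef struct {
    atomic_long ob_refcnt;
} obj_t;

static inline void obj_incref(obj_t *ob) {
    /* On x86 this becomes a single "lock add"/"lock xadd" instruction,
       but it still forces the refcount's cacheline to bounce between
       CPUs that share the object. */
    atomic_fetch_add_explicit(&ob->ob_refcnt, 1, memory_order_relaxed);
}

static inline int obj_decref(obj_t *ob) {
    /* Returns 1 when the count hit zero and the object should be freed. */
    if (atomic_fetch_sub_explicit(&ob->ob_refcnt, 1,
                                  memory_order_release) == 1) {
        atomic_thread_fence(memory_order_acquire);
        return 1;
    }
    return 0;
}
```

The relaxed/release ordering split is the usual choice for refcounts: increments need no ordering, while the final decrement must synchronize before deallocation.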
Re: [Python-Dev] deja-vu .. python locking
> Ah, I think I understand now. First the minor critique: I believe the locking algorithm isn't thread-safe:
>
>     while (ob->owner_thread != self_thread()) {
>         unlock_mutex(thread_mutex[self_thread()])
>         // wait for owning thread to go to a quiescent state
>         lock_mutex(thread_mutex[ob->owner_thread])
>         ob->owner_thread = self_thread()
>         unlock_mutex(thread_mutex[ob->owner_thread])
>         lock_mutex(thread_mutex[self_thread()])
>     }
>
> If two threads are competing for the same object held by a third thread, they may simultaneously enter the while loop, and then simultaneously try to lock the owner_thread. Now, one will win, and own the object. Later, the other will gain the lock, and unconditionally overwrite ownership. This will cause two threads to own the object, which is an error.

oops .. well, it seems a very stupid error on my side. Yes, you are absolutely right; I'll have to rethink it. I hope it is possible to do it in a correct way...

> The more fundamental critique is: why? It seems you do this to improve efficiency, (implicitly) claiming that it is more efficient to keep holding the lock instead of releasing and re-acquiring it each time. I claim that this doesn't really matter: any reasonable mutex implementation will be fast if there is no lock contention. On locking, it will not invoke any system call if the lock is currently not held (it just atomically test-and-sets some field of the lock); on unlocking, it will not invoke any system call if the wait list is empty. As you also need a test, there shouldn't be much of a performance difference.

I measured it. A lock operation in futex-based Linux locking is the same speed as a Windows critical section: about 30 cycles on my P4 1.8GHz in the uncontended case. As explained in the already mentioned http://www.linuxjournal.com/article/6993, this seems to be due to the pipeline flush during the cmpxchg insn. And on top of that there will be the cacheline transfer penalty, which is much larger.

So mutex locking will take time comparable to the protected code itself (assuming fast code like a dict/list read), whereas a single compare takes ten times less. Am I missing something?

thanks, Martin
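One way to repair the race that was pointed out (a sketch, not something proposed in the thread; `try_claim` and the struct are hypothetical names) is to make the ownership hand-off a single atomic compare-and-swap keyed on the owner the waiter actually observed. The losing thread's CAS then fails instead of unconditionally overwriting ownership, and it retries against the new owner:

```c
#include <stdatomic.h>

typedef int thread_id;

typedef struct {
    _Atomic thread_id owner_thread;
} lockable_t;

/* Hypothetical fix: after waiting on the old owner's per-thread mutex,
 * claim ownership only if the owner is still the one we waited for.
 * If another waiter won in the meantime, the CAS fails, and the caller
 * must re-read owner_thread and wait on the new owner instead.
 * (Per-thread mutex handling elided for brevity.) */
static int try_claim(lockable_t *ob, thread_id old_owner, thread_id self)
{
    thread_id expected = old_owner;
    return atomic_compare_exchange_strong(&ob->owner_thread,
                                          &expected, self);
}
```

This turns the unconditional `ob->owner_thread = self_thread()` store into a conditional transition, which is exactly the step the two competing waiters were racing on.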
Re: [Python-Dev] deja-vu .. python locking
Martin Devera wrote:
> Greg, what change do you have in mind regarding that 3-instruction addition to refcounting?

I don't have any change in mind. If even an atomic inc is too expensive, it seems there's no hope for us.

-- Greg
[Python-Dev] deja-vu .. python locking
Hello,

as someone has written in the FAQ, sometimes someone starts a thread about finer-grained locking in Python. Ok, here is one. I don't want to start a flamewar; I only seek suggestions and constructive criticism. I have some ideas which are new in this context (I believe), and I only wanted to make them public in case someone finds them interesting. Comments are welcome. Martin

Round 1, Greg Stein's patch

The patch removes the GIL from version 1.6 and replaces the locking of list, dict and other structures with finer-grained locking. The major slowdown seems to be in the list and dict structures; dicts are used for object attributes, and these are accessed quite often. Because (IIUC) a mutex struct is quite heavy, dict and list are locked via a pool of locks. When you lock such a pooled lock, you really have to lock two locks: one locks the pool itself, the other locks the pooled lock (the second locking can be omitted in the uncontended case because locks in the pool are kept in the locked state). One lock takes about 25 cycles on a UP P4 (mainly the pipeline flush during the memory barrier) and can be even more expensive (hundreds of cycles) due to cacheline moves between CPUs on an SMP machine. The global pool lock is subject to cacheline ping-pong as it will often be reacquired by competing CPUs. In mappinglookup there is lookmapping guarded by this locking scheme; lookmapping itself takes about 20 cycles in the best (one hopes typical) case, plus the compareobj cost (for string keys, say ... 50-100 cycles?). Thus locking/unlocking the read takes 50-100 cycles while the operation itself is 70-120 cycles. One might expect about a 50% slowdown in the dict read path.

RCU like locking

The solution I have in mind is similar to RCU. In Python we have a quiescent state - when a thread returns to the main loop of the interpreter. Let's add an owner_thread field to each lockable object. It records the last thread (its id) which called any lockable method on the object.

Each LOCK operation looks like:

    while (ob->owner_thread != self_thread()) {
        unlock_mutex(thread_mutex[self_thread()])
        // wait for owning thread to go to a quiescent state
        lock_mutex(thread_mutex[ob->owner_thread])
        ob->owner_thread = self_thread()
        unlock_mutex(thread_mutex[ob->owner_thread])
        lock_mutex(thread_mutex[self_thread()])
    }

Unlock is not done - we own the object now and can use it without locking (until we return to the interpreter loop or we call LOCK on another object). For non-shared objects there is only the penalty of the ob->owner_thread != self_thread() test. Not sure about Windows, but on recent Linuxes one can use the %gs register as the thread id, so the compare is about 3 cycles (and owner_thread should be in the object's cacheline anyway). In the contended case there is some cache ping-pong on ob and the mutex, but that is expected.

Deadlocks

Our object ownership is long - from getting it in LOCK to the next quiescent state of the thread. Thus when two threads each want to step on the other's object, they will deadlock. A simple solution is to extend the set of quiescent states: a thread releases its thread_mutex in the main loop (and immediately reacquires it). Additionally, it can release it just before it is going to wait on another thread's mutex, as in LOCK (already in the code above). If you use LOCK correctly, then while you are LOCKing an object you can't be in a vulnerable part of any OTHER object, so you can let other threads take ownership of your own objects during that time. One might also want to release the lock when going to lock a mutex in the threading package, and in the other places where the GIL is released today. However, I admit that I have made no formal proof regarding deadlock; I plan to do one if nobody can find another flaw in the proposal.

Big reader lock

While the above scheme might work well, it would impose a performance penalty on shared dicts which are almost read-only (module.__dict__). For these, similar locking can be used, only the writer has to wait until ALL other threads enter a quiescent state (taking all their locks), then perform the change and unlock them all. Readers can read without any locking.

Compatibility with 3rd party modules

I've read this argument on the python-dev list. Maybe I'm not understanding something, but would it be so complex for Py_InitModule4 to use an extra flag in apiver, for example? When at least one non-free-threaded module is loaded, locking is done in the good old way...
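The "big reader" scheme above can be sketched with pthreads (a simplified illustration under the proposal's assumptions; `NTHREADS`, `thread_mutex`, and `write_locked` are hypothetical names, and real code would track live threads dynamically). Readers pay nothing; a writer waits for every other thread to reach a quiescent state by acquiring all per-thread mutexes:

```c
#include <pthread.h>

#define NTHREADS 8

/* One mutex per interpreter thread; per the proposal, a thread holds
 * its own mutex whenever it is outside a quiescent state. */
static pthread_mutex_t thread_mutex[NTHREADS];

/* Demo shared state standing in for an almost-read-only module dict. */
static int g_value;
static void bump(void) { g_value++; }

/* Readers access shared data with no locking at all. A writer acquires
 * every other thread's mutex (i.e. waits for each of them to reach a
 * quiescent state), mutates, then releases. Mutexes are taken in index
 * order so two concurrent writers cannot deadlock against each other. */
static void write_locked(int self, void (*mutate)(void))
{
    for (int i = 0; i < NTHREADS; i++)
        if (i != self)
            pthread_mutex_lock(&thread_mutex[i]);
    mutate();
    for (int i = NTHREADS - 1; i >= 0; i--)
        if (i != self)
            pthread_mutex_unlock(&thread_mutex[i]);
}
```

Note this makes writes very expensive (O(threads) lock acquisitions), which is why the proposal reserves it for structures like module.__dict__ that are written rarely.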
Re: [Python-Dev] deja-vu .. python locking
Martin Devera wrote:
> RCU like locking
> The solution I have in mind is similar to RCU. In Python we have a quiescent state - when a thread returns to the main loop of the interpreter.

There might be a terminology problem here. RCU is read-copy-update, right? I fail to see the copy (copy the data structure to be modified) and update (replace the original pointer with a pointer to the copy) parts. Do these play a role in your scheme? If so, what specific structure is copied for, say, a list or a dict? This confusion makes it very difficult for me to understand your proposal, so I can't comment much on it. If you think it could work, just go ahead and create an implementation.

Regards, Martin
Re: [Python-Dev] deja-vu .. python locking
Martin v. Löwis wrote:
> There might be a terminology problem here. RCU is read-copy-update, right? I fail to see the copy (copy the data structure to be modified) and update (replace the original pointer with a pointer to the copy) parts. Do these play a role in your scheme? If so, what specific structure is copied for, say, a list or a dict? This confusion makes it very difficult for me to understand your proposal, so I can't comment much on it. If you think it could work, just go ahead and create an implementation.

That is why I used the word "similar". I see the similarity in the way the safe delete phase of RCU is achieved. Probably I selected a bad title for the text. It is because I was reading about the RCU implementation in the Linux kernel, and I discovered that the idea of postponing critical code to some safe point in the future might work in the Python interpreter. So you are right: it is not RCU. It only uses a technique similar to the one RCU uses for freeing the old copy of the data.

It is based on the assumption that an object is typically used by a single thread. You must lock it anyway, just in case another thread steps on it. The idea is that each object is owned by a thread. The owner can use its objects without locking. If a thread wants to use a foreign object, then it has to wait for the owning thread to go to some safe place (out of the interpreter, into LOCK of another object..). It is done with a per-thread lock, and it is necessary because the owner does no locking; thus you can only be sure that nobody is using the object when the former owner is somewhere out of the object.

Regarding implementation, I wanted to look for some opinions before starting to implement something as big as this patch. Probably someone can look and say, hey, it is stupid, you forgot that FILL_IN ... ;-) I hope I explained it better this time; I know my English is not the best. At least it's worse than my Python :-)

thanks for your time, Martin
Re: [Python-Dev] deja-vu .. python locking
On Mon, 18 Sep 2006 17:06:47 +0200, Martin Devera [EMAIL PROTECTED] wrote:
> So you are right: it is not RCU. It only uses a technique similar to the one RCU uses for freeing the old copy of the data. It is based on the assumption that an object is typically used by a single thread.

Which thread owns builtins? Or module dictionaries? If two threads are running the same function and share no state except their globals, won't they constantly be thrashing on the module dictionary? Likewise, if the same method is running in two different threads, won't they thrash on the class dictionary?

Jean-Paul
Re: [Python-Dev] deja-vu .. python locking
> Which thread owns builtins? Or module dictionaries? If two threads are running the same function and share no state except their globals, won't they constantly be thrashing on the module dictionary? Likewise, if the same method is running in two different threads, won't they thrash on the class dictionary?

As I've written in the Big reader lock paragraph of the original proposal, these objects could be handled by not blocking in the read path and waiting for all other threads to come home before modifying. The choice between locking modes could be made either by something like __locking__ or by detecting the mode.
Re: [Python-Dev] deja-vu .. python locking
At 07:08 PM 9/18/2006 +0200, Martin Devera wrote:
> As I've written in the Big reader lock paragraph of the original proposal, these objects could be handled by not blocking in the read path and waiting for all other threads to come home before modifying.

Changing an object's reference count is modifying it, and most accesses to get at the dictionaries themselves involve refcount changes. Your plan, so far, does not appear to have any solution for reducing this overhead. Module globals aren't so bad, in that you'd only have to lock and refcount when frames are created and destroyed. But access to class dictionaries to obtain methods happens a lot more often, and refcounting is involved there as well. So I think that for your plan to work, you would have to eliminate reference counting, in order to bring the lock overhead down to a manageable level.
Re: [Python-Dev] deja-vu .. python locking
Martin Devera wrote:
> It is based on the assumption that an object is typically used by a single thread. You must lock it anyway, just in case another thread steps on it. The idea is that each object is owned by a thread. The owner can use its objects without locking. If a thread wants to use a foreign object, then it has to wait for the owning thread to go to some safe place (out of the interpreter, into LOCK of another object..). It is done with a per-thread lock, and it is necessary because the owner does no locking; thus you can only be sure that nobody is using the object when the former owner is somewhere out of the object.

Ah, I think I understand now. First the minor critique: I believe the locking algorithm isn't thread-safe:

    while (ob->owner_thread != self_thread()) {
        unlock_mutex(thread_mutex[self_thread()])
        // wait for owning thread to go to a quiescent state
        lock_mutex(thread_mutex[ob->owner_thread])
        ob->owner_thread = self_thread()
        unlock_mutex(thread_mutex[ob->owner_thread])
        lock_mutex(thread_mutex[self_thread()])
    }

If two threads are competing for the same object held by a third thread, they may simultaneously enter the while loop, and then simultaneously try to lock the owner_thread. Now, one will win, and own the object. Later, the other will gain the lock, and unconditionally overwrite ownership. This will cause two threads to own the object, which is an error.

The more fundamental critique is: why? It seems you do this to improve efficiency, (implicitly) claiming that it is more efficient to keep holding the lock instead of releasing and re-acquiring it each time. I claim that this doesn't really matter: any reasonable mutex implementation will be fast if there is no lock contention. On locking, it will not invoke any system call if the lock is currently not held (it just atomically test-and-sets some field of the lock); on unlocking, it will not invoke any system call if the wait list is empty. As you also need a test, there shouldn't be much of a performance difference.

Regards, Martin
Re: [Python-Dev] deja-vu .. python locking
Martin Devera wrote:
> Regarding implementation, I wanted to look for some opinions before starting to implement something as big as this patch. Probably someone can look and say, hey, it is stupid, you forgot that FILL_IN ... ;-)

If I understand correctly, your suggestion for avoiding deadlock relies on the fact that a given thread can really only have one object locked at a time, i.e. after you LOCK an object you can only assume you own it until you LOCK another object or return to some quiescent state. Is this right?

If so, the question is whether it's sufficient to be able to lock just one object at a time. Maybe it is, but some more formal consideration of that would be a good idea.

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand
Carpe post meridiem! (I'm not a morning person.)
[EMAIL PROTECTED]
Re: [Python-Dev] deja-vu .. python locking
Martin Devera wrote:
> As I've written in the Big reader lock paragraph of the original proposal, these objects could be handled by not blocking in the read path

But as was just pointed out, because of refcounting, there's really no such thing as read-only access to an object. What *looks* like read-only access at the Python level involves refcount updates just from the act of touching the object.

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand
Carpe post meridiem! (I'm not a morning person.)
[EMAIL PROTECTED]
Re: [Python-Dev] deja-vu .. python locking
Phillip J. Eby wrote:
> So I think that for your plan to work, you would have to eliminate reference counting, in order to bring the lock overhead down to a manageable level.

There's a possibility it wouldn't be atrociously bad. It seems it would only add the 3 instructions or whatever of overhead to most refcount operations. How much this would reduce performance depends on what percentage of time is currently spent refcounting. Are there any figures for that?

A quick way of getting an idea of the effect might be to change Py_INCREF and Py_DECREF to go through the relevant motions, and see what timings are produced for single-threaded code. It wouldn't be a working implementation, but you'd find out pretty quickly whether it would be a disaster.

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand
Carpe post meridiem! (I'm not a morning person.)
[EMAIL PROTECTED]
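Greg's quick experiment could look something like the following (a sketch against a hypothetical object struct, not CPython's actual macros; `MyObject`, `current_thread`, and the macro names are made up). The extra owner-thread compare stands in for "the relevant motions": timing single-threaded code with and without it would estimate the per-operation slowdown without building a working implementation:

```c
/* Hypothetical stand-in for CPython's object header, extended with
 * the proposal's owner_thread field. */
typedef struct {
    long ob_refcnt;
    int  owner_thread;
} MyObject;

static int current_thread = 0;  /* would be %gs-based in the proposal */

/* Instrumented INCREF/DECREF: the compare-and-branch approximates the
 * cost of the ownership check on every refcount operation, without
 * implementing any actual locking. */
#define My_INCREF(op)                                       \
    do {                                                    \
        if ((op)->owner_thread != current_thread)           \
            (op)->owner_thread = current_thread; /* stub */ \
        (op)->ob_refcnt++;                                  \
    } while (0)

#define My_DECREF(op)                                       \
    do {                                                    \
        if ((op)->owner_thread != current_thread)           \
            (op)->owner_thread = current_thread; /* stub */ \
        (op)->ob_refcnt--;                                  \
    } while (0)
```

Running a refcount-heavy benchmark compiled with these macros versus the plain ones would give the quick disaster check Greg describes.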