Re: [Python-Dev] deja-vu .. python locking

2006-09-20 Thread Martin Devera
Greg Ewing wrote:
> Martin Devera wrote:
>> Greg, what change do you have in mind regarding that 3 instruction
>> addition to refcounting ?
>
> I don't have any change in mind. If even an atomic inc
> is too expensive, it seems there's no hope for us.

Just out of curiosity, would it be a big problem to remove refcounting and
live with garbage collection only? I'm not sure whether some Python code
depends on exact refcount behaviour (I guess it should not).
Probably not for the mainstream build, but maybe as a compile-time option,
as part of a free-threading solution only for those who need it.
Even if you can do a fast atomic inc/dec, it forces the cacheline holding
the refcounter to ping-pong between the caches of the referencing CPUs (for
read-only class dicts, for example), so you can probably never get good SMP
scalability.
Consider a main-memory latency of 100 ns on an 8-way 2 GHz SMP system where
parallel computation within the same Python class is going on on all CPUs.
If you perform class references in a loop, say 6400 instructions apart
(quite realistic), then at least one CPU will block on that inc/dec each
time, so you lose one CPU to overhead...
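For concreteness, such an atomic refcount would look roughly like the sketch
below (C11 atomics; `PyObjectStub` and its field names are illustrative
stand-ins, not CPython's real object layout). The instruction itself is a
single locked add on x86; the cost described above comes from the cacheline
holding the counter having to move, in exclusive state, to whichever CPU
touches the object next.

```c
#include <stdatomic.h>

/* Illustrative stub -- not CPython's real object layout. */
typedef struct {
    atomic_long ob_refcnt;
} PyObjectStub;

/* Atomic INCREF: one "lock xadd" on x86, but it pulls the cacheline
 * holding ob_refcnt into this CPU's cache in exclusive state,
 * evicting it from every other CPU that recently used the object. */
static inline void stub_incref(PyObjectStub *ob)
{
    atomic_fetch_add_explicit(&ob->ob_refcnt, 1, memory_order_relaxed);
}

/* Atomic DECREF: returns the new count so the caller can free at 0. */
static inline long stub_decref(PyObjectStub *ob)
{
    return atomic_fetch_sub_explicit(&ob->ob_refcnt, 1,
                                     memory_order_acq_rel) - 1;
}
```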
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] deja-vu .. python locking

2006-09-20 Thread Martin v. Löwis
Martin Devera schrieb:
> Just from curiosity, would be a big problem removing refcounting and live
> with garbage collection only ? I'm not sure if some parts of py code
> depends on exact refcnt behaviour (I guess it should not).

Now, this gives a true déjà-vu. Python applications often rely on
reference counting (in particular, that releasing a file object
will immediately close the file), despite the language reference
saying that this is not a Python feature, just an implementation
detail. In addition, implementing tracing garbage
collection would either be tedious or have additional consequences
for semantics: with a conservative GC, some objects may never
get collected; with a precise GC, you have to declare GC roots
at the C level. Things get more complicated if the GC is also
compacting. See the current thread on the py3k list.

Regards,
Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-20 Thread Josiah Carlson

Martin Devera [EMAIL PROTECTED] wrote:
[snip]
> Even if you can do fast atomic inc/dec, it forces cacheline with
> refcounter to ping-pong between caches of referencing cpus (for read only
> class dicts for example) so that you can probably never get good SMP
> scalability.

That's ok.  Why?  Because according to Guido, the GIL isn't going away:
http://mail.python.org/pipermail/python-3000/2006-April/001072.html
... so ruminations about refcounting, GC, etc., at least with regard to
removing the GIL towards some sort of free-threading Python, are
likely to go nowhere.  Unless, that is, someone is able to translate the
codebase to such methods, show that it is not (significantly) more
difficult to program extensions for, show only a mild to moderate slowdown
on single processors, and prove an actual speedup on multiple processors.
But even then it will be a difficult sell, as it could require radical
rewrites of the hundreds or thousands of CPython extensions currently
being developed and maintained.


 - Josiah



Re: [Python-Dev] deja-vu .. python locking

2006-09-19 Thread Martin Devera
Greg Ewing wrote:
> Martin Devera wrote:
>> As I've written in Big reader lock paragraph of the original
>> proposal, these objects could be handled by not blocking in read path
>
> But as was just pointed out, because of refcounting,
> there's really no such thing as read-only access to
> an object. What *looks* like read-only access at the
> Python level involves refcount updates just from the
> act of touching the object.

Yes, I was thinking about atomic inc/dec (locked inc/dec on x86)
as used in G. Stein's patch.
I have to admit that I haven't measured its performance; I was
hoping for a decent one. But from http://www.linuxjournal.com/article/6993
it seems that an atomic inc is rather expensive too (75 ns on a 1.8 GHz P4) :-(

Greg, what change do you have in mind regarding that 3 instruction
addition to refcounting ?
thanks, Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-19 Thread Martin Devera
> Ah, I think I understand now. First the minor critique: I believe
> the locking algorithm isn't thread-safe:
>
>   while (ob->owner_thread != self_thread()) {
>      unlock_mutex(thread_mutex[self_thread()])
>      // wait for owning thread to go to quiescent state
>      lock_mutex(thread_mutex[ob->owner_thread])
>      ob->owner_thread = self_thread()
>      unlock_mutex(thread_mutex[ob->owner_thread])
>      lock_mutex(thread_mutex[self_thread()])
>   }
>
> If two threads are competing for the same object held by a third
> thread, they may simultaneously enter the while loop, and then
> simultaneously try to lock the owner_thread. Now, one will win,
> and own the object. Later, the other will gain the lock, and
> unconditionally overwrite ownership. This will cause two threads
> to own the object, which is an error.

oops .. well, it seems a very stupid error on my side. Yes, you are
absolutely right, I'll have to rethink it. I hope it is possible
to do it in a correct way...
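One conceivable repair (a sketch only, not a proven design; the names mirror
the pseudocode, not any real patch, and the release/reacquire of the
waiter's own mutex from the original loop is omitted for brevity) is to make
the ownership handoff itself a compare-and-swap, so that of the threads that
raced into the loop only one can install itself as the new owner; the loser
sees a changed owner_thread and retries:

```c
#include <pthread.h>
#include <stdatomic.h>

enum { MAX_THREADS = 64 };
typedef int thread_t;

static pthread_mutex_t thread_mutex[MAX_THREADS];

static void init_thread_mutexes(void)
{
    for (int i = 0; i < MAX_THREADS; i++)
        pthread_mutex_init(&thread_mutex[i], NULL);
}

typedef struct {
    _Atomic thread_t owner_thread;
} LockableObject;

/* Sketch: claim ownership of ob for thread `self`.  Unlike the
 * original pseudocode, the handoff is a CAS, so two waiters cannot
 * both blindly overwrite owner_thread after winning the mutex in turn. */
static void lock_object(LockableObject *ob, thread_t self)
{
    while (atomic_load(&ob->owner_thread) != self) {
        thread_t cur = atomic_load(&ob->owner_thread);
        /* wait for the current owner to reach a quiescent state */
        pthread_mutex_lock(&thread_mutex[cur]);
        /* claim ownership only if `cur` is still the owner; if another
         * waiter got here first, the CAS fails and the loop retries */
        thread_t expected = cur;
        atomic_compare_exchange_strong(&ob->owner_thread, &expected, self);
        pthread_mutex_unlock(&thread_mutex[cur]);
    }
}
```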

> The more fundamental critique is: Why? It seems you do this
> to improve efficiency, (implicitly) claiming that it is
> more efficient to keep holding the lock, instead of releasing
> and re-acquiring it each time.
>
> I claim that this doesn't really matter: any reasonable
> mutex implementation will be fast if there is no lock
> contention. On locking, it will not invoke any system
> call if the lock is currently not held (but just
> atomically test-and-set some field of the lock); on
> unlocking, it will not invoke any system call if
> the wait list is empty. As you also need to test, there
> shouldn't be much of a performance difference.

I measured it. A lock operation in futex-based Linux locking is the same
speed as a Windows critical section; it is about 30 cycles on my
P4 1.8 GHz in the uncontended case.
As explained in the already mentioned http://www.linuxjournal.com/article/6993,
this seems to be due to the pipeline flush during the cmpxchg instruction.
And there will be a cacheline transfer penalty which is much larger. So
mutex locking will take time comparable to the protected code itself
(assuming fast code like a dict/list read).
A single compare takes ten times less.
Am I missing something?

thanks, Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-19 Thread Greg Ewing
Martin Devera wrote:

> Greg, what change do you have in mind regarding that 3 instruction
> addition to refcounting ?

I don't have any change in mind. If even an atomic inc
is too expensive, it seems there's no hope for us.

--
Greg


[Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin Devera
Hello,

as someone wrote in the FAQ, sometimes someone starts a thread about
finer-grained locking in Python.
OK, here is one.

I don't want to start a flamewar. I only seek suggestions and constructive
criticism. I have some ideas which are new in this context (I believe), and
I only wanted to make them public in case someone finds them interesting.

Comments are welcome.
Martin


Round 1, Greg Stein's patch
  The patch removes the GIL from version 1.6 and replaces locking of
  list, dict and other structures with finer-grained locking.
  The major slowdown seems to be in the list and dict structures; dicts
  are used for object attributes and these are accessed quite often.

  Because (IIUC) a mutex struct is quite heavy, dicts and lists are
  locked via a pool of locks. When you lock such a pooled lock you
  really have to lock two locks: one locks the pool itself, and the other
  locks the pooled lock (the second locking can be omitted in the
  uncontended case because locks in the pool are kept in locked state).
  One lock takes about 25 cycles on a UP P4 (mainly the pipeline flush
  during the memory barrier) and can be even more expensive (hundreds
  of cycles) due to cacheline moves between CPUs on an SMP machine.
  The global pool lock is subject to cacheline ping-pong as it will
  often be reacquired by competing CPUs.
  In mappinglookup there is lookmapping guarded by this locking scheme;
  lookmapping itself takes about 20 cycles in the best (one hopes typical)
  case, plus the compareobj cost (in the case of string keys, say...
  50-100 cycles?).
  Thus locking/unlocking the read takes 50-100 cycles while the operation
  itself is 70-120 cycles.
  One might expect about a 50% slowdown in the dict read path.
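The pooled-lock idea can be pictured roughly as below (a sketch built from
the description above, not code from the actual patch; the pool size and
hash are made up): each object hashes by address into a small fixed pool of
mutexes, so objects share locks instead of each carrying a heavy mutex
struct.

```c
#include <pthread.h>
#include <stdint.h>

#define LOCK_POOL_SIZE 64   /* made-up size; the real patch may differ */

static pthread_mutex_t lock_pool[LOCK_POOL_SIZE];

static void lock_pool_init(void)
{
    for (int i = 0; i < LOCK_POOL_SIZE; i++)
        pthread_mutex_init(&lock_pool[i], NULL);
}

/* Map an object address to its pooled lock.  Shifting drops the
 * low bits, which are identical for all aligned allocations. */
static size_t pool_index(const void *ob)
{
    return ((uintptr_t)ob >> 4) % LOCK_POOL_SIZE;
}

static void pooled_lock(const void *ob)
{
    pthread_mutex_lock(&lock_pool[pool_index(ob)]);
}

static void pooled_unlock(const void *ob)
{
    pthread_mutex_unlock(&lock_pool[pool_index(ob)]);
}
```

Note that two unrelated objects can hash to the same slot, which is
harmless for correctness but adds false contention on top of the costs
described above.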

RCU-like locking
  The solution I have in mind is similar to RCU. In Python we have a
  quiescent state - when a thread returns to the main loop of the
  interpreter.
  Let's add an owner_thread field to each lockable object. It records the
  last thread (by its id) which called any lockable method on the object.
  Each LOCK operation looks like:

  while (ob->owner_thread != self_thread()) {
     unlock_mutex(thread_mutex[self_thread()])
     // wait for owning thread to go to quiescent state
     lock_mutex(thread_mutex[ob->owner_thread])
     ob->owner_thread = self_thread()
     unlock_mutex(thread_mutex[ob->owner_thread])
     lock_mutex(thread_mutex[self_thread()])
  }

  There is no UNLOCK - we own the object now and can use it without locking
  (until we return to the interpreter loop or we call LOCK on another
  object).
  For non-shared objects there is only the penalty of the
  ob->owner_thread != self_thread() comparison. I'm not sure about Windows,
  but on recent Linuxes one can use the %gs register as the thread id, so
  the compare is about 3 cycles (and owner_thread should be in the object's
  cacheline anyway).
  In the contended case there is some cacheline ping-pong on ob and the
  mutex, but that is expected.

Deadlocks
  Our object ownership is long - from acquiring the object in LOCK until the
  thread's next quiescent state. Thus when two threads each want to step on
  the other's object, they will deadlock. A simple solution is to extend the
  set of quiescent states: a thread releases its thread_mutex in the main
  loop (and immediately reacquires it). Additionally, it can release it just
  before it is going to wait on another thread's mutex, as in LOCK (already
  in the code above). If you use LOCK correctly, then while you are LOCKing
  an object you can't be in a vulnerable section of any OTHER object, so you
  can let other threads take ownership of your own objects during that time.
  One may also want to release this lock when going to lock a mutex in the
  threading package, and in other places where the GIL is released today.
  However, I admit that I have done no formal proof regarding deadlock; I
  plan to do one if nobody can find another flaw in the proposal.

Big reader lock
  While the above scheme might work well, it would impose a performance
  penalty on shared dicts which are almost read-only (module.__dict__).
  For these, similar locking can be used, except that a writer has to wait
  until ALL other threads enter the quiescent state (taking all of their
  locks), then perform the change and unlock them all. Readers can read
  without any locking.
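In outline (again only a sketch of the idea using the per-thread mutexes
from the LOCK pseudocode, with an illustrative fixed thread count): readers
do nothing at all in the read path, while the rare writer must collect every
thread's mutex before mutating, which guarantees all other threads are in a
quiescent state.

```c
#include <pthread.h>

enum { NUM_THREADS = 8 };   /* illustrative fixed thread count */

static pthread_mutex_t thread_mutex[NUM_THREADS];

static void big_reader_init(void)
{
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_mutex_init(&thread_mutex[i], NULL);
}

/* Reader path: intentionally empty -- readers pay nothing. */

/* Writer path: take every per-thread mutex in a fixed order (to avoid
 * writer/writer deadlock); once all are held, every other thread is
 * provably in its quiescent state and the dict can be mutated. */
static void write_lock_all(void)
{
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_mutex_lock(&thread_mutex[i]);
}

static void write_unlock_all(void)
{
    for (int i = NUM_THREADS - 1; i >= 0; i--)
        pthread_mutex_unlock(&thread_mutex[i]);
}
```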

Compatibility with 3rd party modules
  I've read this argument on the python-dev list. Maybe I'm missing
  something, but would it be so complex for Py_InitModule4 to use an extra
  flag in apiver, for example?
  When at least one non-free-threaded module is loaded, locking is done in
  the good old way...


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin v. Löwis
Martin Devera schrieb:
> RCU like locking
>   Solution I have in mind is similar to RCU. In Python we have quiscent
>   state - when a thread returns to main loop of interpreter.

There might be a terminology problem here. RCU is read-copy-update,
right? I fail to see the copy (copy the data structure to be modified)
and update (replace the original pointer with a pointer to the copy) parts.
Do these play a role in that scheme? If so, what specific structure
is copied for, say, a list or a dict?

This confusion makes it very difficult for me to understand your
proposal, so I can't comment much on it. If you think it could work,
just go ahead and create an implementation.

Regards,
Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin Devera
Martin v. Löwis wrote:
> Martin Devera schrieb:
>> RCU like locking
>>   Solution I have in mind is similar to RCU. In Python we have quiscent
>>   state - when a thread returns to main loop of interpreter.
>
> There might be a terminology problem here. RCU is read-copy-update,
> right? I fail to see the copy (copy data structure to be modified)
> and update (replace original pointer with pointer to copy) part.
> Do this play a role in that scheme? If so, what specific structure
> is copied for, say, a list or a dict?
>
> This confusion makes it very difficult for me to understand your
> proposal, so I can't comment much on it. If you think it could work,
> just go ahead and create an implementation.

That is why I used the word "similar". I see the similarity in the way RCU
achieves its safe-delete phase. Probably I selected a bad title for the
text. It is because I was reading about the RCU implementation in the Linux
kernel when I realized that the idea of postponing critical code to some
safe point in the future might work in the Python interpreter.

So you are right. It is not RCU. It only uses a technique similar to the one
RCU uses for freeing the old copy of the data.

It is based on the assumption that an object is typically used by a single
thread. You must lock it anyway, just in case another thread steps on it.
The idea is that each object is owned by a thread. The owner can use its
objects without locking. If a thread wants to use a foreign object then it
has to wait for the owning thread to go to some safe place (out of the
interpreter, into LOCK of another object...).
This is done with a per-thread lock, and it is necessary because the owner
does no locking; you can only be sure that nobody is using the object while
the former owner is somewhere outside the object.

Regarding implementation, I wanted to ask for some opinions before starting
to implement something as big as this patch. Probably someone can look and
say: hey, it is stupid, you forgot that FILL_IN... ;-)

I hope I explained it better this time; I know my English is not the best.
At least it's worse than my Python :-)

thanks for your time, Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Jean-Paul Calderone
On Mon, 18 Sep 2006 17:06:47 +0200, Martin Devera [EMAIL PROTECTED] wrote:
> Martin v. Löwis wrote:
>> Martin Devera schrieb:
>>> RCU like locking
>>>   Solution I have in mind is similar to RCU. In Python we have quiscent
>>>   state - when a thread returns to main loop of interpreter.
>>
>> There might be a terminology problem here. RCU is read-copy-update,
>> right? I fail to see the copy (copy data structure to be modified)
>> and update (replace original pointer with pointer to copy) part.
>> Do this play a role in that scheme? If so, what specific structure
>> is copied for, say, a list or a dict?
>>
>> This confusion makes it very difficult for me to understand your
>> proposal, so I can't comment much on it. If you think it could work,
>> just go ahead and create an implementation.
>
> It is why I used a word similar. I see the similarity in a way to archieve
> safe delete phase of RCU. Probably I selected bad title for the text. It
> is because I was reading about RCU implementation in Linux kernel and
> I discovered that the idea of postponing critical code to some safe point
> in future might work in Python interpreter.
>
> So that you are right. It is not RCU. It only uses similar technique as
> RCU uses for free-ing old copy of data.
>
> It is based on assumption that an object is typicaly used by single thread.

Which thread owns builtins?  Or module dictionaries?  If two threads are
running the same function and share no state except their globals, won't
they constantly be thrashing on the module dictionary?  Likewise, if the
same method is running in two different threads, won't they thrash on the
class dictionary?

Jean-Paul


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin Devera
>> So that you are right. It is not RCU. It only uses similar technique as
>> RCU uses for free-ing old copy of data.
>>
>> It is based on assumption that an object is typicaly used by single
>> thread.
>
> Which thread owns builtins?  Or module dictionaries?  If two threads are
> running the same function and share no state except their globals, won't
> they constantly be thrashing on the module dictionary?  Likewise, if the
> same method is running in two different threads, won't they thrash on the
> class dictionary?

As I've written in the "Big reader lock" paragraph of the original proposal,
these objects could be handled by not blocking in the read path and by
waiting for all other threads to come home before modifying.
The locking mode could be selected either by something like __locking__ or
by detecting the mode at runtime.



Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Phillip J. Eby
At 07:08 PM 9/18/2006 +0200, Martin Devera wrote:
>>> So that you are right. It is not RCU. It only uses similar technique as
>>> RCU uses for free-ing old copy of data.
>>>
>>> It is based on assumption that an object is typicaly used by single
>>> thread.
>>
>> Which thread owns builtins?  Or module dictionaries?  If two threads are
>> running the same function and share no state except their globals, won't
>> they constantly be thrashing on the module dictionary?  Likewise, if the
>> same method is running in two different threads, won't they thrash on the
>> class dictionary?
>
> As I've written in Big reader lock paragraph of the original proposal,
> these objects could be handled by not blocking in read path and wait for
> all other threads to come home before modifying.

Changing an object's reference count is modifying it, and most accesses that
reach the dictionaries themselves involve refcount changes.  Your plan, so
far, does not appear to have any solution for reducing this overhead.

Module globals aren't so bad, in that you'd only have to lock and refcount
when frames are created and destroyed.  But access to class dictionaries to
obtain methods happens a lot more often, and refcounting is involved there
as well.

So, I think for your plan to work, you would have to eliminate reference
counting in order to bring the lock overhead down to a manageable level.



Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Martin v. Löwis
Martin Devera schrieb:
> It is based on assumption that an object is typicaly used by single
> thread. You must lock it anyway just for case if another thread steps
> on it. The idea is that each object is owned by a thread. Owner can
> use its objects without locking. If a thread wants to use foreign
> object then it has to wait for owning thread to go to some safe place
> (out of interpreter, into LOCK of other object..). It is done by
> per-thread lock and it is neccessary because owner does no locking,
> thus you can be sure that nobody it using the object when former
> owner is somewhere out of the object.

Ah, I think I understand now. First the minor critique: I believe
the locking algorithm isn't thread-safe:

  while (ob->owner_thread != self_thread()) {
     unlock_mutex(thread_mutex[self_thread()])
     // wait for owning thread to go to quiescent state
     lock_mutex(thread_mutex[ob->owner_thread])
     ob->owner_thread = self_thread()
     unlock_mutex(thread_mutex[ob->owner_thread])
     lock_mutex(thread_mutex[self_thread()])
  }

If two threads are competing for the same object held by a third
thread, they may simultaneously enter the while loop, and then
simultaneously try to lock the owner_thread. Now, one will win,
and own the object. Later, the other will gain the lock, and
unconditionally overwrite ownership. This will cause two threads
to own the object, which is an error.

The more fundamental critique is: Why? It seems you do this
to improve efficiency, (implicitly) claiming that it is
more efficient to keep holding the lock, instead of releasing
and re-acquiring it each time.

I claim that this doesn't really matter: any reasonable
mutex implementation will be fast if there is no lock
contention. On locking, it will not invoke any system
call if the lock is currently not held (but just
atomically test-and-set some field of the lock); on
unlocking, it will not invoke any system call if
the wait list is empty. As you also need to test, there
shouldn't be much of a performance difference.
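The fast path described here can be sketched with C11 atomics (a toy lock
only meant to show the uncontended case; a real futex-backed mutex parks the
thread in the kernel on the slow path instead of simply failing):

```c
#include <stdatomic.h>

/* 0 = free, 1 = held.  Toy lock: only the uncontended path matters
 * here; a real implementation falls back to futex_wait on failure. */
typedef struct { atomic_int state; } fastlock_t;

/* Uncontended acquire: one atomic compare-and-swap, no system call.
 * Returns 1 on success, 0 if the lock is already held. */
static int fastlock_tryacquire(fastlock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, 1);
}

/* Uncontended release: a plain atomic store, again no system call
 * as long as no waiter has registered itself. */
static void fastlock_release(fastlock_t *l)
{
    atomic_store(&l->state, 0);
}
```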

Regards,
Martin


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Greg Ewing
Martin Devera wrote:

> Regarding implementation, I wanted to look for some opinions before
> starting to implement something as big as this patch. Probably someone
> can look and say, hey it is stupit, you forgot that FILL_IN ... ;-)

If I understand correctly, your suggestion for avoiding
deadlock relies on the fact that a given thread can really
only have one object locked at a time, i.e. after you
LOCK an object you can only assume you own it until
you LOCK another object or return to some quiescent
state. Is this right?

If so, the question is whether it's sufficient to be
able to lock just one object at a time. Maybe it is,
but some more formal consideration of that might be
a good idea.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiem! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Greg Ewing
Martin Devera wrote:

> As I've written in Big reader lock paragraph of the original proposal,
> these objects could be handled by not blocking in read path

But as was just pointed out, because of refcounting,
there's really no such thing as read-only access to
an object. What *looks* like read-only access at the
Python level involves refcount updates just from the
act of touching the object.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiem! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] deja-vu .. python locking

2006-09-18 Thread Greg Ewing
Phillip J. Eby wrote:

> So, I think for your plan to work, you would have to eliminate reference
> counting, in order to bring the lock overhead down to a manageable level.

There's a possibility it wouldn't be atrociously bad.
It seems like it would only add the 3 instructions or
whatever of overhead to most refcount operations.

How much this would reduce performance depends on
what percentage of time is currently spent on refcounting.
Are there any figures for that?

A quick way of getting an idea of how much effect
it would have might be to change Py_INCREF and
Py_DECREF to go through the relevant motions, and
see what timings are produced for single-threaded
code. It wouldn't be a working implementation, but
you'd find out pretty quickly if it were going to
be a disaster.
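Such an experiment might look like the sketch below (hypothetical names:
`ObjStub` and its fields stand in for the real object header, and the branch
merely mimics the proposed "is this my object?" test, so timing
single-threaded code with this macro approximates the scheme's refcount
overhead without being a working implementation):

```c
#include <stdatomic.h>

/* Illustrative stand-in for a Python object header. */
typedef struct {
    atomic_long ob_refcnt;
    int         owner_thread;
} ObjStub;

/* Hypothetical instrumented INCREF: adds the owner check the
 * free-threading scheme would need on every refcount operation. */
#define STUB_INCREF(op, self_tid)                                       \
    do {                                                                \
        if ((op)->owner_thread == (self_tid))                           \
            /* owner path: uncontended, relaxed add */                  \
            atomic_fetch_add_explicit(&(op)->ob_refcnt, 1,              \
                                      memory_order_relaxed);            \
        else                                                            \
            /* foreign path: full atomic add */                         \
            atomic_fetch_add(&(op)->ob_refcnt, 1);                      \
    } while (0)
```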

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiem! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+