On Sat, 21 Jan 2006, Todd Maynard wrote:
> I want to thank you for ruining my plans for a relaxing Saturday > morning. As a thread newbie I killed several hours playing around with > your code. Hi Todd, Sorry about that. I hope you were relaxing in a cafe while playing with the code. > One thing I noticed is that sometimes the program would hang, which I > figured was the Queue code blocking in the Ticket claim function. I used > exception handling to deal with that situation cleanly. That's odd. There shouldn't be anything that blocks the code. Oh! Did you make changes to the test code, or did the hanging occur in the original code in: http://mail.python.org/pipermail/tutor/2006-January/044567.html I'm curious because nothing there should fundamentally block, assuming that _doJob() doesn't dies badly with an exception. If _doJob() dies, the server dies, and that's bad. *grin* Do you mind showing what the exception handling looks like in your code? > I then decided that it wasn't very nice of Starbucks to close after > accepting my order without giving me my Latte, so I changed that part of > the code to: [code cut] > I am 99.44% sure that this is thread safe, reasoning being: > setting the acceptNew to False and adding the QUIT_NOW happens in the > same > thread so it is impossible for another job to get scheduled after the > QUIT_NOW - so no thread will end up hanging... Bad news: no. *grin* There's a "race condition". Let's go into some detail with this, since this is not obvious stuff. First, let's look at the code again --- I'll label three lines with (a), (b), and (c), to make it a little easier to see the race. ############################################################### def schedule(self,job): if self.acceptNew == True: ## (a) outputQueue=Queue() self.queue.put((job,outputQueue)) return Ticket(outputQueue) else: print "Server not accepting any new requests." return None def scheduleShutdown(self): ## (b) self.queue.put((Server._QUIT_NICELY,None)) def _jobLoop(self): while True: print "Looping ... " (nextJob, outputQueue) = self.queue.get() if nextJob is server._QUIT_NOW: return if nextJob is server._QUIT_NICELY: ## (c) self.acceptNew = False self.queue.put((Server._QUIT_NOW,None)) else: returnValue=self._doJob(nextJob) outputQueue.put(returnValue) ############################################################## Let's imagine three threads, which I'll name C1, C2, and S. C1 and C2 will be distinct client threads, and S will be the server thread that runs through _jobLoop(). Imagine the following scenario. The server's online, and its work queue is empty. 1. C1 calls schedule(), and reaches the line labeled (a). At this point, server.acceptNew is True, so it goes into the body of the if statement. But wait... 2. Now we context switch to C2. C2 calls scheduleShutdown() in its entirety. There is now a _QUIT_NICELY element in the queue. C2 is done for. 3. Now we context switch to the server thread S. It grabs the _QUIT_NICELY, and puts a _QUIT_NOW. Let's imagine that S continues and loops again. In the next loop through _jobLoop(), it sees _QUIT_NOW and exits. S is done for. Muhahaha. 4. Now we context switch back to C1 and continue with: outputQueue = Queue() self.queue.put((job,outputQueue)) return Ticket(outputQueue) In this scenario, poor C1 is left holding a ticket that will never cash out. One way to fix this problem is to make calling schedule() and scheduleShutdown() "atomic" in this sense: if we're calling schedule(), we shouldn't be able to context switch into a call to scheduleShutdown(), and visa-versa. Our troubles started at step 2 of the above scenario, where two clients jostled for attention. If we prevent that particular situation --- if we force all our clients to stand in line to get served --- then we'll be fine. So we might look into some synchronizing tool, like a Lock object: http://www.python.org/doc/lib/lock-objects.html Concretely, we can add an exclusive Lock object to the server's __init__: def __init__(self): self.clientLock = Lock() and make sure to acquire-and-release in any of our schedule* functions: def schedule*(self, job): self.clientLock.acquire() try: ... finally: self.clientLock.release() > However, I would sleep a little better if you could reassure me that I > am right, and would sleep even better if you could give me a method to > test this. I'm sorry; I can't provide either. That doesn't mean that such things don't exist, but only that I don't know about them. (The only formal training I've received on this, so far, has been a standard intro Operating Systems CS course.) So you might want to check with others. The way I caught the race condition above was by just by trying to be in a very foul mood while reading the code. > This kinda stuff looks tricky to test with standard unittest > methodology.... It might be hard. It's an active CS research topic to formally check for race conditions. There may be commercial tools to build unit tests to hammer for race conditions, but I don't know much about them. For example, associate Professor Cormac Flanagan does research on statically detecting race conditions: http://www.soe.ucsc.edu/~cormac/ And Google Scholar does show a heck of a lot of recent papers on this: http://scholar.google.com/scholar?hl=en&lr=&q=thread+race+condition&btnG=Search so this might not be a closed issue yet. Hmmm... hey, Flanagan's paper looks interesting! Ok, I'm printing it out now... now where's my coffee... Weekend reading for me. Well, thanks for ruining my weekend too. *grin* _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor