It's starting to look a lot like the Windows bind() implementation is unreliable, sometimes (but rarely -- hard to provoke) allowing two sockets to bind to the same (address, port) pair simultaneously, instead of raising 'Address already in use' for one of them. Disaster ensues.
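For reference, here is a minimal sketch of the behavior bind() is supposed to guarantee: with SO_REUSEADDR left unset, a second bind() to an (address, port) pair that is already bound should fail with "Address already in use". The function name second_bind_is_rejected is made up for illustration, and the sketch is written for Python 3. Being a sequential, single-process check, it does not reproduce the rare cross-process race described here; it only shows the expected outcome.

```python
import errno
import socket


def second_bind_is_rejected(host="127.0.0.1"):
    """Bind two sockets to one (host, port) pair; report whether the second fails."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for a free port; SO_REUSEADDR is deliberately not set.
        first.bind((host, 0))
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except socket.error as exc:
            # Expected outcome: "Address already in use".
            return exc.errno == errno.EADDRINUSE
        # Both binds succeeded: this is the failure mode under discussion.
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_is_rejected())
```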
WRT the last version of the code I posted, on another XP Pro SP2 machine (again after playing registry games to boost the number of ephemeral ports) I eventually saw all of: hangs during accept(); the assertion errors I mentioned last time; and mystery "Connection refused" errors during connect().

The variant of the code below _only_ tries to use port 19999. If it can't bind to that on the first try, socktest111() raises an exception instead of trying again (or trying a different port number).

Ran two processes. After about 15 minutes, both died with assert errors at about the same time (identical, so far as I could tell by eyeball):

Process A:

Traceback (most recent call last):
  File "socktest.py", line 209, in ?
    assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))

Process B:

Traceback (most recent call last):
  File "socktest.py", line 209, in ?
    assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))

So it's again the business where each process is recv'ing the random string intended to be recv'ed by a socket in the other process.

Hypothesized timeline:

    process A's `a` binds to 19999
    process B's `a` binds to 19999 -- according to me, this should be impossible in the absence of SO_REUSEADDR (which acts very differently on Windows than it does on Linux, BTW -- on Linux this should be impossible even in the presence of SO_REUSEADDR; regardless, we're not using SO_REUSEADDR here, and the braindead hard-coded w.setsockopt(socket.IPPROTO_TCP, 1, 1) is actually using the right magic constant for TCP_NODELAY on Windows, as it intends).
    A and B both listen()
    A connect()s, and accidentally gets on B.a's accept queue
    B connect()s, and accidentally gets on A.a's accept queue
    the rest follows inexorably

Note that because this never tries a port number other than 19999, it can't be a bulletproof workaround simply to hold on to the `a` socket. If the hypothesized timeline above is right, bind() can't be trusted on Windows in any situation where two processes may try to bind to the same hostname:port pair at the same time. Holding on to `a`, and cycling through port numbers when bind() failed, would still potentially leave two processes trying to bind to the same port number simultaneously (just a port other than 19999).

Ick: this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1), so if it is -- as is looking more and more likely -- an error in MS's socket implementation, it isn't avoided by switching to a newer MS C library.

Frankly, I don't see a sane way to worm around this -- it's difficult for application code to worm around what smells like a missing critical section in system code.

Using the simpler socket dance from the ZODB 3.4 code, I haven't yet seen an instance of the assert failure, or a hang. However, let two processes run that long enough simultaneously, and it always (so far) eventually fails with

    socket.error: (10048, 'Address already in use')

in the w.connect() call, and despite that Windows picks the port numbers here! While that also smells to heaven of a missing critical section in the Windows socket implementation, an exception is much easier to live with / worm around.

Alas, we don't have the MS source code, and I don't have time to try disassembling / reverse-engineering the opcodes (what EULA <wink>?), so best I can do is run this for many more hours to try to increase confidence that an exception is the worst that can occur under the ZODB 3.4 spelling.
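To make "an exception is much easier to live with" concrete, here is a rough sketch of a retrying socket dance. It is not the ZODB 3.4 code. It assumes that dance amounts to binding to port 0 so the OS picks the port, which is only inferred from the remark that Windows picks the port numbers. The names connected_pair, attempts and accept_timeout are hypothetical, and the code targets Python 3 rather than the 2.x interpreters mentioned above. In addition to retrying on "Address already in use", it compares r.getpeername() with w.getsockname(), which would detect the cross-wired pairs from the hypothesized timeline; it cannot repair an unreliable bind().

```python
import errno
import socket

# EADDRINUSE, plus the Windows error number quoted in the traceback above.
ADDRESS_IN_USE = (errno.EADDRINUSE, 10048)


def connected_pair(host="127.0.0.1", attempts=10, accept_timeout=5.0):
    """Return (r, w): a loopback socket pair verified to be wired to each other."""
    for _ in range(attempts):
        a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        r = None
        ok = False
        try:
            w.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            a.bind((host, 0))              # port 0: let the OS choose the port
            a.listen(1)
            a.settimeout(accept_timeout)   # turn a hung accept() into an error
            w.connect(a.getsockname())
            r, _addr = a.accept()
            # If some other socket landed on our accept queue, r's peer
            # address will not be w's local address.
            ok = r.getpeername() == w.getsockname()
            if ok:
                return r, w
        except socket.timeout:
            pass                           # accept() timed out: retry
        except socket.error as exc:
            if exc.errno not in ADDRESS_IN_USE:
                raise                      # anything else is unexpected
        finally:
            a.close()
            if not ok:                     # discard the failed attempt's sockets
                w.close()
                if r is not None:
                    r.close()
    raise RuntimeError("no verified socket pair after %d attempts" % attempts)
```

Usage is the same as for socktest111(): call connected_pair(), send on `w`, recv on `r`, and close both when done.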
Here's full code for the "only try port 19999" version:

import socket, errno
import time, random

def socktest111():
    """Raise an exception if we can't get 19999.
    """
    a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    # set TCP_NODELAY to true to avoid buffering
    w.setsockopt(socket.IPPROTO_TCP, 1, 1)
    # tricky: get a pair of connected sockets
    host = '127.0.0.1'
    port = 19999

    try:
        a.bind((host, port))
    except:
        raise RuntimeError
    else:
        print 'b',

    a.listen (1)
    w.setblocking (0)
    try:
        w.connect ((host, port))
    except:
        pass
    print 'c',
    r, addr = a.accept()
    print 'a',
    a.close()
    print 'c',
    w.setblocking (1)

    return (r, w)

sofar = []
try:
    while 1:
        try:
            stuff = socktest111()
        except RuntimeError:
            print 'x',
            time.sleep(random.random()/10)
            continue
        sofar.append(stuff)
        time.sleep(random.random()/10)
        if len(sofar) == 50:
            tup = sofar.pop(0)
            r, w = tup
            msg = str(random.randrange(1000000))
            w.send(msg)
            msg2 = r.recv(100)
            assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
            for s in tup:
                s.close()
except KeyboardInterrupt:
    for tup in sofar:
        for s in tup:
            s.close()

_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )