Tres Seaver wrote:
Paul Williams wrote:
Ok, here is what we have. I did a netstat on both machines, client and
server. The client sees an established connection and the server does
not. In the server log there is a disconnect. As for hardware
between them, there is a switch (Dell PowerConnect 6024). Web Server
Directors might get hold of it, but there are no hops on traceroute.
Traceroute only shows the client machine and the server machine.
So the client is just continuously polling the connection but getting
nothing back.
That sounds like some weird kernel / networking problem to me: I don't
see how Zope could be able to keep calling 'select' on a socket after
the other side has closed it.
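
To illustrate the point above, here is a minimal sketch (not ZEO code) of what normally happens when a peer closes its end: 'select' reports the socket as readable and 'recv' returns an empty string. The function name is hypothetical, and a local socket pair stands in for the real client / server link.

```python
import select
import socket


def peer_close_is_visible(timeout=1.0):
    """Close one end of a socket pair and report what the other end sees.

    Returns (readable, data): whether select() flagged the surviving
    socket as readable, and what recv() returned from it.
    """
    a, b = socket.socketpair()
    try:
        b.close()  # the "other side" goes away cleanly
        readable, _, _ = select.select([a], [], [], timeout)
        # An orderly close shows up as end-of-file: recv() returns b"".
        data = a.recv(1) if readable else None
        return bool(readable), data
    finally:
        a.close()
```

If the close notification never reaches the client (for example, because something between the hosts drops it), the client has nothing to read and 'select' keeps timing out, which would match the netstat output described above. That explanation is an assumption, not something established in this thread.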
We agree. This is a strange situation that none of us have seen before.
However, we have until tomorrow to do something, and replacing hardware
is not feasible.
Is there any possibility that some kind of failover / IP takeover has
happened, such that the storage server now running is not the same host
/ instance as the one to which the clients originally connected? Are
you using LVS + heartbeat, or some kind of hardware load balancer to
manage such redundancy?
We do have Web Services Directors that do load balancing, but in this
particular case the storage server is not set up for load balancing. I
am not aware of any features that make the ZODB capable of clustering
except for replication services offered through Zope.
We are not sure whether the traffic is going to the Web Services
Directors or not. Even if it is, there are thousands of settings and
there is no one available who knows what to change.
The storage server is a simple NAS server with a static IP address.
What we are thinking about doing is changing the code in
zrpc/connection.py to close the connection in wait (line 638, Zope
version 2.9.5) if the wait time gets too large or the poll has happened
too many times.
We are great at Plone development, but have very little backend Zope
development experience. Would someone please advise me as to whether
this is going to cause more problems?
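
The idea described above could be sketched roughly as follows. This is a standalone illustration, not the code in zrpc/connection.py: the function name, the 'WaitTimeout' exception, and the default limits are all hypothetical, and a real patch would need to hook into ZEO's own reconnect logic.

```python
import select
import socket
import time


class WaitTimeout(Exception):
    """Hypothetical error raised when no reply arrives within the limit."""


def wait_for_reply(sock, max_wait=30.0, initial_delay=0.001, max_delay=1.0):
    """Poll 'sock' with a doubling delay; give up after 'max_wait' seconds.

    On timeout the socket is closed, so the caller is forced to open a
    fresh connection instead of polling a dead one forever.
    """
    deadline = time.monotonic() + max_wait
    delay = initial_delay
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            sock.close()
            raise WaitTimeout("no reply within %.1f seconds" % max_wait)
        # Never sleep past the deadline.
        readable, _, _ = select.select([sock], [], [], min(delay, remaining))
        if readable:
            return  # data (or end-of-file) is ready to be read
        delay = min(delay * 2, max_delay)
```

One thing to keep in mind with this approach: any request that was in flight when the socket is closed is lost, so whatever calls the wait has to be prepared to retry after reconnecting.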
According to the log message you posted earlier in the thread, your
appservers are spewing thousands of log messages from the connection's
'pending' method, although your deadlock debugger output shows the one
thread blocked on 'select' inside of the connection's 'wait' method.
There should be lots of log messages at TRACE level for the wait call,
including a doubling / backoff of the delay value from 1 ms to 1 sec.
Do you see those log messages, as well?
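
For reference, a doubling delay of the kind described above would produce a sequence like the one generated below. This is only a sketch of the pattern; the generator name and its parameters are hypothetical and are not taken from the ZEO source.

```python
from itertools import islice


def backoff_delays(initial=0.001, maximum=1.0):
    """Yield delays that double each time, capped at 'maximum' seconds."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, maximum)


# First twelve delays: 1 ms doubling up to the 1 second cap.
delays = list(islice(backoff_delays(), 12))
```

Starting from 1 ms, the cap of 1 sec is reached on the eleventh poll, so a long run of TRACE messages all showing a 1 sec delay would mean the wait has been stuck for a while.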
These messages are there. You can see the time doubling. This is where
we were thinking of breaking the connection once it gets to a certain
point and making Zope reconnect.
This solves our hung connection problem, we think. However, I am hoping
someone can let me know if I am breaking something else by doing this.
Tres.
- --
===================================================================
Tres Seaver +1 540-429-0999 [EMAIL PROTECTED]
Palladion Software "Excellence by Design" http://palladion.com
_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope-dev )