[Bug 931696] [NEW] Server timeout state reset on IO error

Trevor North Mon, 13 Feb 2012 13:15:48 -0800

Public bug reported:

When an IO error is encountered the server state is reset to
MEMCACHED_SERVER_STATE_NEW even if it is currently
MEMCACHED_SERVER_STATE_IN_TIMEOUT.  The subsequent call to
memcached_mark_server_for_timeout then incorrectly pushes the next
connection retry time further back and increments the server failure
counter again.  This throws off the connection back-off handling, as it
appears there has been another failure when in fact we're just dealing
with a failure that is already in progress.
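
The sequence above can be sketched with a small toy model. This is not
libmemcached code: the types and functions below (Server, ServerState,
mark_for_timeout, on_io_error_unguarded, on_io_error_guarded) are
hypothetical stand-ins, and the sketch assumes that the marking routine
skips servers that are already in timeout, which is what the report
implies. The guarded variant shows one possible shape of a workaround
and is not necessarily what the patch referenced in this report does.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical, simplified stand-ins for the library's per-server state.
enum class ServerState { New, InTimeout };

struct Server {
  ServerState state = ServerState::New;
  unsigned failure_count = 0;   // failures recorded for back-off purposes
  std::time_t next_retry = 0;   // earliest time a reconnect is attempted
};

// Assumed behaviour of the marking step: a server that is already in
// timeout is left alone; otherwise a retry is scheduled and one failure
// is counted.
void mark_for_timeout(Server& s, std::time_t now, std::time_t retry_timeout) {
  assert(retry_timeout > 0);
  if (s.state == ServerState::InTimeout) {
    return;  // failure already in progress, nothing new to record
  }
  s.state = ServerState::InTimeout;
  s.next_retry = now + retry_timeout;
  ++s.failure_count;
}

// The reported behaviour: the state is reset unconditionally, so the
// check in mark_for_timeout never sees InTimeout and every IO error is
// counted as a fresh failure.
void on_io_error_unguarded(Server& s, std::time_t now, std::time_t retry_timeout) {
  s.state = ServerState::New;
  mark_for_timeout(s, now, retry_timeout);
}

// A possible guard: only reset the state when the server is not already
// in timeout, so the existing retry time and failure count are kept.
void on_io_error_guarded(Server& s, std::time_t now, std::time_t retry_timeout) {
  if (s.state != ServerState::InTimeout) {
    s.state = ServerState::New;
  }
  mark_for_timeout(s, now, retry_timeout);
}
```

With a 30 second retry timeout and two IO errors at t=100 and t=110,
the unguarded handler ends with two recorded failures and a retry time
of 140, while the guarded handler keeps one failure and a retry time of
130.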


This may only manifest itself as a problem when using consistent
distribution, due to the point at which the continuum is recalculated - I
haven't tested with any of the other distribution options.  It's
probably also more noticeable when making use of the dead server retry
behaviour included in 1.0.3+.  In a nutshell, after a server in the pool
is taken offline it should be possible to observe that retries do not
occur at the expected intervals and that failure counts are not
accurate.

I patched io.cc and quit.cc to work around this as part of the following
commit to my branch:
http://bazaar.launchpad.net/~trevor/libmemcached/dead-retry/revision/978

This may well be fixing the symptom rather than the cause, but I have
had the change running in production for quite some time now with no
apparent side-effects.  I do understand, though, that those changes
cause at least some of the tests to fail, which certainly warrants
further investigation.

I've been meaning to find the time to put together a proper example test
case and results for this, but that has been proving impossible of late.
I still wanted to get the issue logged though - please let me know if
I've not been clear enough here or if I can provide any more useful
information.

** Affects: libmemcached (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/931696

Title:
  Server timeout state reset on IO error

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/libmemcached/+bug/931696/+subscriptions

