I have two servers in AWS.  One is a live production server (a multisite 
WordPress installation with hundreds of sites and about 5,000 users) and 
the other is a clone of prod that is being configured as a test server. 
 The live one has four array servers and an Elastic Load Balancer, and is 
connected to a large RDS instance in AWS.  Until yesterday, I naively 
thought our caching was being handled via APC and a WordPress plugin here 
and there.  But no.  It turns out someone here had added AWS's ElastiCache 
to our live server.  For those not in the know, ElastiCache is essentially 
memcache in the cloud.

Anyway, we tried to enable caching on our test server two days ago and it 
introduced a really strange bug (a redirect mysteriously appeared on our 
live site's main admin dashboard that then went to our test server).  So 
once we realized the bug was most likely related to a caching system we 
didn't know we had, we disabled caching.  As it turned out, when we enabled 
caching on our test server, it used the same ElastiCache server our live 
server was using (because test was a clone of live).  We disabled it by 
removing/renaming the object-cache.php file.
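
To make the suspected collision concrete, here is a minimal sketch in 
Python (not actual WordPress code).  It assumes the cache drop-in builds 
keys from a salt plus a group and a name, and that both servers ended up 
with the same salt because test was cloned from live.  The SharedCache 
class, the build_key helper, the key layout and the example.org domains are 
all hypothetical stand-ins.

```python
# Hypothetical model of two WordPress installs sharing one cache endpoint.
# Nothing here is real WordPress or ElastiCache code; names are illustrative.

class SharedCache:
    """Stands in for the single cache cluster both servers pointed at."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)


def build_key(salt, group, name):
    # Assumed key layout: salt, then group, then option name.
    return f"{salt}:{group}:{name}"


cache = SharedCache()

# Live and its clone use the same (empty) salt, so their keys are identical.
live_key = build_key("", "options", "siteurl")
test_key = build_key("", "options", "siteurl")

cache.set(live_key, "https://live.example.org")
cache.set(test_key, "https://test.example.org")  # test overwrites live's entry

collided = cache.get(live_key)  # live now reads the test domain

# With a distinct salt per environment the entries no longer collide.
cache.set(build_key("live", "options", "siteurl"), "https://live.example.org")
cache.set(build_key("test", "options", "siteurl"), "https://test.example.org")
isolated = cache.get(build_key("live", "options", "siteurl"))
```

Some memcached drop-ins read a WP_CACHE_KEY_SALT constant from wp-config.php 
for this kind of separation; whether a given object-cache.php honors it has 
to be checked in that file.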

Disabling it solved our redirect issue, but suddenly many (not all) of our 
5,000 users could no longer log into their individual sites.  For some 
reason, the password values in our database did not work for a good 
percentage of users, forcing them to reset their passwords instead. 
 Obviously, this is huge with 5,000 users in the mix.  So we reenabled 
caching on our live instance and decided to fix our cached redirect with WP 
configuration changes instead (we added define('RELOCATE',true); to the 
config to override the redirect to our test server).  

One of the things we noticed with memcache was that it kept updating our 
wp_options table with the domain for the test server in place of our live 
one.  In fact, it's still doing it: whenever I run a query to find the 
string for the test domain and update it to the live domain, the caching 
changes it back within a few minutes. Scary. But it looks like our 
configuration change forces an override for now.  The really concerning 
thing about all this is that memcache seems to be drawing from its own 
key:value pairs for the user passwords instead of directly from the 
database.  With caching enabled, the users can get in.  Without it, many 
users are forced to reset their passwords.

Does anyone have any ideas on how to understand what's going on with 
memcache in this case, and how to fix it so that the database gets written 
to appropriately and password info isn't just being held in the cache?  To 
my thinking it's a ticking time bomb.  All it would take is one flush_all 
command to make life very, very painful for most of my users.
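
One way to start understanding the divergence would be to compare what the 
cache holds against what the database holds for the same keys.  Below is a 
minimal Python sketch of that comparison, using plain dictionaries as 
stand-ins for the memcache contents and the wp_options rows.  The 
find_divergent helper and the sample values are hypothetical, not real data 
from our servers.

```python
def find_divergent(cache_values, db_values):
    """Return keys whose cached value differs from the database value.

    Both arguments are plain dicts here; in practice they would be filled
    from the cache cluster and from a SQL query (assumed, not shown).
    """
    divergent = {}
    for key, cached in cache_values.items():
        stored = db_values.get(key)
        if stored != cached:
            divergent[key] = {"cache": cached, "db": stored}
    return divergent


# Hypothetical sample data.
cache_values = {
    "siteurl": "https://test.example.org",
    "home": "https://live.example.org",
}
db_values = {
    "siteurl": "https://live.example.org",
    "home": "https://live.example.org",
}

report = find_divergent(cache_values, db_values)
for key, pair in sorted(report.items()):
    print(f"{key}: cache={pair['cache']} db={pair['db']}")
```

Any key that shows up in the report is one where a flush_all would change 
what the site reads, so the list gives a rough measure of the exposure.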

We are on Nginx with MySQL on the RDS.
