I have completed a solution to this problem, which can be a drop-in update for the existing apr_memcache.c. It is now checked in for my module as http://code.google.com/p/modpagespeed/source/browse/trunk/src/third_party/aprutil/apr_memcache2.c .
It differs from the solution in https://issues.apache.org/bugzilla/show_bug.cgi?id=51065 in that: - It doesn't require an API change; it but it enforces the 50ms timeout that already exists for apr_multgetp for all operations. - It works under my load test (which I found is not true of the patch in 51065). For my own purposes, I will be shipping my module with apr_memcache2 so I get the behavior I want regardless of what version of Apache is installed. But I'd like to propose my patch for apr_memcache.c. The patch is attached, and I've also submitted it as an alternative patch to bug 51065. If you agree with the strategy I used to solve this problem, then please let me know if I can help with any changes required to get this into the main distribution, On Mon, Oct 22, 2012 at 5:21 PM, Joshua Marantz <jmara...@google.com> wrote: > I've had some preliminary success with my own variant of apr_memcache.c > (creatively called apr_memcache2.c). Rather than setting the socket > timeout, I've been mimicing the timeout strategy I saw in > apr_memcache_multgetp, by adding a new helper method: > > static apr_status_t wait_for_server_or_timeout(apr_pool_t* temp_pool, > apr_memcache2_conn_t* conn) > { > apr_pollset_t* pollset; > apr_status_t rv = apr_pollset_create(&pollset, 1, temp_pool, 0); > if (rv == APR_SUCCESS) { > apr_pollfd_t pollfd; > pollfd.desc_type = APR_POLL_SOCKET; > pollfd.reqevents = APR_POLLIN; > pollfd.p = temp_pool; > pollfd.desc.s = conn->sock; > pollfd.client_data = NULL; > apr_pollset_add(pollset, &pollfd); > apr_int32_t queries_recvd; > const apr_pollfd_t* activefds; > rv = apr_pollset_poll(pollset, MULT_GET_TIMEOUT, &queries_recvd, > &activefds); > if (rv == APR_SUCCESS) { > assert(queries_recvd == 1); > assert(activefds->desc.s == conn->sock); > assert(activefds->client_data == NULL); > } > } > return rv; > } > > And calling that before many of the existing calls to get_server_line as: > > rv = wait_for_server_or_timeout_no_pool(conn); > if (rv != APR_SUCCESS) { > ms_release_conn(ms, conn); > return rv; > } > > This is just an experiment; I think I can streamline this by > pre-populating the pollfd structure as part of the apr_memcache_conn_t > (actually now apr_memcache2_conn_t). > > I have two questions about this: > > 1. I noticed the version of apr_memcache.c that ships with Apache 2.4 is > somewhat different from the one that ships with Apache 2.2. In particular > the 2.4 version cannot be compiled against the headers that come with a 2.2 > distribution. Is there any downside to taking my hacked 2.2 apr_memcache.c > and running it in Apache 2.4? Or should I maintain two hacks? > > 2. This seems wasteful in terms of system calls. I am making an extra > call to poll, rather than relying on the socket timeout. The socket > timeout didn't work as well as this though. Does anyone have any theories > as to why, or what could be done to the patch in > https://issues.apache.org/bugzilla/show_bug.cgi?id=51065 to work? > > -Josh > > > On Fri, Oct 19, 2012 at 9:25 AM, Joshua Marantz <jmara...@google.com>wrote: > >> Following up: I tried doing what I suggested above: patching that change >> into my own copy of apr_memcache.c It was first of all a bad idea to pull >> in only part of apr_memcache.c because that file changed slightly between >> 2.2 and 2.4 and our module works in both. >> >> I was successful making my own version of apr_memcache (renaming >> entry-points apr_memcache2*) that I could hack. But if I changed the >> socket timeout from -1 to 10 seconds, then the system behaved very poorly >> under load test (though it worked fine in our unit-tests and system-tests). >> In other words, I think the proposed patch that Jeff pointed to above is >> not really working (as he predicted). This test was done without >> SIGSTOPing the memcached; it would timeout under our load anyway and >> thereafter behave poorly. >> >> I'm going to follow up on that bugzilla entry, but for now I'm going to >> pursue my own complicated mechanism of timing out the calls from my side. >> >> -Josh >> >> >> On Thu, Oct 18, 2012 at 10:46 AM, Joshua Marantz <jmara...@google.com>wrote: >> >>> Thanks Jeff, that is very helpful. We are considering a course of >>> action and before doing any work toward this, I'd like to understand the >>> pitfalls from people that understand Apache better than us. >>> >>> Here's our reality: we believe we need to incorporate memcached for >>> mod_pagespeed <http://modpagespeed.com> to scale effectively for very >>> large sites & hosting providers. We are fairly close (we think) to >>> releasing this functionality as beta. However, in such large sites, stuff >>> goes wrong: machines crash, power failure, fiber cut, etc. When it does we >>> want to fall back to serving partially unoptimized sites rather than >>> hanging the servers. >>> >>> I understand the realities of backward-compatible APIs. My expectation >>> is that this would take years to make it into an APR distribution we could >>> depend on. We want to deploy this functionality in weeks. The workarounds >>> we have tried backgrounding the apr_memcache calls in a thread and timing >>> out in mainline are complex and even once they work 100% will be very >>> unsatisfactory (resource leaks; Apache refusing to exit cleanly on >>> 'apachectl stop') if this happens more than (say) once a month. >>> >>> Our plan is to copy the patched implementation of >>> apr_memcache_server_connect and the static methods it calls into a new .c >>> file we will link into our module, naming the new entry-point something >>> else (apr_memcache_server_connect_with_timeout seems good). From a >>> CS/SE perspective this is offensive and we admit it, but from a product >>> quality perspective we believe this beats freezes and complicated/imperfect >>> workarounds with threads. >>> >>> So I have two questions for the Apache community: >>> >>> 1. What are the practical problems with this approach? Note that in >>> any case a new APR rev would require editing/ifdefing our code anyway, >>> so I >>> think immunity from APR updates such as this patch being applied is not a >>> distinguishing drawback. >>> 2. Is there an example of the correct solution to the technical >>> problem Jeff highlighted: "it is apparently missing a call to adjust >>> the socket timeout and to discard the connection if the timeout is >>> reached". >>> That sounds like a pattern that might be found elsewhere in the Apache >>> HTTPD code base. >>> >>> Thanks in advance for your help! >>> -Josh >>> >>> >>> On Wed, Oct 17, 2012 at 8:16 PM, Jeff Trawick <traw...@gmail.com> wrote: >>> >>>> On Wed, Oct 17, 2012 at 3:36 PM, Joshua Marantz <jmara...@google.com> >>>> wrote: >>>> > Is there a mechanism to time out individual operations? >>>> >>>> No, the socket connect timeout is hard-coded at 1 second and the >>>> socket I/O timeout is disabled. >>>> >>>> Bugzilla bug https://issues.apache.org/bugzilla/show_bug.cgi?id=51065 >>>> has a patch, though it is apparently missing a call to adjust the >>>> socket timeout and to discard the connection if the timeout is >>>> reached. More importantly, the API can only be changed in future APR >>>> 2.0; alternate, backwards-compatible API(s) could be added in future >>>> APR-Util 1.6. >>>> >>>> > >>>> > If memcached freezes, then it appears my calls to 'get' will block >>>> until >>>> > memcached wakes up. Is there any way to set a timeout for that call? >>>> > >>>> > I can repro this in my unit tests by sending a SIGSTOP to memcached >>>> before >>>> > doing a 'get'? >>>> > >>>> > Here are my observations: >>>> > >>>> > apr_memcache_multgetp seems to time out in bounded time if I SIGSTOP >>>> the >>>> > memcached process. Yes! >>>> > >>>> > apr_memcache_getp seems to hang indefinitely if I SIGSTOP the >>>> memcached >>>> > process. >>>> > >>>> > apr_memcache_set seems to hang indefinitely if I SIGSTOP the memcached >>>> > process. >>>> > >>>> > apr_memcache_delete seems to hang indefinitely if I SIGSTOP the >>>> memcached >>>> > process. >>>> > >>>> > apr_memcache_stats seems to hang indefinitely if I SIGSTOP the >>>> memcached >>>> > process. >>>> > >>>> > That last one really sucks as I am using that to print the status of >>>> all my >>>> > cache shards to the log file if I detected a problem :( >>>> > >>>> > >>>> > Why does apr_memcache_multgetp do what I want and not the others? >>>> Can I >>>> > induce the others to have reasonable timeout behavior? >>>> > >>>> > When I SIGSTOP memcached this makes it hard to even restart Apache, at >>>> > least with graceful-stop. >>>> > >>>> > >>>> > On a related note, the apr_memcache >>>> > documentation< >>>> http://apr.apache.org/docs/apr-util/1.4/group___a_p_r___util___m_c.html >>>> >is >>>> > very thin. I'd be happy to augment it with my observations on its >>>> > usage >>>> > and the meaning of some of the arguments if that was desired. How >>>> would I >>>> > go about that? >>>> >>>> Check out APR trunk from Subversion, adjust the doxygen docs in the >>>> include files, build (make dox) and inspect the results, submit a >>>> patch to d...@apr.apache.org. >>>> >>>> > >>>> > -Josh >>>> >>>> >>>> >>>> -- >>>> Born in Roswell... married an alien... >>>> http://emptyhammock.com/ >>>> >>> >>> >> >