While upgrading to the latest cURL 7.54.1, one of our internal tests
failed while exercising the new version. It seems that there are some
bugs in how timers are managed for a single easy handle that cause the
wrong "next timeout" value to be reported to the application when a new
minimum needs to be recomputed and that new minimum should be an
existing timer that isn't currently set for the easy handle. So, when
the application is driving a set of easy handles via the
`curl_multi_socket_action()` API (for example), it gets told to wait the
wrong amount of time before the next call, which causes requests to
linger for a long time (or, I would guess, possibly forever).
From its symptoms, this issue is very similar to one I previously
reported against an older version:
https://curl.haxx.se/mail/lib-2014-02/0038.html
The symptoms are similar enough that the test program that demonstrates
the bug (attached) is almost identical. However, the root cause is
different.
Note that the symptoms are very timing sensitive: if events and timers
are just so, the bug doesn't manifest itself. The faster your computer
and connection, the less likely you are to notice the problem, it seems.
From inspection of the code, it seems that cURL keeps a sorted linked
list of the outstanding (or expired) timers for each easy handle, and
additionally a splay tree in the multi handle containing the minimum
timer from each such list. Thus, it can easily find the soonest timer
expiry time point and use it to tell the application when to schedule
the next timer callback into cURL. This approach makes sense to me.
The trouble comes in how cURL maintains the invariants of the list and
splay tree. Although undocumented, it seems that the current code
(7.54.1 and master) tries to keep the "most recent" timer for each easy
handle _only_ in the splay tree and not in the linked list.
Unfortunately, the splay tree entry for such a timer does not contain
enough information to properly recompute the minimum timer for an easy
handle in all circumstances. And, the splay tree entry doesn't seem to
be properly considered when adding new timer instances. There seem to
be a few subtle bugs in the code that manages the relationship between
the splay tree and linked lists.
The attached patch changes the invariant such that every non-expired
timer must always exist in the easy handle's linked list. The minimum
such timer value still exists in the splay tree. The code to maintain
this invariant seems to be simpler, and seems to avoid all the bugs in
the original. In summary:
* When a timer is seen to expire by `Curl_splaygetbest()`, the timer
that becomes the handle's new splay-tree value is now also left in the
linked list by `add_next_timeout()`.
* When a timer is scheduled by `Curl_expire()`, an entry is always left
in the linked list.
* Also, in `Curl_expire()`, the old code used to assume the splay tree
entry value (as present in `expiretime`) was for the same `id` as was
being updated and would sometimes copy it into the linked list as if it
were, even though this assumption does not seem to be true. (The call
to `multi_addtimeout()` was passing `nowp` instead of `set` and
presuming `nowp` had something to do with `id`.) This seems like it
could lead to confusion of what timers are set for what values and loss
of scheduled timers, or worse. The patch avoids this.
There seem to be no bad side effects to keeping all these timers in the
linked lists. Doing so makes the computation of the next minimum timer
trivial. The expired timer values continue to be removed in the same
way as before.
The `output-unpatched.txt` and `output-patched.txt` attachments show the
output of the `test.c` program that demonstrates the bug. In the
unpatched output, note how the 2000 ms timeout is ignored (because it's
been forgotten since it isn't kept in the linked list) and so the
program is told to wait far too long. (Inspecting the program state
here reveals that `Curl_timeleft()` has meanwhile properly computed the
remaining time for the request; it's just the splay tree and linked list
giving the wrong answer, so cURL doesn't tell the application the correct
timeout value in the `CURLMOPT_TIMERFUNCTION`.)
The patch applies against 7.54.1. From inspection (only), it seems the
bug is present in older versions and in the current master on GitHub.
I hope this patch or something derived from it can be applied.
I also have a few observations and questions that came up while
investigating this bug, which may or may not help. Please take them as
they are intended: as helpful ideas, not criticisms:
1. After this patch, it seems like it might be worth replacing the easy
handle's `expiretime` value with a boolean indicating whether or not the
easy is represented in the splay tree. (Each easy's representative time
should always be that of the minimum timer in the list, I think.) This
patch does not attempt to make this change.
2. It seems odd that once an easy handle completes, it keeps timers in
its list and can contribute to the splay tree. Perhaps it would be best
to clear all timers from the list and splay tree when the easy handle
completes. This patch does not attempt to make this change.
3. It seems odd that once a DNS request completes (for example), its
timer (as scheduled by `Curl_expire()`) remains in the linked list.
(With the c-ares resolver, a non-configurable 300 second timer is
scheduled in what appears to be a safety mechanism, but it never gets
removed even when the DNS result arrives.) With this change (and debug
dumps of the linked list while experimenting) it became much more
obvious that this timer stays around. (This is, in fact, the source of
the ~299 second timer in the unpatched output from the test program.
Interestingly, if this timer _had_ been cleared, the test program would
likely never have terminated.) This patch does not attempt to change this.
--
Brad Spencer
diff -ur curl-7.54.1.pristine/lib/multi.c curl-7.54.1/lib/multi.c
--- curl-7.54.1.pristine/lib/multi.c 2017-06-09 06:57:41.000000000 -0300
+++ curl-7.54.1/lib/multi.c 2017-07-05 12:42:21.934875623 -0300
@@ -2520,10 +2520,8 @@
/* copy the first entry to 'tv' */
memcpy(tv, &node->time, sizeof(*tv));
- /* remove first entry from list */
- Curl_llist_remove(list, e, NULL);
-
- /* insert this node again into the splay */
+ /* Insert this node again into the splay. Keep the timer in the list in
+ case we need to recompute future timers. */
multi->timetree = Curl_splayinsert(*tv, multi->timetree,
&d->state.timenode);
}
@@ -2940,26 +2938,25 @@
set.tv_usec -= 1000000;
}
+ /* Remove any timer with the same id just in case. */
+ multi_deltimeout(data, id);
+
+ /* Add it to the timer list. It must stay in the list until it has expired
+ in case we need to recompute the minimum timer later. */
+ multi_addtimeout(data, &set, id);
+
if(nowp->tv_sec || nowp->tv_usec) {
/* This means that the struct is added as a node in the splay tree.
Compare if the new time is earlier, and only remove-old/add-new if it
is. */
time_t diff = curlx_tvdiff(set, *nowp);
- /* remove the previous timer first, if there */
- multi_deltimeout(data, id);
-
if(diff > 0) {
- /* the new expire time was later so just add it to the queue
- and get out */
- multi_addtimeout(data, &set, id);
+ /* The current splay tree entry is sooner than this new expiry time.
+ We don't need to update our splay tree entry. */
return;
}
-
- /* the new time is newer than the presently set one, so add the current
- to the queue and update the head */
- multi_addtimeout(data, nowp, id);
-
+
/* Since this is an updated time, we must remove the previous entry from
the splay tree first and then re-add the new value */
rc = Curl_splayremovebyaddr(multi->timetree,
@@ -2969,10 +2966,13 @@
infof(data, "Internal error removing splay node = %d\n", rc);
}
+ /* Indicate that we are in the splay tree and insert the new timer expiry
+ value since it is our local minimum. */
*nowp = set;
data->state.timenode.payload = data;
multi->timetree = Curl_splayinsert(*nowp, multi->timetree,
&data->state.timenode);
+ return;
}
/*
===== output-patched.txt =====

Timeout changed to 0 ms
Slowing things down by sleeping 500 ms
poll() on 0 fds for 0 ms
call curl_multi_socket_action() for timeout
Add fd=3 events=1
Timeout changed to 0 ms
Slowing things down by sleeping 500 ms
poll() on 1 fds for 0 ms
call curl_multi_socket_action() fd=3 revents=1
Remove fd=3
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to host.example.org (1.2.3.4) port 80 (#0)
* Server auth using Digest with user 'user'
> GET /test/digest-auth?timeout=1000 HTTP/1.1
Host: host.example.org
Accept: */*
Add fd=3 events=1
Timeout changed to 199 ms
Slowing things down by sleeping 500 ms
poll() on 1 fds for 199 ms
call curl_multi_socket_action() fd=3 revents=1
< HTTP/1.1 401 Unauthorized
< Date: Wed, 05 Jul 2017 16:14:30 GMT
< Server: Apache
< WWW-Authenticate: Digest realm="test", qop="auth", nonce="thisISaNONCE"
< X-This-Is-The-401: yes
< Content-Length: 17
<
* Ignoring the response-body
* Connection #0 to host host.example.org left intact
* Issue another request to this URL:
'http://host.example.org/test/digest-auth?timeout=1000'
* Found bundle for host host.example.org: 0x1cd92f0 [can pipeline]
* Re-using existing connection! (#0) with host host.example.org
* Connected to host.example.org (1.2.3.4) port 80 (#0)
* Server auth using Digest with user 'user'
> GET /test/digest-auth?timeout=1000 HTTP/1.1
Host: host.example.org
Authorization: Digest username="user", realm="test", nonce="thisISaNONCE",
uri="/test/digest-auth?timeout=1000",
cnonce="Mjc5MWQwYjMyMGM1M2EzNzQ1MTI0ZWMzNWVjNjNmNTc=", nc=00000001, qop=auth,
response="88a45913497e6bca299c23a497e28b04"
Accept: */*
Timeout changed to 999 ms
poll() on 1 fds for 999 ms
call curl_multi_socket_action() for timeout
* Operation timed out after 1000 milliseconds with 0 bytes received
* Closing connection 0
Remove fd=3
Timeout changed to 298499 ms
Timeout changed to -1 ms
Slowing things down by sleeping 500 ms
===== output-unpatched.txt =====

Timeout changed to 0 ms
Slowing things down by sleeping 500 ms
poll() on 0 fds for 0 ms
call curl_multi_socket_action() for timeout
Add fd=3 events=1
Timeout changed to 0 ms
Slowing things down by sleeping 500 ms
poll() on 1 fds for 0 ms
call curl_multi_socket_action() fd=3 revents=1
Remove fd=3
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to host.example.org (1.2.3.4) port 80 (#0)
* Server auth using Digest with user 'user'
> GET /test/digest-auth?timeout=1000 HTTP/1.1
Host: host.example.org
Accept: */*
Add fd=3 events=1
Timeout changed to 199 ms
Slowing things down by sleeping 500 ms
poll() on 1 fds for 199 ms
call curl_multi_socket_action() fd=3 revents=1
< HTTP/1.1 401 Unauthorized
< Date: Wed, 05 Jul 2017 16:12:26 GMT
< Server: Apache
< WWW-Authenticate: Digest realm="test", qop="auth", nonce="thisISaNONCE"
< X-This-Is-The-401: yes
< Content-Length: 17
<
* Ignoring the response-body
* Connection #0 to host host.example.org left intact
* Issue another request to this URL:
'http://host.example.org/test/digest-auth?timeout=1000'
* Found bundle for host host.example.org: 0xed32f0 [can pipeline]
* Re-using existing connection! (#0) with host host.example.org
* Connected to host.example.org (1.2.3.4) port 80 (#0)
* Server auth using Digest with user 'user'
> GET /test/digest-auth?timeout=1000 HTTP/1.1
Host: host.example.org
Authorization: Digest username="user", realm="test", nonce="thisISaNONCE",
uri="/test/digest-auth?timeout=1000",
cnonce="ZTU5MjZjMzFjYmFiYzhlMjg1MmVmZjFmZmNkYzZhNzA=", nc=00000001, qop=auth,
response="4cd8b67c887fcf90080544de6538d1cf"
Accept: */*
Timeout changed to 299498 ms
poll() on 1 fds for 299498 ms
===== test.c =====

/* This test case is based on several cURL example programs, adapted to show
the "no timeout" issue by using the curl_multi_socket_action() interface.
It's intended to be as simple as possible to show the issue, not to be a
robust or efficient implementation.
This program expects to connect to the following CGI script running under
Apache 2.4:
#!/usr/bin/perl -t
#
# Iff $QUERY_STRING is of the form "timeout=N", where N is a decimal integer,
# then when an Authorization header exists in the request, this will sleep
# N seconds before returning the 200 OK response.
#
use strict;
use warnings;
# Do we have an Authorization header?
my $header = $ENV{HTTP_AUTHORIZATION} || '';
if($header eq '') {
# No, so issue a challenge.
print("Status: 401\r\n");
print("WWW-Authenticate: Digest ",
"realm=\"test\", ",
"qop=\"auth\", ",
"nonce=\"thisISaNONCE\"\r\n");
print("X-This-Is-The-401: yes\r\n");
print("\r\n");
# When configured to do digest authentication, our cURL request should never
# expose this entity via the API.
print("This is the 401.\n");
exit(0);
}
# Yes, they have passed the AUTHORIZATION header.
#
# If there's a timeout, wait for it.
my $request = $ENV{QUERY_STRING} || '';
if($request =~ /^timeout=(\d+)$/) {
sleep($1);
}
# Echo their header back to them.
print("Content-Type: text/plain\r\n\r\n$header");
exit(0);
The following Apache configuration snippet allows the CGI to run and access
to the Authorization header:
<Directory /var/www/test/>
AllowOverride None
Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
Require all granted
# Run all of the files as CGIs.
<Files "*">
SetHandler cgi-script
</Files>
# Expose the Authorization header to the CGIs.
RewriteEngine on
RewriteCond %{HTTP:Authorization} ^(.*)
RewriteRule .* - [e=HTTP_AUTHORIZATION:%1]
</Directory>
*/
#include <stdio.h>
#include <string.h>
#include <poll.h>
#include <assert.h>
#include <errno.h>
#include <curl/curl.h>
/* The URL of the test CGI script, minus query arguments. */
#define TEST_URL "http://host.example.org/test/digest-auth"
/* Enough room for some test fds */
struct pollfd fds[100];
/* The number of fds we are using in the array. */
int fd_count = 0;
/* Called when cURL changes the fd list */
static int sock_cb(CURL *e, curl_socket_t s, int what, void *cbp, void *sockp)
{
int i;
/* Convert events */
short events = 0;
switch(what) {
case CURL_POLL_NONE: events = 0; break;
case CURL_POLL_IN: events = POLLIN; break;
case CURL_POLL_OUT: events = POLLOUT; break;
case CURL_POLL_INOUT: events = POLLIN | POLLOUT; break;
case CURL_POLL_REMOVE: /* We'll remove */ break;
}
/* Find */
for(i = 0; i < fd_count; ++i) {
if(fds[i].fd == s) {
/* Remove? */
if(what == CURL_POLL_REMOVE) {
printf("Remove fd=%d\n", s);
        memmove(fds + i, fds + i + 1, sizeof(fds[0]) * (fd_count - i - 1));
--fd_count;
return 0;
}
/* Change */
printf("Change fd=%d events=%x\n", s, events);
fds[i].events = events;
return 0;
}
}
/* Add */
assert((size_t)i < sizeof(fds) / sizeof(fds[0]));
printf("Add fd=%d events=%x\n", s, events);
fds[i].fd = s;
fds[i].events = events;
++fd_count;
return 0;
}
/* poll() timeout */
long timeout = -1;
/* Called when cURL changes the timeout */
static int multi_timer_cb(CURLM *multi, long timeout_ms, void *ignored)
{
timeout = timeout_ms;
printf("Timeout changed to %ld ms\n", timeout);
if(timeout_ms < 200)
{
/* Simulate some delay so that the timer management in cURL really starts to
matter. */
printf("Slowing things down by sleeping 500 ms\n");
poll(NULL, 0, 500);
}
return 0;
}
/*
* Demonstrate the issue.
*/
int main(void)
{
CURL *http_handle;
CURLM *multi_handle;
int remaining;
curl_global_init(CURL_GLOBAL_DEFAULT);
http_handle = curl_easy_init();
/* Create the test request for digest auth. The server will respond with a
401 and cURL will retry with the credentials but then the server will not
respond to that second request (for 1000 seconds) to demonstrate that
cURL doesn't detect the timeout. */
curl_easy_setopt(http_handle, CURLOPT_URL, TEST_URL "?timeout=1000");
curl_easy_setopt(http_handle, CURLOPT_VERBOSE, 1L);
curl_easy_setopt(http_handle, CURLOPT_USERNAME, "user");
curl_easy_setopt(http_handle, CURLOPT_PASSWORD, "password");
curl_easy_setopt(http_handle, CURLOPT_HTTPAUTH, CURLAUTH_DIGEST);
/* Set a short (but not too short) timeout to demonstrate the error. */
curl_easy_setopt(http_handle, CURLOPT_TIMEOUT_MS, 2000);
/* Create the multi */
multi_handle = curl_multi_init();
/* Set up for curl_multi_socket_action() use */
curl_multi_setopt(multi_handle, CURLMOPT_SOCKETFUNCTION, sock_cb);
curl_multi_setopt(multi_handle, CURLMOPT_TIMERFUNCTION, multi_timer_cb);
/* add the individual transfers */
curl_multi_add_handle(multi_handle, http_handle);
/* loop */
do {
int i;
const int actual_fd_count = fd_count;
struct pollfd actual[100];
/* To keep the example logic simple, just copy everything before we call
poll() so the callbacks can change the array however they want. */
memcpy(actual, fds, sizeof(fds[0]) * fd_count);
printf("poll() on %d fds for %ld ms\n", actual_fd_count, timeout);
const int result = poll(actual, actual_fd_count, timeout);
if(result == -1 && errno != EINTR) {
printf("poll() fails: %s\n", strerror(errno));
return 1;
}
/* dispatch timeout */
if(result == 0) {
printf("call curl_multi_socket_action() for timeout\n");
const CURLMcode result =
curl_multi_socket_action(multi_handle, CURL_SOCKET_TIMEOUT, 0,
&remaining);
if(result != CURLM_OK) {
printf("curl_multi_socket_action() timeout failed: %d\n", result);
return 2;
}
}
/* dispatch events */
else {
for(i = 0; i < actual_fd_count; ++i) {
if(actual[i].revents) {
const int fd = actual[i].fd;
int mask = 0;
if(actual[i].revents & POLLIN) {
mask |= CURL_CSELECT_IN;
}
if(actual[i].revents & POLLOUT) {
mask |= CURL_CSELECT_OUT;
}
if(actual[i].revents & (POLLHUP | POLLERR)) {
mask |= CURL_CSELECT_ERR;
}
printf("call curl_multi_socket_action() fd=%d revents=%x\n", fd,
actual[i].revents);
const CURLMcode result =
curl_multi_socket_action(multi_handle, fd, mask, &remaining);
if(result != CURLM_OK) {
printf("curl_multi_socket_action() fd=%d failed: %d\n", fd, result);
return 3;
}
}
}
}
} while(remaining);
curl_multi_remove_handle(multi_handle, http_handle);
curl_easy_cleanup(http_handle);
curl_multi_cleanup(multi_handle);
curl_global_cleanup();
return 0;
}