Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load
Mon 2012-12-03 at 22:25 -0700, Alex Rousskov wrote:

> I disagree with this logic. Yes, the sawActivity approach supports long
> chains, but they are not the loop's fault! If some chain is too long, the
> bug is in that chain's code and should be fixed there. If it cannot be
> fixed there, then the chain is not too long, but Squid is just being asked
> to do too much.

The event engines we are talking about here are:

    SignalEngine
    EventScheduler
    StoreEngine

Of these, only StoreEngine may possibly need repeated work. SignalEngine is
happy if it gets polled every now and then at low frequency. EventScheduler
does not want to be polled more than once per loop. StoreEngine is about
store callbacks; in the past these should not be invoked too often per loop.

> It is difficult to resolve this disagreement. I argue that the sawActivity
> loop should stay because breaking call chains by I/O callbacks will lead
> to bugs, but I cannot show you any specific bug it will cause. You argue
> that some chains might be so long that they will starve I/O, but you
> cannot show any specific too-long chain (after the heavy event handling
> bug is fixed).

Store I/O is known to have done this in the past.

> Perhaps we can agree to let the sleeping dog lie? The bug you needed to
> fix was with the heavy events handling. My patch or your next patch fix
> that bug. Let's not remove the sawActivity loop until its removal becomes
> the best solution for some other bug. We can discuss the loop's pros and
> cons in the context of that specific bug then.

Ok. I would probably go for both patches; they solve different aspects of
the event loop timings.

My patch fixes the delay calculation, both the end result and readability,
and enables removal of sawActivity. But it does not fix the comm starvation
issue without removal of sawActivity. Today the comm loop delay is often
too short in current code.

Your patch addresses sawActivity and heavy events. If the sawActivity loop
is removed in the future then your patch has no effect, but it still makes
the code more readable.

Regards
Henrik
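[For readers following along, here is a minimal sketch of the sawActivity
pattern being debated. All names and shapes are assumptions for
illustration; this is not Squid's actual EventLoop code. Engines are
re-polled in an inner loop until none of them reports progress, and only
then does control return to the comm (I/O) poll.]

    #include <vector>

    struct Engine {
        // returns true if the engine did some work this pass
        virtual bool checkEvents() = 0;
        virtual ~Engine() {}
    };

    void runOnce(std::vector<Engine*> &engines, void (*commPoll)(int timeoutMs))
    {
        bool sawActivity;
        do {
            sawActivity = false;
            for (Engine *e : engines)
                if (e->checkEvents())
                    sawActivity = true; // something ran; poll engines again
        } while (sawActivity);          // long call chains can delay comm here
        commPoll(10);                   // finally wait for I/O
    }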
Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load
Mon 2012-12-03 at 23:12 -0700, Alex Rousskov wrote:

> BTW, once zero-delay events are removed from the code, the event queue
> will no longer need to be a part of the sawActivity loop and the loop
> will disappear.

Note: legacy zero-delay addEvent events are meant to be called once per
comm loop, not immediately/chained. This is by design. The fact that
zero-delay (or shorter than the current workload) addEvent events get
called immediately, before the next comm event check, is a bug introduced
with the AsyncCall merge when the discussed sawActivity
event-loop-in-the-event-loop was added.

"Heavy" events in the context of addEvent really mean "this is very
heavy", and there is only allowed to be one and only one heavy event per
comm invocation. If there are multiple heavy jobs competing for time, they
are meant to get scheduled in a round-robin fashion with comm in between
each. (See the sketch below.)

Regards
Henrik
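[For illustration, a sketch of how a heavy job could cooperate with that
round-robin rule. RebuildState is hypothetical, and the eventAdd signature
is reproduced from src/event.h of that era; treat the details as
assumptions, not as the actual code.]

    typedef void EVH(void *);
    extern void eventAdd(const char *name, EVH *func, void *arg, double when, int weight);

    struct RebuildState {
        bool done() const;      // all work finished?
        void doSomeWork();      // one bounded slice of work
    };

    static void rebuildStep(void *data)
    {
        RebuildState *rb = static_cast<RebuildState *>(data);
        rb->doSomeWork();
        if (!rb->done())
            // zero delay, non-zero weight: the job reschedules itself
            // instead of looping, so comm runs between slices and multiple
            // heavy jobs naturally round-robin.
            eventAdd("rebuildStep", rebuildStep, rb, 0.0, 1);
    }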
OK after long tests StoreID works for both mem and UFS store.
OK, after long tests StoreID works for both mem and UFS store. I solved and
marked all mismatching tests and for now it seems to work perfectly.

The scenario I have been using is the .ytimg.com domain. It hosts many jpg
pictures on different CDNs. An example:

    http://i2.ytimg.com/vi/Vo0Cazxj_yc/default.jpg

The ID of the image is the uri_path /vi/Vo0Cazxj_yc/default.jpg. The domain
is a subdomain of ytimg, so my match would be all subdomains of ytimg:
ytimg.squid.internal/url_location. The result is:

    http://i2.ytimg.com/vi/Vo0Cazxj_yc/default.jpg
    == http://i2000.ytimg.com/vi/Vo0Cazxj_yc/default.jpg
    == http://abc555.ytimg.com/vi/Vo0Cazxj_yc/default.jpg

This is a basic CDN example. The idea, in pseudo form, is: store the
StoreID from the helper in the request. Next, after all callouts from
helpers and adaptation, store it in ClientHttpRequest->store_id. Then, on
StoreEntry creation, use the store_id instead of ClientHttpRequest->uri.
This solves almost anything related to mem_obj->url and will force changing
it to mem_obj->store_id. Adding a couple of specific checks that point to
the store_id will make the feature integration the least invasive to the
store and other code in Squid. (A sketch of the key-normalization idea is
below.)

The helper code for now is too much for me; I will get back to it later
when Amos has more time, and this part of the patch will be ready, steady
and documented. In the meanwhile I will write more about the feature in
the wiki.

Eliezer

--
Eliezer Croitoru
https://www1.ngtech.co.il
sip:ngt...@sip2sip.info
IT consulting for Nonprofit organizations
eliezer at ngtech.co.il
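[For illustration, a sketch of the key-normalization idea using the ytimg
example above. storeIdFor() is a hypothetical stand-in, not part of the
actual StoreID patch: it collapses all *.ytimg.com image URLs to one
canonical store key so identical objects from different CDN hostnames
share a single cache entry.]

    #include <string>

    static std::string storeIdFor(const std::string &host, const std::string &uriPath)
    {
        const std::string suffix = ".ytimg.com";
        if (host.size() > suffix.size() &&
            host.compare(host.size() - suffix.size(), suffix.size(), suffix) == 0)
            return "http://ytimg.squid.internal" + uriPath; // canonical key
        return ""; // empty: keep the original URI as the store key
    }

    // storeIdFor("i2.ytimg.com",    "/vi/Vo0Cazxj_yc/default.jpg") and
    // storeIdFor("i2000.ytimg.com", "/vi/Vo0Cazxj_yc/default.jpg") both yield
    // "http://ytimg.squid.internal/vi/Vo0Cazxj_yc/default.jpg".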
Helpers parse response bug
Hi all,

I think I found a bug in the current helpers response parsing code: one
byte remains in the helper's I/O buffer after parsing the response. This
will cause problems when the next response from the helper enters Squid.

The bug exists in the helperHandleRead and helperReturnBuffer functions in
src/helper.cc. After srv->rbuf is parsed, srv->rbuf still contains one
byte (a '\0' char) and srv->roffset is 1.

I am posting a patch which fixes the problem, plus a second patch
(helpers-fix-alternate.patch) to help you understand the problem.

Regards,
Christos

=== modified file 'src/helper.cc'
--- src/helper.cc 2012-11-24 14:30:02 +0000
+++ src/helper.cc 2012-12-04 14:17:59 +0000
@@ -866,42 +866,42 @@
         -- srv->stats.pending;
         ++ srv->stats.replies;
         ++ hlp->stats.replies;
         srv->answer_time = current_time;
         srv->dispatch_time = r->dispatch_time;
         hlp->stats.avg_svc_time =
             Math::intAverage(hlp->stats.avg_svc_time,
                              tvSubMsec(r->dispatch_time, current_time),
                              hlp->stats.replies, REDIRECT_AV_FACTOR);
         helperRequestFree(r);
     } else {
         debugs(84, DBG_IMPORTANT, "helperHandleRead: unexpected reply on channel " <<
                request_number << " from " << hlp->id_name << " #" <<
                srv->index + 1 << " '" << srv->rbuf << "'");
     }
 
-    srv->roffset -= (msg_end - srv->rbuf);
-    memmove(srv->rbuf, msg_end, srv->roffset + 1);
+    srv->roffset -= (msg_end - srv->rbuf + 1);
+    memmove(srv->rbuf, msg_end, srv->roffset);
 
     if (!srv->flags.shutdown) {
         helperKickQueue(hlp);
     } else if (!srv->flags.closing && !srv->stats.pending) {
         srv->flags.closing=1;
         srv->writePipe->close();
         return;
     }
 }
 
 static void
 helperHandleRead(const Comm::ConnectionPointer &conn, char *buf, size_t len, comm_err_t flag, int xerrno, void *data)
 {
     char *t = NULL;
     helper_server *srv = (helper_server *)data;
     helper *hlp = srv->parent;
     assert(cbdataReferenceValid(data));
 
     /* Bail out early on COMM_ERR_CLOSING - close handlers will tidy up for us */

=== modified file 'src/helper.cc'
--- src/helper.cc 2012-11-24 14:30:02 +0000
+++ src/helper.cc 2012-12-04 14:09:50 +0000
@@ -915,43 +915,44 @@
     if (flag != COMM_OK || len == 0) {
         srv->closePipesSafely();
         return;
     }
 
     srv->roffset += len;
     srv->rbuf[srv->roffset] = '\0';
     debugs(84, 9, "helperHandleRead: '" << srv->rbuf << "'");
 
     if (!srv->stats.pending) {
         /* someone spoke without being spoken to */
         debugs(84, DBG_IMPORTANT, "helperHandleRead: unexpected read from " <<
                hlp->id_name << " #" << srv->index + 1 << ", " <<
                (int)len << " bytes '" << srv->rbuf << "'");
 
         srv->roffset = 0;
         srv->rbuf[0] = '\0';
     }
 
-    while ((t = strchr(srv->rbuf, hlp->eom))) {
+    char *msg = srv->rbuf;
+    while (*msg == '\0' && (msg - srv->rbuf) < srv->roffset) msg++;
+    while ((t = strchr(msg, hlp->eom))) {
         /* end of reply found */
-        char *msg = srv->rbuf;
         int i = 0;
         debugs(84, 3, "helperHandleRead: end of reply found");
 
         if (t > srv->rbuf && t[-1] == '\r' && hlp->eom == '\n')
             t[-1] = '\0';
 
         *t = '\0';
 
         if (hlp->childs.concurrency) {
             i = strtol(msg, &msg, 10);
 
             while (*msg && xisspace(*msg))
                 ++msg;
         }
 
         helperReturnBuffer(i, srv, hlp, msg, t);
         // only skip off the \0 _after_ passing its location to helperReturnBuffer
         ++t;
     }
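[To see the off-by-one in isolation, here is a standalone toy reproduction
of the same arithmetic. This is not Squid code; the variable names mirror
the ones in helper.cc.]

    #include <cstdio>
    #include <cstring>

    int main()
    {
        char rbuf[32] = "OK\n";           // one complete helper reply
        size_t roffset = 3;               // bytes currently buffered
        char *msg_end = strchr(rbuf, '\n');
        *msg_end = '\0';                  // terminate the reply in place

        // Old arithmetic: the '\0' that replaced the EOM is not consumed,
        // so one stale byte stays in the buffer (leftover == 1).
        size_t stale = roffset - (msg_end - rbuf);

        // Fixed arithmetic: consume the terminator as well (leftover == 0).
        size_t clean = roffset - (msg_end - rbuf + 1);

        printf("old leftover = %zu, fixed leftover = %zu\n", stale, clean);
        return 0;
    }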
Re: [PATCH] SSL server certificate fingerprint ACL type
If there are no objections I will apply the latest "SSL server certificate
fingerprint ACL type" patch (certificate-fingerprint-ACL-v3.patch) to
trunk.

On 11/24/2012 05:47 PM, Tsantilas Christos wrote:
> On 11/23/2012 01:49 PM, Amos Jeffries wrote:
>> On 15/11/2012 1:12 a.m., Tsantilas Christos wrote:
>>> SSL server certificate fingerprint ACL type
>>>
>>> This patch adds the server_ssl_cert_fingerprint acl type, to match
>>> against the server SSL certificate fingerprint. The new acl type has
>>> the form:
>>>
>>>   acl aclname server_ssl_cert_fingerprint [-sha1] fingerprint1 ...
>>>
>>> The fingerprint must be given in the form:
>>>   XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX
>>> where each X is a valid hexadecimal digit.
>>>
>>> Example usage:
>>>
>>>   acl BrokeServer dst 192.168.1.23
>>>   acl GoodCert server_ssl_cert_fingerprint AB:2A:82:AF:46:AE:1F:31:21:74:65:BF:56:47:25:D1:87:51:41:AE
>>>   sslproxy_cert_error allow BrokeServer GoodCert
>>>   sslproxy_cert_error deny all
>>>
>>> Someone can retrieve the fingerprint of a certificate using the
>>> openssl command:
>>>   # openssl x509 -fingerprint -in test.pem -noout
>>>   # openssl s_client -host www.paypal.com -port 443 2>/dev/null | openssl x509 -fingerprint -noout
>>>
>>> This is a Measurement Factory project
>>
>> Finally got to it. Sorry this has taken so long...
>>
>> * ACL name seems to be a bit on the long side. How about dropping the
>>   ssl_ sub-section out of it and changing _fingerprint to _sha1?
>>   == server_cert_sha1
>
> This is not correct. The -sha1 is an argument to specify the type of
> fingerprint. In the future we may add the -md5 argument to specify a
> server_cert_md5 acl. I renamed the acl to
> "server_cert_fingerprint [-sha1]".
>
>> in src/acl/CertificateData.cc:
>> * debugs() at CRITICAL level need ERROR/FATAL/WARNING and to explain
>>   where they are reporting from in a simple way.
>
> ok
>
>> - In ACLCertificateData::parse() the message "required attribute
>>   argument missing" does not say anything useful for fixing the
>>   problem when it hits the logs. The other messages after are a bit
>>   more expressive, since they at least mention 'Acl' or what the
>>   attribute should be, but they all need a bit of friendliness polish.
>>   + something along the lines of "FATAL: Acl '<name?>' ..."
>
> In the logs someone will see something like:
>
>   2012/11/24 17:38:10| FATAL: required attribute argument missing
>   FATAL: Bungled squid-cert-validation.conf line 151: acl USERCERT user_cert
>
> and will understand where the wrong syntax is. We do not have the
> ability inside ACLCertificateData to print the acl name or other
> information.
>
>> in src/acl/ServerCertificate.cc:
>> * please remove addition of $Id$ (same goes elsewhere)
>
> ok
>
>> * fde.h and client_side.h includes should be the other way around
>>   (sorted alphabetical).
>
> ok
>
>> * please remove doubled empty lines
>
> ok
>
>> * please do not add whitespace between function name and parameter
>>   lists. same goes on any new code.
>>   - s/::match\ (/::match(/
>
> ok
>
>> in src/acl/ServerCertificate.h
>> * $Id$ again.
>> * please do not add useless empty lines at the beginning of classes.
>>   +{\n\npublic:\n -> +{\npublic:\n
>
> ok
>
>> in src/acl/StringData.h:
>> * documentation on insert()... whats a "custom value" when its string
>>   data?
>
> ok.
>
>> That is all that stands out at me. Not familiar enough with the cert
>> operations yet to do a more thorough scan through, sorry.
>>
>> NP: this patch meshes with cert validator and will also need
>> re-testing for build issues after that other patch's wrapper macros
>> are fixed up.
>>
>> Amos
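[For completeness, a sketch of how a SHA1 fingerprint in the colon form
that the ACL matches can be computed programmatically. printFingerprint()
is an assumed helper, not part of the patch; X509_digest() and EVP_sha1()
are standard OpenSSL calls.]

    #include <cstdio>
    #include <openssl/evp.h>
    #include <openssl/x509.h>

    // Print the SHA1 fingerprint of a certificate as XX:XX:...:XX,
    // matching the format produced by `openssl x509 -fingerprint`.
    static void printFingerprint(X509 *cert)
    {
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int n = 0;
        if (!X509_digest(cert, EVP_sha1(), md, &n))
            return; // digest computation failed
        for (unsigned int i = 0; i < n; ++i)
            printf("%02X%s", md[i], (i + 1 < n) ? ":" : "\n");
    }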
Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load
On 12/04/2012 01:47 AM, Henrik Nordström wrote:
> Mon 2012-12-03 at 22:25 -0700, Alex Rousskov wrote:
>> I disagree with this logic. Yes, the sawActivity approach supports
>> long chains, but they are not the loop's fault! If some chain is too
>> long, the bug is in that chain's code and should be fixed there. If it
>> cannot be fixed there, then the chain is not too long, but Squid is
>> just being asked to do too much.
>
> The event engines we are talking about here are:
>
>   SignalEngine
>   EventScheduler
>   StoreEngine
>
> Of these, only StoreEngine may possibly need repeated work.
> SignalEngine is happy if it gets polled every now and then at low
> frequency. EventScheduler does not want to be polled more than once per
> loop.

I think that would be true if we did not have any zero-delay events
mimicking async calls. I believe we still have a few of those left. If you
have reviewed all of them and believe they all will be fine with a
non-zero delay (essentially), then I agree that EventScheduler does not
need to be polled more than once (and the sawActivity loop can be
removed).

> StoreEngine is about store callbacks; in the past these should not be
> invoked too often per loop.

I think it makes sense to scan store events once per loop because they are
similar to Comm I/O events, and those are scanned once per loop.

>> Perhaps we can agree to let the sleeping dog lie? The bug you needed
>> to fix was with the heavy events handling. My patch or your next patch
>> fix that bug. Let's not remove the sawActivity loop until its removal
>> becomes the best solution for some other bug. We can discuss the
>> loop's pros and cons in the context of that specific bug then.
>
> Ok. I would probably go for both patches; they solve different aspects
> of the event loop timings.
>
> My patch fixes the delay calculation, both the end result and
> readability, and enables removal of sawActivity. But it does not fix
> the comm starvation issue without removal of sawActivity. Today the
> comm loop delay is often too short in current code.

My argument is that once heavy events are supported, there is no comm
starvation issue that the sawActivity loop is responsible for. Thus, doing
something to fix that issue is out of scope (and should have some other
benefit, such as fixing another bug, to justify risky changes).

> Your patch addresses sawActivity and heavy events. If the sawActivity
> loop is removed in the future then your patch has no effect, but it
> still makes the code more readable.

Not true. My patch has an effect even without the sawActivity loop because
my changes stop the EventScheduler::checkEvents() loop after the first
heavy event (IIRC, you wanted only one heavy event per main loop
iteration). And that change is all that is needed to support heavy events
after the sawActivity loop is gone. (A sketch of that rule follows.)

HTH,

Alex.
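[A rough sketch of the "stop after the first heavy event" rule. The data
structures are assumptions for illustration, not Alex's actual patch:
checkEvents() dispatches due events but returns as soon as one heavy
(weight > 0) event has run, leaving the rest for the next main-loop pass.]

    struct ev_entry {
        void (*handler)(void *);
        void *arg;
        double when;     // absolute due time
        int weight;      // > 0 marks a heavy event
        ev_entry *next;
    };

    // queue is a time-ordered list of heap-allocated entries in this sketch
    void checkEvents(ev_entry *&queue, double now)
    {
        while (queue && queue->when <= now) {
            ev_entry *event = queue;
            queue = event->next;
            event->handler(event->arg);
            const bool heavy = event->weight > 0;
            delete event;
            if (heavy)
                break; // at most one heavy event per invocation
        }
    }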
Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load
On 12/04/2012 01:59 AM, Henrik Nordström wrote:
> Mon 2012-12-03 at 23:12 -0700, Alex Rousskov wrote:
>> BTW, once zero-delay events are removed from the code, the event queue
>> will no longer need to be a part of the sawActivity loop and the loop
>> will disappear.
>
> Note: legacy zero-delay addEvent events are meant to be called once per
> comm loop, not immediately/chained. This is by design.

There are several ways to interpret the designer intent when looking at
undocumented code. I cannot say whether all of the currently remaining
zero-delay events use your interpretation, but I am certain that I have
added zero-delay events in the past that used a comm-independent
interpretation ("get me out of this deeply nested set of calls but resume
work ASAP"). And, IMO, that is actually the right definition, for the
reasons discussed in my response to Amos' email. (The idiom in question is
sketched below.)

> "Heavy" events in the context of addEvent really mean "this is very
> heavy", and there is only allowed to be one and only one heavy event
> per comm invocation. If there are multiple heavy jobs competing for
> time, they are meant to get scheduled in a round-robin fashion with
> comm in between each.

Agreed. Both patches support that AFAICT.

Alex.
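[The zero-delay idiom under discussion, as it appears in callers. The
eventAdd declaration follows src/event.h of that era; treat the details as
illustrative, and resumeWork/deepInsideSomeCallChain as hypothetical.]

    typedef void EVH(void *);
    extern void eventAdd(const char *name, EVH *func, void *arg, double when, int weight);

    static void resumeWork(void *data)
    {
        // continue processing here, now outside the original call chain
    }

    static void deepInsideSomeCallChain(void *context)
    {
        // Alex's interpretation: "get me out of this deeply nested set of
        // calls but resume work ASAP" -- zero delay, light weight.
        eventAdd("resumeWork", resumeWork, context, 0.0, 0);
    }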
Re: [PATCH] Do not send unretriable requests on reused pinned connections
On 12/01/2012 06:02 PM, Alex Rousskov wrote:

To summarize, our choices for a pinned connection created for the current
client are:

Before sending the request:

  1. Reopen and repin used pconns to send unretriable requests, just like
     non-pinned connection code does. This eliminates HTTP pconn race
     conditions for unretriable requests. This is what the proposed patch
     does for SslBump.

  2. Send all requests without reopening the pinned connection. If we
     encounter an HTTP pconn race condition, see below.

After the request, if an HTTP pconn race happens:

  0. Send an error message to the client. This is what we do today and it
     breaks things.

  1. Reset the client connection. Pretend to be a dumb tunnel.

  2. Retry, just like the current non-pinned connection code does. This is
     what the proposed patch implements.

Current code does 2+0. That does not work. The proposed patch does 1+2.
That works in my limited tests. Henrik suggested 2+1. I agree that it is
also an option worth considering.

I am leaning towards implementing 2+1 _and_ removing the current
connection repinning code as not needed. We could leave repinning code in
case it is needed later to fix broken clients, but no such clients are
known at this time, while that code is complex and requires fixes (the
patch posted here earlier), so I am leaning towards removing it completely
for now. With some effort it can be re-introduced at a later time, of
course.

If there are any better ideas, please let me know.

Thank you,

Alex.
Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load
Tue 2012-12-04 at 08:39 -0700, Alex Rousskov wrote:

>> "Heavy" events in the context of addEvent really mean "this is very
>> heavy", and there is only allowed to be one and only one heavy event
>> per comm invocation. If there are multiple heavy jobs competing for
>> time, they are meant to get scheduled in a round-robin fashion with
>> comm in between each.
>
> Agreed. Both patches support that AFAICT.

Yes.

Regards
Henrik
Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load
Tue 2012-12-04 at 08:39 -0700, Alex Rousskov wrote:

> There are several ways to interpret the designer intent when looking at
> undocumented code. I cannot say whether all of the currently remaining
> zero-delay events use your interpretation, but I am certain that I have
> added zero-delay events in the past that used a comm-independent
> interpretation ("get me out of this deeply nested set of calls but
> resume work ASAP"). And, IMO, that is actually the right definition,
> for the reasons discussed in my response to Amos' email.

I can only speak for the Squid-2 code base on this, and the intents behind
the event.c[c] event handling.

> Agreed. Both patches support that AFAICT.

What I ended up with is a combination of both. I kept the sawActivity loop
for now, pending review of zero-delay events and store callbacks. I still
think the sawActivity loop is undesired and not needed, but it's not the
bug we are working on right now.

But your patch also needs timeTillNextEvent to calculate the loop delay,
or the timed events starve due to loop_delay being calculated wrongly (not
taking into account events added just now). A sketch of that calculation
follows this message.

Note: there is a slight but critical error in my timeTillNextEvent patch
as posted. loop_delay requested should be . Other than this it seems to
work.

And there were a couple of other problems seen today:

*) UFS store rebuilding crashes if there are cache collisions with non-UFS
stores competing for the same object:

    assertion failed: ufs/ufscommon.cc:709: "sde"

Have a patch for this.

*) Some undiagnosed crash:

    assertion failed: comm.cc:1259: "isOpen(fd)"

Have seen this randomly earlier, but now it hit all the workers at about
the same time, +/- some seconds. This was randomly seen before the event
loop rework as well.

*) Shortly after the above, after the workers had restarted and rebuilt
their caches, all started to spew

    Worker I/O push queue overflow: ipcIo1.13951w6

and calmed down after some minutes. Continued for ~10 min I think. I/O
load on the server was marginal.

Regards
Henrik
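[A sketch of the loop-delay calculation described above. The data
structures and helper names are assumptions, not Henrik's actual patch:
the comm poll timeout must never exceed the time remaining until the next
timed event, or newly added events starve until the poll returns for some
other reason.]

    struct ev_entry;                 // head of the time-ordered event queue
    extern ev_entry *tasks;
    extern double eventNextTime();   // 'when' of the earliest queued event

    double timeTillNextEvent(double maxDelay, double now)
    {
        if (!tasks)
            return maxDelay;                  // nothing scheduled: full delay
        double delay = eventNextTime() - now; // may be <= 0 if overdue
        if (delay < 0.0)
            delay = 0.0;                      // overdue: poll comm without waiting
        return (delay < maxDelay) ? delay : maxDelay;
    }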
[PATCH] Re: Helpers parse response bug
On 05.12.2012 03:24, Tsantilas Christos wrote:
> Hi all,
>
> I think I found a bug in the current helpers response parsing code: one
> byte remains in the helper's I/O buffer after parsing the response.
> This will cause problems when the next response from the helper enters
> Squid.
>
> The bug exists in the helperHandleRead and helperReturnBuffer functions
> in src/helper.cc. After srv->rbuf is parsed, srv->rbuf still contains
> one byte (a '\0' char) and srv->roffset is 1.
>
> I am posting a patch which fixes the problem, plus a second patch
> (helpers-fix-alternate.patch) to help you understand the problem.
>
> Regards,
>   Christos

Aha! Thank you.

There is another slightly related bug here as well, which I think we
should clean out immediately. The attached patch fixes both of the (1a,
1b) issues below.

1a) A layering issue between helperHandleRead() and helperReturnBuffer().
helperReturnBuffer() is only dealing with the message sub-string, but has
been made to memmove() the buffer contents, which means it has to become
aware of these problematic terminal \0 octet(s). However,
helperHandleRead() is where the termination is being done and where the
buffer information is all being handled.
- need to do the memmove() in helperHandleRead()

1b) Helpers which return \r\n are still passed the location of the \n as
endpoint, to work around (1a), even though the \r is also replaced with \0
and shortens the msg portion by one octet. This affects the HelperReply
parser length checks on responses without kv-pairs.
- need to pass the first of the termination \0 octets, not the last, to
helperReturnBuffer(). This is made possible after fixing (1a).

Also,

2) helperStatefulHandleRead() is not shuffling the buffer remainders
forward for future handling, but assuming only one response line is sent
per read and dropping anything else.
- we need to do this shuffling. Documented but omitted from this patch,
since it is not causing broken behaviour right now and I still have to
track down the read() handlers and large response handling that needs to
be checked with that change.

Amos

=== modified file 'src/helper.cc'
--- src/helper.cc 2012-11-24 14:30:02 +0000
+++ src/helper.cc 2012-12-04 23:17:03 +0000
@@ -848,9 +848,6 @@
 {
     helper_request *r = srv->requests[request_number];
     if (r) {
-        // TODO: parse the reply into new helper reply object
-        // pass that to the callback instead of msg
-
         HLPCB *callback = r->callback;
 
         srv->requests[request_number] = NULL;
@@ -883,15 +880,12 @@
                request_number << " from " << hlp->id_name << " #" <<
                srv->index + 1 << " '" << srv->rbuf << "'");
     }
 
-    srv->roffset -= (msg_end - srv->rbuf);
-    memmove(srv->rbuf, msg_end, srv->roffset + 1);
 
     if (!srv->flags.shutdown) {
         helperKickQueue(hlp);
     } else if (!srv->flags.closing && !srv->stats.pending) {
         srv->flags.closing=1;
         srv->writePipe->close();
-        return;
     }
 }
 
@@ -936,10 +930,16 @@
         /* end of reply found */
         char *msg = srv->rbuf;
         int i = 0;
+        int skip = 1;
         debugs(84, 3, "helperHandleRead: end of reply found");
 
-        if (t > srv->rbuf && t[-1] == '\r' && hlp->eom == '\n')
-            t[-1] = '\0';
+        if (t > srv->rbuf && t[-1] == '\r' && hlp->eom == '\n') {
+            *t = '\0';
+            // rewind to the \r octet which is the real terminal now
+            // and remember that we have to skip forward 2 places now.
+            skip = 2;
+            --t;
+        }
 
         *t = '\0';
 
@@ -951,8 +951,8 @@
         }
 
         helperReturnBuffer(i, srv, hlp, msg, t);
-        // only skip off the \0 _after_ passing its location to helperReturnBuffer
-        ++t;
+        srv->roffset -= (t - srv->rbuf) + skip;
+        memmove(srv->rbuf, t, srv->roffset);
     }
 
     if (Comm::IsConnOpen(srv->readPipe)) {
@@ -1049,6 +1049,12 @@
         t += skip;
 
     srv->flags.busy = 0;
+    /**
+     * BUG: the below assumes that only one response per read() was received
+     * and discards any octets remaining.
+     * Doing this prohibits concurrency support with multiple replies per read().
+     * TODO: check that read() setup on these buffers pays attention to roffset != 0
+     * TODO: check that replies bigger than the buffer are discarded and do not affect future replies
+     */
     srv->roffset = 0;
     helperStatefulRequestFree(r);
     srv->request = NULL;
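[A standalone toy illustrating the intent of the skip logic for
\r\n-terminated replies. This is not Squid code and is simplified rather
than a literal copy of the patch: it shuffles from t + skip so that both
terminator octets are consumed before the next reply is parsed.]

    #include <cstdio>
    #include <cstring>

    int main()
    {
        char rbuf[32] = "OK\r\nNEXT";    // one CRLF reply plus leftover data
        size_t roffset = 8;              // bytes currently buffered
        char *t = strchr(rbuf, '\n');    // EOM octet
        int skip = 1;
        if (t > rbuf && t[-1] == '\r') { // CRLF-terminated reply
            *t = '\0';
            --t;                         // msg really ends at the \r position
            skip = 2;                    // two terminator octets to consume
        }
        *t = '\0';                       // msg is now "OK"

        roffset -= (t - rbuf) + skip;    // bytes left after msg + terminators
        memmove(rbuf, t + skip, roffset);
        printf("remaining %zu bytes: %.*s\n", roffset, (int)roffset, rbuf); // "NEXT"
        return 0;
    }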
Re: OK after long tests StoreID works for both mem and UFS store.
On 05.12.2012 02:45, Eliezer Croitoru wrote:
<snip>
> The helper code for now is too much for me and I will get back to it
> later when Amos has more time, and this part of the patch will be
> ready, steady and documented.

Christos found a bug which I expect is causing the "Unknown" results.
Please try the patch proposed to fix that bug and see if it resolves the
issues you found as well.

Amos