Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

2012-12-04 Thread Henrik Nordström
Mon 2012-12-03 at 22:25 -0700, Alex Rousskov wrote:

 I disagree with this logic. Yes, sawActivity approach supports long
 chains, but they are not the loop's fault! If some chain is too long,
 the bug is in that chain's code and should be fixed there. If it cannot
 be fixed there, then the chain is not too long, but Squid is just
 being asked to do too much.

The event engines we are talking about here are:

SignalEngine
EventScheduler
StoreEngine

Of these only possibly StoreEngine may need repeated work.

SignalEngine is happy if it gets polled every now and then at low
frequency.

EventScheduler does not want to be polled more than once per loop.

StoreEngine is about store callbacks. In the past this should not need to
be invoked more than once per loop.
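
Roughly, the polling contract being described is this (a minimal sketch
with hypothetical names, not the actual src/EventLoop.cc code):

    #include <vector>

    // Hypothetical stand-in for the engine interface; illustrative only.
    struct AsyncEngine {
        // returns the delay (in ms) until this engine next needs service,
        // or a negative value when it has no timed work pending
        virtual int checkEvents(int timeout) = 0;
        virtual ~AsyncEngine() {}
    };

    // One iteration of the main loop: every engine is polled exactly once,
    // and comm sleeps for the smallest delay any engine requested.
    int runOnceSketch(std::vector<AsyncEngine*> &engines, AsyncEngine *comm)
    {
        int loop_delay = 1000; // hypothetical idle cap, in ms
        for (size_t i = 0; i < engines.size(); ++i) {
            const int requested = engines[i]->checkEvents(loop_delay);
            if (requested >= 0 && requested < loop_delay)
                loop_delay = requested; // someone needs waking sooner
        }
        // comm (select/epoll) runs last and may block up to loop_delay
        return comm->checkEvents(loop_delay);
    }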

 It is difficult to resolve this disagreement. I argue that the
 sawActivity loop should stay because breaking call chains by I/O
 callbacks will lead to bugs, but I cannot show you any specific bug it
 will cause. You argue that some chains might be so long that they will
 starve I/O, but you cannot show any specific too long chain (after the
 heavy event handling bug is fixed).

store I/O is known to have done this in the past.

 Perhaps we can agree to let the sleeping dog lie? The bug you needed to
 fix was with the heavy events handling. My patch or your next patch fix
 that bug. Let's not remove the sawActivity loop until its removal
 becomes the best solution for some other bug. We can discuss the loop's
 pros and cons in the context of that specific bug then.

Ok.

I would probably go for both patches. They solve different aspects of the
event loop timing.

My patch fixes the delay calculation, both end result and readability,
and enables removal of sawActivity. But it does not fix the comm
starvation issue without removal of sawActivity. In the current code the
comm loop delay is often too short.

Your patch addresses sawActivity and heavy events. If the sawActivity
loop is removed in future then your patch has no effect, but it still
makes the code more readable.

Regards
Henrik



Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

2012-12-04 Thread Henrik Nordström
Mon 2012-12-03 at 23:12 -0700, Alex Rousskov wrote:

 BTW, once zero-delay events are removed from the code, the event queue
 will no longer need to be a part of the sawActivity loop and the loop
 will disappear.

Note: Legacy zero-delay addEvent events are meant to be called once per
comm loop. Not immediately/chained. This is by design.

The fact that zero (or shorter than current workload) delay addEvent
events get called immediately, before the next comm event check, is a bug
introduced with the AsyncCall merge when the discussed sawActivity
event-loop-in-the-event-loop was added.

"Heavy" events in the context of addEvent events really mean "this is
very heavy", and there is allowed to be one and only one heavy event per
comm invocation. If there are multiple heavy jobs competing for time they
are meant to be scheduled in a round-robin fashion with comm in between
each.
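
As a sketch of that contract (eventAdd() is the real call from event.h;
the job type and the weight-means-heavy marking are illustrative
assumptions):

    // A heavy job does one bounded slice of work per comm invocation and
    // re-queues itself, so competing heavy jobs round-robin with comm
    // runs in between. RebuildState here is made up for the example.
    struct RebuildState {
        bool done;
        void doOneSlice();   // hypothetical bounded unit of work
    };

    static void heavyRebuildStep(void *data)
    {
        RebuildState *rb = static_cast<RebuildState *>(data);
        rb->doOneSlice();
        if (!rb->done) {
            // zero delay, non-zero weight: "run me again after the next
            // comm pass, and never dispatch two heavy events per pass"
            eventAdd("heavyRebuildStep", heavyRebuildStep, rb, 0.0, 1);
        }
    }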

Regards
Henrik



OK after long tests StoreID works for both mem and UFS store.

2012-12-04 Thread Eliezer Croitoru

OK after long tests StoreID works for both mem and UFS store.
I solved and marked all mismatching tests and for now it seems to work 
perfectly:


The scenario I have been using is the .ytimg.com domain.
It hosts many jpg pictures on different CDNs.
An example:
http://i2.ytimg.com/vi/Vo0Cazxj_yc/default.jpg
The ID of the image is the uri_path /vi/Vo0Cazxj_yc/default.jpg.
The domain is a subdomain of ytimg, so my match would map all subdomains
of ytimg to ytimg.squid.internal/url_location.



The result is:
http://i2.ytimg.com/vi/Vo0Cazxj_yc/default.jpg == 
http://i2000.ytimg.com/vi/Vo0Cazxj_yc/default.jpg == 
http://abc555.ytimg.com/vi/Vo0Cazxj_yc/default.jpg


This is a basic cdn example.

The pseudo-code idea for now is:
store the StoreID from the helper in the request;
then, after all callouts from helpers and adaptation, store it in
ClientHttpRequest->store_id;
then on StoreEntry creation use the store_id instead of
ClientHttpRequest->uri.
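
In code, the lookup-key switch could look roughly like this (a sketch;
the store_id member and its type are taken from the description above,
not from a finished patch):

    // prefer the helper-provided Store ID, fall back to the real URI
    static const char *
    effectiveStoreKey(const ClientHttpRequest *http)
    {
        if (http->store_id.size() > 0)        // hypothetical String member
            return http->store_id.termedBuf();
        return http->uri;
    }

    // ... and at StoreEntry creation time, something like:
    // StoreEntry *e = storeCreateEntry(effectiveStoreKey(http), http->log_uri, ...);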


This solves almost everything related to mem_obj->url and will force 
changing it to mem_obj->store_id.


Adding a couple of specific checks to point into the store_id will make 
the feature integration the least invasive to the store and other code in 
Squid.


The helper code for now is too much for me; I will get back to it later 
when Amos has more time and this part of the patch is ready, stable and 
documented.


In the meanwhile I will write more about the feature in the wiki.

Eliezer
--
Eliezer Croitoru
https://www1.ngtech.co.il
sip:ngt...@sip2sip.info
IT consulting for Nonprofit organizations
eliezer at ngtech.co.il


Helpers parse response bug

2012-12-04 Thread Tsantilas Christos
Hi all,

I think I found a bug in the current helpers response parsing code: one
byte remains in the helper's I/O buffer after parsing the response. This
will cause problems when the next response from the helper enters Squid.

The bug exists in the helperHandleRead and helperReturnBuffer functions
in the src/helper.cc file.
After srv->rbuf is parsed, srv->rbuf still contains one byte (a
'\0' char) and srv->roffset is 1.
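
A standalone demonstration of the off-by-one (plain C++, independent of
Squid): with a reply "OK\n" the eom byte is replaced by '\0' and msg_end
points at it; the old arithmetic leaves roffset at 1 with a stray '\0'
byte, the fixed one leaves it at 0.

    #include <cstdio>
    #include <cstring>

    int main()
    {
        char rbuf[16] = "OK\n";
        int roffset = 3;                 // bytes buffered by the read handler
        char *msg_end = strchr(rbuf, '\n');
        *msg_end = '\0';                 // message terminated in place

        printf("old:   roffset = %d\n", roffset - (int)(msg_end - rbuf));     // 1
        printf("fixed: roffset = %d\n", roffset - (int)(msg_end - rbuf + 1)); // 0
        return 0;
    }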

I am posting a patch which fixes the problem.
Also a second patch (helpers-fix-alternate.patch) to help you understand
the problem.

Regards,
   Christos


=== modified file 'src/helper.cc'
--- src/helper.cc	2012-11-24 14:30:02 +0000
+++ src/helper.cc	2012-12-04 14:17:59 +0000
@@ -866,42 +866,42 @@
         -- srv->stats.pending;
         ++ srv->stats.replies;

         ++ hlp->stats.replies;

         srv->answer_time = current_time;

         srv->dispatch_time = r->dispatch_time;

         hlp->stats.avg_svc_time =
             Math::intAverage(hlp->stats.avg_svc_time,
                              tvSubMsec(r->dispatch_time, current_time),
                              hlp->stats.replies, REDIRECT_AV_FACTOR);

         helperRequestFree(r);
     } else {
         debugs(84, DBG_IMPORTANT, "helperHandleRead: unexpected reply on channel " <<
                request_number << " from " << hlp->id_name << " #" << srv->index + 1 <<
                " '" << srv->rbuf << "'");
     }
-    srv->roffset -= (msg_end - srv->rbuf);
-    memmove(srv->rbuf, msg_end, srv->roffset + 1);
+    srv->roffset -= (msg_end - srv->rbuf + 1);
+    memmove(srv->rbuf, msg_end, srv->roffset);

     if (!srv->flags.shutdown) {
         helperKickQueue(hlp);
     } else if (!srv->flags.closing && !srv->stats.pending) {
         srv->flags.closing=1;
         srv->writePipe->close();
         return;
     }
 }

 static void
 helperHandleRead(const Comm::ConnectionPointer &conn, char *buf, size_t len, comm_err_t flag, int xerrno, void *data)
 {
     char *t = NULL;
     helper_server *srv = (helper_server *)data;
     helper *hlp = srv->parent;
     assert(cbdataReferenceValid(data));

     /* Bail out early on COMM_ERR_CLOSING - close handlers will tidy up for us */

=== modified file 'src/helper.cc'
--- src/helper.cc	2012-11-24 14:30:02 +0000
+++ src/helper.cc	2012-12-04 14:09:50 +0000
@@ -915,43 +915,44 @@

     if (flag != COMM_OK || len == 0) {
         srv->closePipesSafely();
         return;
     }

     srv->roffset += len;
     srv->rbuf[srv->roffset] = '\0';
     debugs(84, 9, "helperHandleRead: '" << srv->rbuf << "'");

     if (!srv->stats.pending) {
         /* someone spoke without being spoken to */
         debugs(84, DBG_IMPORTANT, "helperHandleRead: unexpected read from " <<
                hlp->id_name << " #" << srv->index + 1 << ", " << (int)len <<
                " bytes '" << srv->rbuf << "'");

         srv->roffset = 0;
         srv->rbuf[0] = '\0';
     }

-    while ((t = strchr(srv->rbuf, hlp->eom))) {
+    char *msg = srv->rbuf;
+    while (*msg == '\0' && (msg - srv->rbuf) < srv->roffset) msg++;
+    while ((t = strchr(msg, hlp->eom))) {
         /* end of reply found */
-        char *msg = srv->rbuf;
         int i = 0;
         debugs(84, 3, "helperHandleRead: end of reply found");

         if (t > srv->rbuf && t[-1] == '\r' && hlp->eom == '\n')
             t[-1] = '\0';

         *t = '\0';

         if (hlp->childs.concurrency) {
             i = strtol(msg, &msg, 10);

             while (*msg && xisspace(*msg))
                 ++msg;
         }

         helperReturnBuffer(i, srv, hlp, msg, t);
         // only skip off the \0 _after_ passing its location to helperReturnBuffer
         ++t;
     }
 



Re: [PATCH] SSL server certificate fingerprint ACL type

2012-12-04 Thread Tsantilas Christos
If there are no objections I will apply the latest SSL server
certificate fingerprint ACL type patch
(certificate-fingerprint-ACL-v3.patch) to trunk.


On 11/24/2012 05:47 PM, Tsantilas Christos wrote:
 On 11/23/2012 01:49 PM, Amos Jeffries wrote:
 On 15/11/2012 1:12 a.m., Tsantilas Christos wrote:
 SSL server certificate fingerprint ACL type

 This patch adds the server_ssl_cert_fingerprint acl type to match
 against the server SSL certificate fingerprint.
 The new acl type has the form:
acl aclname server_ssl_cert_fingerprint [-sha1] fingerprint1 ...

 The fingerprint must be given in the form:
  XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX
 where each X is a valid hexadecimal digit.

 Example usage:
 acl BrokeServer dst 192.168.1.23
 acl GoodCert server_ssl_cert_fingerprint
 AB:2A:82:AF:46:AE:1F:31:21:74:65:BF:56:47:25:D1:87:51:41:AE
 sslproxy_cert_error allow BrokeServer GoodCert
 sslproxy_cert_error deny all

 Someone can retrieve the fingerprint of a certificate using the openssl
 command:
# openssl x509 -fingerprint -in test.pem -noout
   # openssl s_client -host www.paypal.com -port 443 2>/dev/null |
 openssl x509 -fingerprint -noout
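
 For reference, this is the digest that OpenSSL's X509_digest() computes;
 a minimal sketch of producing the XX:XX:... form from an already loaded
 certificate (error handling omitted; illustrative, not code from the
 patch):

     #include <cstdio>
     #include <openssl/evp.h>
     #include <openssl/x509.h>

     void printSha1Fingerprint(X509 *cert)
     {
         unsigned char md[EVP_MAX_MD_SIZE];
         unsigned int n = 0;
         if (!X509_digest(cert, EVP_sha1(), md, &n))
             return;   // digest failed
         for (unsigned int i = 0; i < n; ++i)
             printf("%02X%s", md[i], (i + 1 < n) ? ":" : "\n");
     }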


 This is a Measurement Factory project

 Finally got to it. Sorry this has taken so long...

 * ACL name seems to be a bit on the long side. How about dropping the
 ssl_ sub-section out of it and changing _fingerprint to _sha1?
 == server_cert_sha1
 
 This is not correct.
 The -sha1 is an argument to specify the type of fingerprint. In the
 future we may add the -md5 argument to specify a server_cert_md5 acl.
 
 I renamed the acl to
 server_cert_fingerprint [-sha1]
 
 

 in src/acl/CertificateData.cc:
 * debugs() at CRITICAL level need ERROR/FATAL/WARNING and to explain
 where they are reporting from in a simple way.
 
 ok
 
 
  - In ACLCertificateData::parse() the message "required attribute
 argument missing" does not say anything useful for fixing the problem
 when it hits the logs. The other messages after it are a bit more
 expressive, since they at least mention 'Acl' or what the attribute
 should be, but they all need a bit of friendliness polish.
   + something along the lines of "FATAL: Acl 'name' ..."
 
 In logs someone will see something like:
  2012/11/24 17:38:10| FATAL: required attribute argument missing
  FATAL: Bungled squid-cert-validation.conf line 151: acl USERCERT user_cert
 
 One will understand the wrong syntax.
 We do not have the ability inside ACLCertificateData to print the acl
 name or other information.
 
 
 


 in src/acl/ServerCertificate.cc:
 * please remove addition of $Id$  (same goes elsewhere)
 
 ok
 
 * fde.h and client_side.h includes should be the other way around
 (sorted alphabetical).
 ok
 
 * please remove doubled empty lines
 ok
 
 * please do not add whitespace between function name and parameter
 lists. Same goes for any new code.
   -s/::match\ (/::match(/
 
 ok
 


 in src/acl/ServerCertificate.h
 * $Id$ again.
 * please do not add useless empty lines at the beginning of classes.
 +{\n\npublic:\n -> +{\npublic:\n
 
 ok
 

 in src/acl/StringData.h:
 * documentation on insert()... what's a custom value when it's string data?
 
 ok.
 


 That is all that stands out at me. Not familiar enough with the cert
 operations yet to do a more thorough scan through, sorry.


 NP: this patch meshes with the cert validator and will also need re-testing
 for build issues after that other patch's wrapper macros are fixed up.

 Amos

 



Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

2012-12-04 Thread Alex Rousskov
On 12/04/2012 01:47 AM, Henrik Nordström wrote:
 Mon 2012-12-03 at 22:25 -0700, Alex Rousskov wrote:
 
 I disagree with this logic. Yes, sawActivity approach supports long
 chains, but they are not the loop's fault! If some chain is too long,
 the bug is in that chain's code and should be fixed there. If it cannot
 be fixed there, then the chain is not too long, but Squid is just
 being asked to do too much.
 
 The event engines we are talking about here are:
 
 SignalEngine
 EventScheduler
 StoreEngine
 
 Of these only possibly StoreEngine may need repeated work.
 
 SignalEngine is happy if it gets polled every now and then at low
 frequency.

 EventScheduler does not want to be polled more than once per loop.

I think that would be true if we did not have any zero-delay events
mimicking async calls. I believe we still have a few of those left. If
you have reviewed all of them and believe they will all be fine with a
non-zero delay (essentially), then I agree that EventScheduler does not
need to be polled more than once (and the sawActivity loop can be removed).


 StoreEngine is about store callbacks. In the past this should not need
 to be invoked more than once per loop.

I think it makes sense to scan store events once per loop because they
are similar to Comm I/O events and those are scanned once per loop.


 Perhaps we can agree to let the sleeping dog lie? The bug you needed to
 fix was with the heavy events handling. My patch or your next patch fix
 that bug. Let's not remove the sawActivity loop until its removal
 becomes the best solution for some other bug. We can discuss the loop's
 pros and cons in the context of that specific bug then.
 
 Ok.
 
 I would probably go for both patches. They solve different aspects of
 the event loop timing.
 
 My patch fixes the delay calculation, both end result and readability,
 and enables removal of sawActivity. But it does not fix the comm
 starvation issue without removal of sawActivity. In the current code
 the comm loop delay is often too short.

My argument is that once heavy events are supported, there is no comm
starvation issue that the sawActivity loop is responsible for. Thus,
doing something to fix that issue is out of scope (and should have some
other benefit such as fixing another bug to justify risky changes).


 Your patch addresses sawActivity and heavy events. If the sawActivity
 loop is removed in future then your patch has no effect, but it still
 makes the code more readable.

Not true. My patch has an effect even without the sawActivity loop
because my changes stop the EventScheduler::checkEvents() loop after the
first heavy event (IIRC, you wanted only one heavy event per main loop
iteration). And that change is all that is needed to support heavy
events after the sawActivity loop is gone.
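
In sketch form (approximate names and an assumed weight-means-heavy
marker, not the literal patch):

    // The dispatch loop stops after the first heavy event, so at most
    // one heavy event runs per main-loop iteration; light events drain.
    struct ev_entry {
        void (*func)(void *);
        void *arg;
        double when;     // absolute due time
        int weight;      // assumed: weight > 0 marks a "heavy" event
        ev_entry *next;
    };

    static void checkEventsSketch(ev_entry *&queue, double now)
    {
        while (queue && queue->when <= now) {
            ev_entry *event = queue;
            queue = event->next;
            event->func(event->arg);
            if (event->weight > 0)
                break;   // leave further heavy work for the next iteration
        }
    }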


HTH,

Alex.



Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

2012-12-04 Thread Alex Rousskov
On 12/04/2012 01:59 AM, Henrik Nordström wrote:
 Mon 2012-12-03 at 23:12 -0700, Alex Rousskov wrote:
 
 BTW, once zero-delay events are removed from the code, the event queue
 will no longer need to be a part of the sawActivity loop and the loop
 will disappear.
 
 Note: Legacy zero-delay addEvent events are meant to be called once per
 comm loop. Not immediately/chained. This is by design.

There are several ways to interpret the designer intent when looking at
undocumented code. I cannot say whether all of the currently remaining
zero-delay events use your interpretation, but I am certain that I have
added zero-delay events in the past that used a comm-independent
interpretation ("get me out of this deeply nested set of calls but
resume work ASAP"). And, IMO, that is actually the right definition for
the reasons discussed in my response to Amos' email.
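
The two interpretations, side by side (eventAdd() and ScheduleCallHere()
are real calls; the handler names are illustrative and the dialer is
elided):

    // (a) "run me once per comm loop, after the next comm pass":
    eventAdd("deferredStep", deferredStep, state, 0.0, 0);

    // (b) "get me out of this deeply nested chain but resume ASAP" --
    //     which is what an async call expresses directly, e.g.:
    //     ScheduleCallHere(asyncCall(93, 5, "resumeStep", dialer));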


 "Heavy" events in the context of addEvent events really mean "this is
 very heavy", and there is allowed to be one and only one heavy event per
 comm invocation. If there are multiple heavy jobs competing for time
 they are meant to be scheduled in a round-robin fashion with comm in
 between each.

Agreed. Both patches support that AFAICT.

Alex.



Re: [PATCH] Do not send unretriable requests on reused pinned connections

2012-12-04 Thread Alex Rousskov
On 12/01/2012 06:02 PM, Alex Rousskov wrote:
 To summarize, our choices for a pinned connection created for the
 current client are:
 
 Before sending the request:
 
 1. Reopen and repin used pconns to send unretriable requests,
just like non-pinned connection code does. This eliminates
HTTP pconn race conditions for unretriable requests. This is
what the proposed patch does for SslBump.
 
 2. Send all requests without reopening the pinned connection.
If we encounter an HTTP pconn race condition, see below.
 
 
 After the request, if an HTTP pconn race happens:
 
 0. Send an error message to the client.
This is what we do today and it breaks things.
 
 1. Reset the client connection.
Pretend to be a dumb tunnel.
 
 2. Retry, just like the current non-pinned connection code does.
This is what the proposed patch implements.
 
 
 Current code does 2+0. That does not work.
 
 The proposed patch does 1+2. That works in my limited tests.
 
 Henrik suggested 2+1. I agree that it is also an option worth
 considering.

I am leaning towards implementing 2+1 _and_ removing the current
connection repinning code as not needed.

We could leave the repinning code in case it is needed later to fix broken
clients, but no such clients are known at this time, while that code is
complex and requires fixes (the patch posted here earlier), so I am
leaning towards removing it completely for now. With some effort it can
be re-introduced at a later time, of course.
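
For concreteness, a rough sketch of the 2+1 behaviour (hypothetical
member names, not the posted patch):

    // Hypothetical facade over FwdState; names are illustrative.
    struct FwdStateSketch {
        bool requestWasPinned() const;
        bool checkRetriable() const;    // can this request be retried safely?
        void closeClientConnection();   // drop/RST the client side
        void retry();                   // normal non-pinned recovery
    };

    // Option "2+1": send on the pinned connection as-is (2); on a pconn
    // race with an unretriable request, reset the client connection (1)
    // instead of serving an error page.
    void handlePinnedPconnRace(FwdStateSketch &fwd)
    {
        if (fwd.requestWasPinned() && !fwd.checkRetriable()) {
            fwd.closeClientConnection();  // pretend to be a dumb tunnel
            return;
        }
        fwd.retry();
    }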

If there are any better ideas, please let me know.


Thank you,

Alex.



Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

2012-12-04 Thread Henrik Nordström
Tue 2012-12-04 at 08:39 -0700, Alex Rousskov wrote:

  "Heavy" events in the context of addEvent events really mean "this is
  very heavy", and there is allowed to be one and only one heavy event per
  comm invocation. If there are multiple heavy jobs competing for time
  they are meant to be scheduled in a round-robin fashion with comm in
  between each.
 
 Agreed. Both patches support that AFAICT.

Yes.

Regards
Henrik



Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

2012-12-04 Thread Henrik Nordström
Tue 2012-12-04 at 08:39 -0700, Alex Rousskov wrote:

 There are several ways to interpret the designer intent when looking at
 undocumented code. I cannot say whether all of the currently remaining
 zero-delay events use your interpretation, but I am certain that I have
 added zero-delay events in the past that used a comm-independent
 interpretation (get me out of this deeply nested set of calls but
 resume work ASAP). And, IMO, that is actually the right definition for
 the reasons discussed in my response to Amos email.

I can only speak for the Squid-2 code base on this, and the intent
behind the event.c[c] event handling.

 Agreed. Both patches support that AFAICT.

What I ended up with is a combination of both.

I kept the sawActivity loop for now, pending review of zero-delay events
and store callbacks. I still think the sawActivity loop is undesired and
not needed, but it's not the bug we are working on right now.

But your patch also needs timeTillNextEvent to calculate the loop delay,
or the timed events starve due to loop_delay being calculated wrongly
(not taking into account events added just now).

Note: There is a slight but critical error in my timeTillNextEvent patch
as posted: the comparison between loop_delay and the requested delay uses
the wrong operator. Other than that it seems to work.
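
For clarity, the role timeTillNextEvent plays, as a sketch (illustrative,
not the posted patch): the comm delay must be recomputed from the
earliest queued event after the engines have run, so events added during
this pass shorten the sleep instead of starving.

    // nextDue: absolute due time of the earliest queued event, < 0 if none
    int timeTillNextEventMs(double nextDue, double now)
    {
        const int idleCapMs = 1000;           // assumed idle cap
        if (nextDue < 0)
            return idleCapMs;                 // no timed events queued
        const double delta = nextDue - now;   // may be <= 0 if already due
        if (delta <= 0)
            return 0;                         // poll comm, then run it
        const int ms = (int)(delta * 1000.0);
        return ms < idleCapMs ? ms : idleCapMs;
    }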



And there were a couple of other problems seen today:

*) UFS store rebuilding crashes if there are cache collisions with
non-UFS stores competing for the same object:
assertion failed: ufs/ufscommon.cc:709: sde
I have a patch for this.

*) Some undiagnosed crash:
assertion failed: comm.cc:1259: isOpen(fd)
I have seen this randomly earlier, but now it hit all the workers at about
the same time, +/- some seconds. This was also randomly seen before the
event loop rework.

*) Shortly after the above, after the workers had restarted and rebuilt
their caches, they all started to spew
Worker I/O push queue overflow: ipcIo1.13951w6
and calmed down after some minutes. It continued for ~10 min I think. I/O
load on the server was marginal.

Regards
Henrik



[PATCH] Re: Helpers parse response bug

2012-12-04 Thread Amos Jeffries

On 05.12.2012 03:24, Tsantilas Christos wrote:

Hi all,

I think I found a bug in the current helpers response parsing code: one
byte remains in the helper's I/O buffer after parsing the response. This
will cause problems when the next response from the helper enters Squid.

The bug exists in the helperHandleRead and helperReturnBuffer functions
in the src/helper.cc file.
After srv->rbuf is parsed, srv->rbuf still contains one byte (a
'\0' char) and srv->roffset is 1.

I am posting a patch which fixes the problem.
Also a second patch (helpers-fix-alternate.patch) to help you understand
the problem.

Regards,
   Christos



Aha! Thank you.

There is another slightly related bug here as well which I think we 
should clean out immediately. The attached patch fixes both of the below 
(1a, 1b) issues.


 1a) a layering issue between helperHandleRead() and 
helperReturnBuffer(). helperReturnBuffer is only dealing with the 
message sub-string, but has been made to memmove() the buffer contents, 
which means it has to be aware of these problematic terminal \0 
octet(s). However, helperHandleRead() is where the termination is done 
and where all the buffer information is handled.

 - need to do the memmove() in helperHandleRead()

 1b) helpers which return \r\n are still passed the location of the \n 
as the endpoint to work around (1a), even though the \r is also replaced 
with \0 and shortens the msg portion by one octet. This affects the 
HelperReply parser length checks on responses without a kv-pair.
  - need to pass the first of the termination \0 octets, not the last, 
to helperReturnBuffer(). This is made possible after fixing (1a).


Also,
 2) helperStatefulHandleRead() is not shuffling the buffer remainders 
forward for future handling, but is assuming only one response line is 
sent per read and dropping anything else.
  - we need to do this shuffling (a sketch follows below). It is 
documented but omitted from this patch, since it is not causing broken 
behaviour right now and I still have to track down the read() handlers 
and the large-response handling that need to be checked with that change.
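
For illustration only (not part of the attached patch), the missing
shuffle at that point might look like this, assuming t has already been
advanced past the terminator octets there:

    // replaces the unconditional srv->roffset = 0 reset, so any second
    // reply already buffered survives for the next parsing pass
    srv->roffset -= (t - srv->rbuf);
    memmove(srv->rbuf, t, srv->roffset);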



Amos
=== modified file 'src/helper.cc'
--- src/helper.cc   2012-11-24 14:30:02 +0000
+++ src/helper.cc   2012-12-04 23:17:03 +0000
@@ -848,9 +848,6 @@
 {
     helper_request *r = srv->requests[request_number];
     if (r) {
-        // TODO: parse the reply into new helper reply object
-        // pass that to the callback instead of msg
-
         HLPCB *callback = r->callback;

         srv->requests[request_number] = NULL;
@@ -883,15 +880,12 @@
                request_number << " from " << hlp->id_name << " #" <<
                srv->index + 1 <<
                " '" << srv->rbuf << "'");
     }
-    srv->roffset -= (msg_end - srv->rbuf);
-    memmove(srv->rbuf, msg_end, srv->roffset + 1);

     if (!srv->flags.shutdown) {
         helperKickQueue(hlp);
     } else if (!srv->flags.closing && !srv->stats.pending) {
         srv->flags.closing=1;
         srv->writePipe->close();
-        return;
     }
 }

@@ -936,10 +930,16 @@
         /* end of reply found */
         char *msg = srv->rbuf;
         int i = 0;
+        int skip = 1;
         debugs(84, 3, "helperHandleRead: end of reply found");

-        if (t > srv->rbuf && t[-1] == '\r' && hlp->eom == '\n')
-            t[-1] = '\0';
+        if (t > srv->rbuf && t[-1] == '\r' && hlp->eom == '\n') {
+            *t = '\0';
+            // rewind to the \r octet which is the real terminal now
+            // and remember that we have to skip forward 2 places now.
+            skip = 2;
+            --t;
+        }

         *t = '\0';

@@ -951,8 +951,8 @@
         }

         helperReturnBuffer(i, srv, hlp, msg, t);
-        // only skip off the \0 _after_ passing its location to helperReturnBuffer
-        ++t;
+        srv->roffset -= (t - srv->rbuf) + skip;
+        memmove(srv->rbuf, t, srv->roffset);
     }

     if (Comm::IsConnOpen(srv->readPipe)) {
@@ -1049,6 +1049,12 @@
         t += skip;

         srv->flags.busy = 0;
+        /**
+         * BUG: the below assumes that only one response per read() was received and discards any octets remaining.
+         *      Doing this prohibits concurrency support with multiple replies per read().
+         * TODO: check that read() setup on these buffers pays attention to roffset != 0
+         * TODO: check that replies bigger than the buffer are discarded and do not affect future replies
+         */
         srv->roffset = 0;
         helperStatefulRequestFree(r);
         srv->request = NULL;



Re: OK after long tests StoreID works for both mem and UFS store.

2012-12-04 Thread Amos Jeffries

On 05.12.2012 02:45, Eliezer Croitoru wrote:
<snip>


The helper code for now is too much for me; I will get back to it later
when Amos has more time and this part of the patch is ready, stable and
documented.


Christos found a bug which I expect is causing the "Unknown". Please 
try the patch proposed to fix that bug and see if it resolves the issues 
you found as well.


Amos