On 05.07.2012 10:00, Alex Rousskov wrote:
On 06/27/2012 03:12 AM, Amos Jeffries wrote:

A quick review of the other major bugs shows that each will take some
large design and code changes to implement a proper fix or even a
workaround.


Are there any objections to ignoring these bugs when considering a 3.2
stable release:

Our definition of a "stable release" has two criteria:

1. "Meant for production caches."

2. "begin when all known major bugs have been fixed [for 14 days]."

Criterion #1 should probably be interpreted as "Squid Project considers
the version suitable for production deployment". If you think we are
there, I have no objections -- I do not have enough information to say
whether enough users will be satisfied with current v3.2 code in
production today. Perhaps this is something we should ask on squid-users
after we close all bugs that we think should be closed?

As for Criterion #2, your question means that either we stop considering
those bugs as major OR we change criterion #2. IMHO, we should adjust
that criterion so that we do not have to play these games where we mark
something as a major bug but then decide that in the interest of a
speedier "stable" designation we are going to "ignore" it.

An adjusted criterion could be phrased as

2'.  "begin when #1 is satisfied for at least 14 days"


This gives us enough flexibility to release [what we consider
suitable-for-production] code that might have major bugs in some
environments. I added "at least" because otherwise we may have to
release v3.3 as stable 14 days after v3.2 is marked stable :-). In
practice, the version should have "enough improvements" to warrant its
numbering and its release, but I do not want to digress into that
discussion.



3124 - Cache manager stops responding when multiple workers used
** requires implementing non-blocking IPC packets between workers and
coordinator.

Has this been discussed somewhere? IPC communication is already
non-blocking so I suspect some other issue is at play here. The specific
examples of mgr commands in the bug report (userhash, sourcehash,
client_list, and netdb) seem non-essential in most environments and,
hence, do not justify the "major" designation, but perhaps they indicate
some major implementation problem that must be fixed.


UNIX sockets apparently guarantee that write() blocks until the
recipient process has read() the packet. That means each IPC packet is
stuck behind whatever long AsyncCall or delay the recipient has in
progress. Last I looked, the coordinator handling function also called
the component handler functions synchronously so they could create the
response IPC packet.

AFAIK this is waiting on the Subscription and generic (immediate-ACK)
IPC packets, which will free up the coordinator and workers for other
async operations even while a large operation is underway.



3389 - Auto-reconnect for tcp access_log
** requires asynchronous handling of log opening, which currently
blocks Squid operation

Since we have stable file-based logging, this bug does not have to block
a "stable" designation if TCP logging is declared "experimental". You
already have a patch that addresses 90% of the core problem for those
who care.

If you do not want to mark TCP logging as experimental and highlight
this shortcoming, then the bug ought to be fixed IMHO because there is
consensus that accurate logging is critical for many deployments.


3478 - Host verify catching dynamic CDN hosted sites
  ** requires designing a CONNECT and bump handling mechanism

I am not an expert on this, but it feels like we are trying to enforce a [good] rule ignored by the [bad] real world, especially in interception
environments. As a result, Squid lies and scares admins for no good
reason (in most cases). We will not win this battle.

I suggest that the "host_verify_strict off" behavior is adjusted to
cause no harm, even if some malicious requests will get through.


It does that now. "No harm" means we cannot rewrite the request headers
to something we are not sure about; doing so would actively cause
problems if we got it wrong. The current state is that Squid goes
DIRECT instead of through peers, which breaks interception+cluster
setups.


I can open that up again, but it will mean updating the CVE to indicate 2nd-stage proxies are still vulnerable.


If you do not want to do that, please add a [fast] ACL so that admins
are not stuck without a solution and can whitelist bad (or all) sites.
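For illustration only, such a knob might look something like the
following in squid.conf. The host_verify_exempt directive is
hypothetical and does not exist; only host_verify_strict is real here.

```
# Illustrative sketch only: host_verify_exempt is a hypothetical
# directive, shown to convey the shape of the proposed [fast] ACL.
acl brokenCdn dstdomain .example-cdn.invalid
host_verify_exempt allow brokenCdn
host_verify_strict off
```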


That said, the bug report itself does not explicitly say that something
is _seriously_ broken, does it? I bet the cache.log messages are
excessive on any busy site with a diverse user population, but we can
rate-limit these messages and downgrade the severity of the bug while
waiting for a real use case where these new checks break things (despite
host_verify_strict being off).


cache_peer relay is almost completely "disabled" for some major sites. Everything else works well.



3517 - Workers ldap digest
  ** requires SMP atomic access support for all user credentials

This is not a blocker IMO. SMP has several known limitations, complex
authentication schemes being one of them. This does not affect stability
of supported SMP configurations.


Okay, thank you.


Which would leave us with only these to locate (any takers?) :

3551 - store_rebuild.cc:116: "store_errors == 0" assertion

It would be nice to figure this one out, at least for ufs, because many folks will try ufs with SMP and there is clearly some kind of corruption
problem there. I assigned the bug to myself for now.

However, if I cannot reproduce it, I will not be able to make much
progress. Please note that the original reporter moved on to rock store
and no longer considers this bug to affect him (per comment #10).


3556 - assertion failed: comm.cc:1093: "isOpen(fd)"

I recommend adding a guard around the comm_close() call in the
Connection destructor to avoid calling it for !isOpen(fd) orphan
connections, and printing the value of isOpen() in the BUG message.


Aha.


3562 - StoreEntry::kickProducer Segmentation fault

I suspect Squid is corrupting its own memory somewhere, so this
specific core dump cannot be trusted. This might even be the same
problem as bug 3551 above. This could be considered a blocker at least
until we know more, I guess.


Thank you.

Amos
