On 22.11.2012 08:06, Eliezer Croitoru wrote:
A long time ago there was a discussion about storing the original
request URL in the swap_file meta data.

Now it strikes me again while testing something.
The code in question:

http://bazaar.launchpad.net/~squid/squid/trunk/view/head:/src/StoreMetaURL.cc#L39

##start
bool
StoreMetaURL::checkConsistency(StoreEntry *e) const
{
    assert(getType() == STORE_META_URL);

    // this early return short-circuits the comparison below,
    // effectively disabling the URL consistency check
    debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL checkConsistency wasn't used");
    return true;

    if (!e->mem_obj->original_url)
        return true;

    if (strcasecmp(e->mem_obj->original_url, (char *) value)) {
        debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL mismatch");
        debugs(20, DBG_IMPORTANT, "\t{" << (char *) value << "} != {" << e->mem_obj->original_url << "}");
        return false;
    }

    return true;
}

##end

This code is responsible for checking the consistency of a cached
file/object's URL against the currently requested URL.
It is used in store_client.cc and was moved out of there in newer revisions.
The old revision 4338 states that the purpose of this code is:
"Check the meta data and make sure we got the right object."

The problem is that the check only happens while a file is being
fetched from UFS (as far as I have checked); objects served from RAM
are not checked at all.
The result is that when the store_url_rewrite feature is used, the
check reports an inconsistency between the request URL and the object
in cache_dir (naturally).
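
To illustrate the mismatch (this is not Squid code, and both URLs are
made-up examples): with store_url_rewrite the URL written into the swap
metadata is the rewritten store URL, so the strcasecmp in
checkConsistency compares two different strings and reports a mismatch
even though the object is the right one.

##start
// Illustrative only; the URLs are hypothetical examples.
#include <strings.h>   // strcasecmp
#include <cstdio>

int main()
{
    // what the client asked for vs. the store URL the object was cached under
    const char *requestUrl = "http://cdn3.example.com/videos/abc123.flv";
    const char *storedUrl  = "http://videos.internal.squid/abc123.flv";

    // same comparison style as checkConsistency(): any difference is a "mismatch"
    if (strcasecmp(storedUrl, requestUrl) != 0)
        std::printf("URL mismatch: {%s} != {%s}\n", storedUrl, requestUrl);

    return 0;
}
##end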

Disabling this check would make my life much easier with store_url,
taking it from "not working" to "working".

So I have a couple of options for how to "fix" the issue:
1. disable this check.
2. disable this check only for store_url_rewritten requests (see the
sketch after this list).
3. add the store_url meta object into the cache file and use it to
identify the expected URL.
4. add an on/off switch to disable this check.
5. others?
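
A minimal sketch of what (2) could look like, assuming a hypothetical
mem_obj->store_url field that is non-NULL only for store-url-rewritten
entries (the real field name and plumbing would differ):

##start
// Sketch only, under the assumption of a hypothetical
// e->mem_obj->store_url set only for store-url-rewritten requests.
bool
StoreMetaURL::checkConsistency(StoreEntry *e) const
{
    assert(getType() == STORE_META_URL);

    // option (2): the metadata holds the rewritten store URL, so comparing
    // it against the request URL is meaningless here; skip the check.
    if (e->mem_obj->store_url)
        return true;

    if (!e->mem_obj->original_url)
        return true;

    if (strcasecmp(e->mem_obj->original_url, (char *) value)) {
        debugs(20, DBG_IMPORTANT, "storeClientReadHeader: URL mismatch");
        debugs(20, DBG_IMPORTANT, "\t{" << (char *) value << "} != {" << e->mem_obj->original_url << "}");
        return false;
    }

    return true;
}
##end

Option (3) would go a step further and compare (char *) value against
the stored store_url instead of skipping, which would keep the
consistency guarantee for rewritten entries as well.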

After a short talk with Alex I sat down and made some calculations
about MD5 collision risks.
The index hash is computed from a string of "byte + URL".
For most caches that I know of there is a very low probability of
collision, considering the number of objects and URLs.

Yes, we are talking about many, many objects and a collision is
possible, but it is not only the URL hash: there are other unknowns,
like request and response headers, which push the whole calculation
even further from a real-world hit, taking the chance of collision from
about 1 in 2^64 to better than 1 in 2^124.
It seems to me it will take quite a long time before I see a hash
collision (I never have).
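
For reference, a rough back-of-the-envelope version of that figure
using the standard birthday approximation p ~= n^2 / 2^(b+1); the
object count below is an assumed example, not a measurement from any
real cache.

##start
#include <cmath>
#include <cstdio>

int main()
{
    const double bits = 128.0;      // MD5 digest width used for the store key
    const double objects = 1.0e9;   // assumed: a very large cache, 10^9 objects

    // birthday approximation: log2(p) ~= 2*log2(n) - (bits + 1)
    const double log2p = 2.0 * std::log2(objects) - (bits + 1.0);
    std::printf("chance of at least one collision ~= 2^%.1f\n", log2p);

    return 0;
}
##end

Even at 10^9 objects that works out to roughly one chance in 2^69, so
the 1-in-2^64 figure above is, if anything, on the pessimistic side.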

What do you think?

I think the usual methods of calculating hash collisions are a little biased towards an even distribution of bytes, whereas real-world URL space is a lot tighter, with far greater similarity between any two similar-length URLs than in normal text of the same length. I'm not certain what effect this has on the hash or how best to compensate, though.


Have you seen a real-world collision?

No.

I'm leaning towards the opinion that we should try (2), given that non-UFS stores are skipping it anyway.

Amos
