Re: Content-Encoding and storage format

2004-03-02 Thread Jon Kay
> Applying Content-Encoding in an accelerator makes sense, and can be done
> reasonably well. Applying Content-Encoding in a general purpose Internet
> proxy is a different beast and you then need to be very careful.

Yes, indeed.

Looking at the spec, I've decided to add a squid.conf flag to turn
content encoding off if desired.  That seems like a good idea anyway
for other reasons.
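
For illustration, such a knob might look like this in squid.conf (the
directive name here is made up; nothing has been settled yet):

```
# Hypothetical directive name -- not a real squid.conf option yet.
# "off" disables all content encoding/decoding by this Squid.
content_encoding off
```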

> A recoded object such as gzip can be regarded semantically equivalent
> providing the user-agent knows how to decode gzip, but are obviously not
> binary equivalent to the non-encoded entity.  If you are 100% certain that
> all user-agents ever accessing contents from this server accepts gzip
> content-encoding then you may use the same weak ETag for both original and
> encoded, but if there ever is cases where clients should get the original
> then you must not, as if you do you instruct downstream caches the gzip
> and original are equivalent regardless of what the client accepts.

I think our decision not to keep just encoded versions around
immunizes us from that one; I don't see how a redecoding could arise,
as encoded  versions follow different paths to encoding-accepting
clients than decoded versions to unaccepting, purist clients.

Now, one troubling aspect to this is that different caches can
generate different valid encodings of the same object.  Can you guys think
of an action path by which that could produce corrupt results?


Jon




Re: Content-Encoding and storage format

2004-03-02 Thread Henrik Nordstrom
On Tue, 2 Mar 2004, Jon Kay wrote:

> I think our decision not to keep just encoded versions around
> immunizes us from that one; I don't see how a redecoding could arise,
> as encoded  versions follow different paths to encoding-accepting
> clients than decoded versions to unaccepting, purist clients.

I do not quite follow what you are saying here.

The issue is not what happens within a single Squid, but what happens at
the clients or in a cache mesh.

> Now, one troubling aspect to this is that different caches can
> generate different valid encodings of the same object.  Can you guys think
> of an action path by which that could produce corrupt results?

There are plenty in the case of range merging, if you consider any
multi-path situation:

a) Clients having multiple paths, such as a laptop moving between networks 
where recoding is used and where not.

b) Secondary caches with clients of different classes.

c) A change of encoding details (compression level etc.) making the object 
binary-different.


If you modify the ETag to include details on how the object has been
recoded, then you are immune, as each variant then has a different identity.
Also, if you use weak ETags you are mostly immune to your own actions, but
there are secondary caching implications where clients may get a different
encoding than expected, because the two are declared semantically 
equivalent.

If you allow strong ETags which do not account for the recoding, or there
is no ETag, then you must be very careful to ensure that under no 
conditions are there multiple paths between the clients and the recoding 
proxy, and that the recoding parameters are not modified.

Regards
Henrik



next version of content-encoding / gzip design doc

2004-03-02 Thread Jon Kay
Here's a new version of the design document, that incorporates the
results of your suggestions.
I hope this is better...


Jon


Gzip Content-Encoding in Squid Design

Version Choice

The goal will be to get these changes into Squid3 HEAD.

Content-Encoding Protocol

Because current browser implementations treat Content-Encoding much as
though it were Transfer-Encoding, we will implement Content-Encoding and
Accept-Encoding as though they were actually the Transfer-Encoding and
TE mechanisms described in the HTTP specifications.

ETags of replies encoded by Squid will be weakened, turning them into
weak ETags if they are not already weak.

There will be a configuration option to turn off content-encoding.

Content-Encoding Implementation

A new HttpHdrContCode module that parses the related HTTP headers and
arranges for encoding or decoding as appropriate. It includes the
following functions:

   * codeParseRequest(): Called from client_side:parseHttpRequest()
 after the clientStreamInit() call. Checks for and parses
 Accept-Encoding headers. Instantiates content_coding appropriately,
 and calls codeClientStreamInit().
   * codeClientStreamInit(): Adds a new node to clientStream with
 codeStreamRead(), codeStreamCallback(), and codeStreamStatus()
 functions.
   * codeStreamCallback(): sets up encoding/decoding state depending on
 the combination of Content-Encoding and Accept-Encoding fields seen.
   * codeStreamRead(): call HttpContentCoder transformation functions
 appropriately.
   * codeStreamStatus(): report status to stream.
   * codeDupNode(): Alloc new store_entry and insert new clientStream
 dup node (see below) to (v?)copy data to store_entry as well as
 reply.

New HttpContentCoder abstract type, with functions:

   * encodeStart()
   * encodeEnd()
   * encodeChunk()
   * decodeStart()
   * decodeEnd()
   * decodeChunk()

New per-coded-object ContentCoderState type, to handle coding state. It
will be referenced from the clientStream, and includes the fields:

   * HttpContentCoder *coder
   * off_t codedOffset
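
In C (Squid's implementation language), the abstract coder and its
per-object state could be sketched roughly as below; all of the chunk-hook
signatures are assumptions for illustration, not settled interfaces:

```c
#include <string.h>
#include <sys/types.h>

/* Abstract content coder: a table of hooks that a concrete coder
 * (e.g. GzipContentCoder) fills in.  The chunk hooks transform `len`
 * bytes from `in` into `out` and return the output length. */
typedef struct _HttpContentCoder {
    void   (*encodeStart)(void *state);
    size_t (*encodeChunk)(void *state, const char *in, size_t len,
                          char *out, size_t outsize);
    void   (*encodeEnd)(void *state);
    void   (*decodeStart)(void *state);
    size_t (*decodeChunk)(void *state, const char *in, size_t len,
                          char *out, size_t outsize);
    void   (*decodeEnd)(void *state);
} HttpContentCoder;

/* Per-coded-object coding state, referenced from the clientStream. */
typedef struct _ContentCoderState {
    HttpContentCoder *coder;    /* which coder handles this object */
    off_t codedOffset;          /* progress through the coded stream */
} ContentCoderState;

/* Trivial identity coder, only to show how the table is filled in. */
static size_t
identityChunk(void *state, const char *in, size_t len,
              char *out, size_t outsize)
{
    (void)state;
    (void)outsize;
    memcpy(out, in, len);
    return len;
}

static HttpContentCoder identityCoder = {
    NULL, identityChunk, NULL, NULL, identityChunk, NULL
};
```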

Objects will be stored in both unencoded and encoded formats. An object
will stay in the format in which Squid receives it until it is requested
by a client asking for a different Content-Encoding which Squid supports
(this could be immediate). Once that happens, the object will be streamed,
coded, into a different StoreEntry and on to the client.

A new store_dup module will be created to manage dup store_entries and
make sure duplicate entries are invalidated when a new version of an
object is read. It consists of a circular list of StoreEntry pointers
named "dupnext" and "dupprev". When a new duplicate encoding (or
decoding) of an object is created, it is added to the list. When any
StoreEntry is invalidated or updated, all dups are invalidated. Functions:

   * storeNewDup(): called from codeDupNode(), above, and creates new
 node with the dup'ed node attached via the dup list.
   * storeDupClientStreamInit(): called from codeDupNode(), and adds
 new clientStreamNode to copy off encoded data to new node as well
 as reply.
   * storeDupClientStreamRead(): does copying off.
   * storeDupClientStreamCallback(): null function
   * storeDupClientStreamStatus(): returns status
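
The circular dup list might look like this in outline (a toy StoreEntry
carrying only the dup fields from the design; the real structure is much
larger):

```c
#include <stdbool.h>

/* Toy StoreEntry carrying only the dup-list fields described above. */
typedef struct _StoreEntry {
    struct _StoreEntry *dupnext;
    struct _StoreEntry *dupprev;
    bool valid;
} StoreEntry;

/* A lone entry forms a one-element circular list. */
static void
storeDupInit(StoreEntry *e)
{
    e->dupnext = e->dupprev = e;
    e->valid = true;
}

/* storeNewDup(): link a freshly coded duplicate into orig's list. */
static void
storeNewDup(StoreEntry *orig, StoreEntry *dup)
{
    dup->valid = true;
    dup->dupnext = orig->dupnext;
    dup->dupprev = orig;
    orig->dupnext->dupprev = dup;
    orig->dupnext = dup;
}

/* Invalidation walks the whole ring, so every variant dies together. */
static void
storeDupInvalidateAll(StoreEntry *e)
{
    StoreEntry *p = e;
    do {
        p->valid = false;
        p = p->dupnext;
    } while (p != e);
}
```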

Other changes needed:

   * Add a new content_coding field to HttpReply.
   * New httpHeaderGetContentEncoding(HttpReply *) function in
 HttpHeader.cc.
   * HttpReply::httpReplySetHeaders will weaken the ETag if appropriate.
   * A new configuration flag to turn content-encoding off, if desired.

Gzip

A new GzipContentCoder module, which will be an instance of
HttpContentCoder.

Data encoding will be handled by the gzip.org zlib library.

Functions:

   * gzEncodeStart: call deflateInit2(), write header
   * gzEncodeEnd: write trailer
   * gzEncodeChunk: call deflate()
   * gzDecodeStart: call inflateInit2(), read and verify header
   * gzDecodeEnd: verify trailer
   * gzDecodeChunk: call inflate()
   * gzDoSaveEncoded(): true

Test Strategy

Must pass the test suite.

Must add appropriate tests, including successfully sending gzipped
content to oneself.

Will also test against Apache mod_gzip implementation, and maybe even
gunzip.