Web Content Compression FAQ - new update

Slava Bizyayev Sun, 10 Apr 2005 12:33:02 -0700

Hi everyone,

Attached is a new version of the FAQ in POD format. I need someone to
update the website with it. I would also ask that the first
paragraph of the FAQ be added directly to
http://perl.apache.org/docs/tutorials/index.html in order to unify the
structure with the other tutorials.


As always, any comments on the FAQ itself are very welcome.

Thanks,
Slava

=head1 NAME

Web Content Compression FAQ

=head1 Basics of Content Compression

Compression of outbound Web server traffic benefits both Web
clients, who see shorter response times, and content providers, who
experience lower bandwidth consumption.

To date, content compression for Web servers has been provided mainly
through the C<gzip> encoding.
Other (non-Perl) modules are available that provide so-called C<deflate>
compression.
The two approaches are currently very similar: both use the LZ77 algorithm
combined with Huffman coding.
Luckily, most of us have no real need to understand all the details of
the underlying mathematics in order to make use of them.
Apache handlers available from CPAN can usually do the dirty work.
Apache addresses content compression through handlers configured in its
configuration file.

Compression is, by its nature, a content filter:
it takes its input, typically plain text data, converts it
to another, C<binary> form, and outputs the result to some destination.
That is why a content compression handler usually belongs to a particular
chain of handlers
within the content generation phase of the request-processing flow.

A C<chain of handlers> is another common term that is good to know when
you plan to compress data.
Two such chains have been developed for Apache 1.3:
C<Apache::OutputChain> and C<Apache::Filter>.
Keep in mind that a compression handler developed for one chain
usually fails inside the other.

Another important point deals with the order of execution of handlers in a 
particular chain.
It's pretty straightforward in C<Apache::Filter>.  For example, when you 
configure...

  PerlModule Apache::Filter
  <Files ~ "\.blah$">
    SetHandler perl-script
    PerlSetVar Filter On
    PerlHandler Filter1 Filter2 Filter3
  </Files>

...the content will go through C<Filter1> first, then the result will be 
filtered by C<Filter2>,
and finally C<Filter3> will be invoked to make the final changes in outbound 
data.

However, when you configure C<Apache::OutputChain> like...

  PerlModule Apache::OutputChain
  PerlModule Apache::GzipChain
  PerlModule Apache::SSIChain
  PerlModule Apache::PassHtml
  <Files *.html>
    SetHandler perl-script
    PerlHandler Apache::OutputChain Apache::GzipChain Apache::SSIChain Apache::PassHtml
  </Files>

...execution begins with C<Apache::PassHtml>.  Then the content will be 
processed with C<Apache::SSIChain>
and finally with C<Apache::GzipChain>.  C<Apache::OutputChain> will not be 
involved in content processing at all.
It is there only for the purpose of joining other handlers within the chain.

It is important to remember that the content compression handler should always 
be the last executable handler in any chain.
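
For instance, with C<Apache::Filter> the compression handler goes at the end of
the C<PerlHandler> list. The following is only a sketch: C<Filter1> and C<Filter2>
are the placeholder filters from the example above, and C<Apache::Compress>
(described later in this FAQ) stands in for whichever compression handler you choose:

  PerlModule Apache::Filter
  PerlModule Apache::Compress
  <Files ~ "\.blah$">
    SetHandler perl-script
    PerlSetVar Filter On
    # the compression handler comes last, after all content filters
    PerlHandler Filter1 Filter2 Apache::Compress
  </Files>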

Another important practical problem with Web content compression is
that some buggy Web clients declare in their HTTP requests that they can
receive and decompress gzipped data,
but fail to keep that promise when an actual compressed response arrives.
This problem is addressed by the C<Apache::CompressClientFixup> handler,
which serves the C<fixup> phase of the request-processing flow.
It is compatible with all known compression handlers and is available from CPAN.

=head1 Q: Why is it important to compress Web content?

=head2 A: Reduced equipment costs and the competitive advantage of dramatically 
faster page loads.

Web content compression noticeably increases delivery speed to clients
and may allow providers to serve higher content volumes without increasing 
hardware expenditures.
It visibly reduces actual content download time, a benefit most apparent to 
users of dialup and high-traffic connections.

Industry leaders like I<Yahoo> and I<Google> are widely using content 
compression in their businesses.

=head1 Q: How much improvement can I expect?

=head2 A: Effective compression can achieve increases in transmission 
efficiency from 3 to 20 times.

The compression ratio is highly content-dependent.
For example, if the compression algorithm is able to detect repeated patterns 
of characters,
compression will be greater than if no such patterns exist.
You can usually expect to realize an improvement of 3 to 20 times on
regular HTML,
JavaScript, and other ASCII content.
I have seen peak HTML file compression improvements in excess of 200
times,
but such occurrences are infrequent.
On the other hand, I have never seen ratios of less than 2.5 times on text/HTML
files.
Image files normally employ their own compression techniques that reduce the 
advantage of further compression.

=for html
<blockquote>

On May 21, 2002 Peter J. Cranstone wrote to the [EMAIL PROTECTED] mailing list:

I<"...With 98% of the world on a dial up modem, all they care about is how long 
it takes to download a page.  It doesn't matter if it consumes a few more CPU 
cycles if the customer is happy.  It's cheaper to buy a newer faster box, than 
it is to acquire new customers.">

=for html
</blockquote>

=head1 Q: How hard is it to implement content compression on an existing site?

=head2 A: Implementing content compression on an existing site typically 
involves no more than installing and configuring an appropriate Apache handler 
on the Web server.

This approach works in most of the cases I have seen.
In some special cases you will need to take extra care with respect to the 
global architecture of your Web application,
but such cases may generally be readily addressed through various techniques.
To date I have found no fundamental barriers to practical implementation of Web 
content compression.

=head1 Q: Does compression work with standard Web browsers?

=head2 A: Yes. No client side changes or settings are required.

All modern browsers claim to handle compressed content and
decompress it on the fly,
transparently to the user.
There are some known bugs in some old browsers, but these can be taken into 
account
through appropriate configuration of the Web server.

I strongly recommend use of the C<Apache::CompressClientFixup> handler in your 
server configuration
in order to prevent compression for known buggy clients.

=head1 Q: Is it possible to combine the content compression with data 
encryption?

=head2 A: Yes.  Compressed content can be encrypted and securely transmitted 
over SSL.

On the client side, the browser transparently decrypts and decompresses the
content for the user.
It is important to maintain the correct order of operations on the server side to
keep the transaction secure:
you must compress the content first and then apply the encryption mechanism.
This is the only order of operations current browsers support.

=head1 Q: What software is required on the server side for content compression?

=head2 A: There are four known mod_perl modules/packages for Web content 
compression available to date for Apache 1.3 (in alphabetical order):

=over 4

=item * Apache::Compress

a mod_perl handler developed by Ken Williams (U.S.), C<Apache::Compress>,
can generate gzip output through the C<Apache::Filter>.
This module accumulates all incoming data and compresses the entire content 
body as a unit.

=item * Apache::Dynagzip

a mod_perl handler developed by Slava Bizyayev, C<Apache::Dynagzip> uses the 
gzip format
to compress output dynamically through the C<Apache::Filter> or through the 
internal Unix pipe.

C<Apache::Dynagzip> is most useful when one needs to compress dynamic outbound 
Web content
(generated on the fly from databases, XML, etc.) when content length is not 
known at the time of the request.

C<Apache::Dynagzip>'s features include: 

=over 4

=item * Support for both HTTP/1.0 and HTTP/1.1.

=item * Control over the chunk size on HTTP/1.1 for on-the-fly content 
compression.

=item * Support for Perl, Java, or C/C++ CGI applications.

=item * Advanced control over the proxy cache with the configurable C<Vary> 
HTTP header.

=item * Optional control over content lifetime in the client's local cache with 
the configurable C<Expires> HTTP header.

=item * Optional support for server-side caching of the dynamically generated 
(and compressed) content.

=item * Optional extra-light compression

removal of leading blank spaces and/or blank lines, which works for all 
browsers,
including older ones that cannot uncompress gzip format.

=back

=item * Apache::Gzip

an example mod_perl filter developed by Lincoln Stein and Doug MacEachern (U.S.)
for their book I<Writing Apache Modules with Perl and C>,
which, like C<Apache::Compress>, works through C<Apache::Filter>.
C<Apache::Gzip> is not available from CPAN.
The source code may be found on the book's companion Web site at
L<http://www.modperl.com/>.

=item * Apache::GzipChain

a mod_perl handler developed by Andreas Koenig (Germany),
which compresses output through C<Apache::OutputChain> using the gzip format.

C<Apache::GzipChain> currently provides in-memory compression only.
Use of this module under C<perl-5.8> or higher is appropriate for Unicode data.
UTF-8 data passed to C<Compress::Zlib::memGzip()> are converted to raw UTF-8 
before compression takes place.
Other data are simply passed through.

=back

=head1 Q: What is the typical overhead in terms of CPU use for the content 
compression?

=head2 A: Typical CPU overhead that originates from content compression is 
insignificant.

In my observations, compressing files of up to 200K takes less
than 60 ms of CPU time on a 3 GHz P4 processor.
I could not measure the lower boundary reliably for dynamic compression,
because there is no really measurable latency.
From the perspective of global architecture and scalability planning,
I would suggest allowing some 10 ms per request on I<regular> Web pages
in order to roughly estimate the performance of the application server.

Estimating connection times is an even less exact matter, for a variety of
possible network-related reasons.
The worst-case scenario is pretty impressive: a slow dialup connection through
an ISP with no proxy/buffering
holds the provider's socket for a time interval proportional to the size of
the requested file.
At present, gzip reduces this connection time by a factor of approximately 3-20.
If the ISP buffers its traffic, however, the content provider might not feel a
dramatic impact -- apart from the fact
that they are paying their telecom providers for the transmission of
considerable unnecessary data.

=head1 Q: Is it possible to compress the output from C<Apache::Registry> with 
C<Apache::Dynagzip>?

=head2 A: Yes.  This should be fairly easy to accomplish, as follows:

If your page/application is initially configured like this:

  <Directory /path/to/subdirectory>
    SetHandler perl-script
    PerlHandler Apache::Registry
    PerlSendHeader On
    Options +ExecCGI
  </Directory>

you might replace it with the following:

  PerlModule Apache::Filter
  PerlModule Apache::Dynagzip
  PerlModule Apache::CompressClientFixup
  <Directory /path/to/subdirectory>
    SetHandler perl-script
    PerlHandler Apache::RegistryFilter Apache::Dynagzip
    PerlSendHeader On
    Options +ExecCGI
    PerlSetVar Filter On
    PerlFixupHandler Apache::CompressClientFixup
    PerlSetVar LightCompression On
  </Directory>

You should usually be all set after that.

More generally, you will need to replace the line:

    PerlHandler Apache::Registry

in your initial configuration file with the following lines:

    PerlHandler Apache::RegistryFilter Apache::Dynagzip
    PerlSetVar Filter On
    PerlFixupHandler Apache::CompressClientFixup

Optionally, you might add:

    PerlSetVar LightCompression On

to reduce the size of the stream for clients unable to speak gzip (like 
I<Microsoft Internet Explorer> over HTTP/1.0).

Finally, make sure you have declared somewhere:

  PerlModule Apache::Filter
  PerlModule Apache::Dynagzip
  PerlModule Apache::CompressClientFixup

This basic configuration uses many defaults.  See C<Apache::Dynagzip> POD for 
further fine tuning if required.

Note, however, that C<Apache::RegistryFilter> is not I<yet another> 
C<Apache::Registry>.
You may need to adjust your script in accordance with the requirements of
C<Apache::Filter> first,
especially when the script generates any CGI/1.1-specific HTTP headers.
You can test your compatibility with the C<Apache::Filter> chain using a 
temporary configuration like:

  PerlModule Apache::Filter
  <Directory /path/to/subdirectory>
    SetHandler perl-script
    PerlHandler Apache::RegistryFilter
    PerlSendHeader On
    Options +ExecCGI
    PerlSetVar Filter On
  </Directory>

with no C<Apache::Dynagzip> involved.
See C<Apache::Filter> documentation if you have any problems.

=head1 Q: Is it possible to compress output from a Mason-driven application 
with C<Apache::Dynagzip>?

=head2 A: Yes. C<HTML::Mason::ApacheHandler> is compatible with the 
C<Apache::Filter> chain.

If your application is initially configured like:

  PerlModule HTML::Mason::ApacheHandler
  <Directory /path/to/subdirectory>
    <FilesMatch "\.html$">
      SetHandler perl-script
      PerlHandler HTML::Mason::ApacheHandler
    </FilesMatch>
  </Directory>

you may wish to replace it with the following:

  PerlModule HTML::Mason::ApacheHandler
  PerlModule Apache::Dynagzip
  PerlModule Apache::CompressClientFixup
  <Directory /path/to/subdirectory>
    <FilesMatch "\.html$">
      SetHandler perl-script
      PerlHandler HTML::Mason::ApacheHandler Apache::Dynagzip
      PerlSetVar Filter On
      PerlFixupHandler Apache::CompressClientFixup
      PerlSetVar LightCompression On
    </FilesMatch>
  </Directory>

You should be all set after that.

More generally, you will need to replace the line:

    PerlHandler HTML::Mason::ApacheHandler

in your initial configuration file with the following lines:

    PerlHandler HTML::Mason::ApacheHandler Apache::Dynagzip
    PerlSetVar Filter On
    PerlFixupHandler Apache::CompressClientFixup

Optionally, you might add:

    PerlSetVar LightCompression On

to reduce the size of the stream for clients unable to speak gzip (like 
I<Microsoft Internet Explorer> over HTTP/1.0).

Finally, make sure you have declared somewhere:

  PerlModule Apache::Dynagzip
  PerlModule Apache::CompressClientFixup

This basic configuration uses many defaults.  See C<Apache::Dynagzip> POD for 
further fine tuning.

=head1 Q: Is commercial support available for C<Apache::Dynagzip>?

=head2 A: Yes.  I<Slav Company, Ltd.> provides commercial support for 
C<Apache::Dynagzip> worldwide.

Since the author of C<Apache::Dynagzip> is employed by Slav Company, service is 
effective and consistent.

=head1 Q: Why is it important to maintain control over the chunk size?

=head2 A: It helps to reduce response latency.

C<Apache::Dynagzip> is the only handler to date that begins transmission of 
compressed data as soon
as the initial uncompressed pieces of data arrive from their source, at a time 
when the source process
may not even have completed generating the full document it is sending.
Transmission can therefore take place concurrently with creation of later 
document content.

This feature is mainly beneficial for HTTP/1.1 requests, because HTTP/1.0 does 
not support chunks.

I would also mention that the internal buffer in C<Apache::Dynagzip> always
prevents Apache
from creating chunks that are too short over HTTP/1.1, or from transmitting
pieces of data that are too short over HTTP/1.0.

=head1 Q: Is it worthwhile to strip leading blank spaces prior to gzip 
compression?

=head2 A: Yes.  It is usually worthwhile to do this.

The benefits of blank space stripping are mostly significant for non-gzipped 
data transmissions.
One can expect some 5-20% reduction in stream size on regular 'structured' 
HTML, JavaScript, CSS, XML, etc.,
in this case at negligible cost in terms of CPU overhead and response delay.

After applying gzip compression, the benefits of previously applied blank space 
stripping are usually reduced
to some 0.5-1.0% of the resulting size, because gzip compresses blank spaces 
very effectively.
It is still worthwhile, however, to perform blank space stripping because:

=over 4

=item * chances are that your handler will ultimately have to send an 
uncompressed response back to a known buggy client;

=item * it really costs next-to-nothing, and every little bit helps to reduce 
the cost of data transmission, especially considering the cumulative effect of 
frequent repetitions.

=back

=head1 Q: Are there any content compression solutions for vanilla Apache 1.3?

=head2 A: Yes.  There are two compression modules written in C that are 
available for vanilla Apache 1.3:

=over 4

=item * mod_deflate

an Apache handler written in C by Igor Sysoev (Russia).

=item * mod_gzip

an Apache handler written in C, originally by Kevin Kiley, I<Remote 
Communications, Inc.> (U.S.)

=back

See their respective documentation for further details.
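
As a rough illustration only, a minimal C<mod_gzip> setup might look like the
following. The MIME patterns are assumptions chosen for this sketch; check the
directive names and defaults against the documentation of the C<mod_gzip>
version you install:

  <IfModule mod_gzip.c>
    mod_gzip_on Yes
    # compress textual responses only
    mod_gzip_item_include mime ^text/
    # images are already compressed
    mod_gzip_item_exclude mime ^image/
  </IfModule>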

=head1 Q: Can I compress the output of my site at the application level?

=head2 A: Yes, if your Web server is CGI/1.1 compatible and allows you to 
create specific HTTP headers from your application, or when you use an 
application framework that carries its own handler capable of compressing 
outbound data.

For example, vanilla Apache 1.3 is CGI/1.1 compatible.
It allows development of CGI scripts/programs that generate compressed
outgoing streams
accompanied by the appropriate HTTP headers.

Alternatively, on mod_perl enabled Apache, some application environments carry 
their own compression code
that can be activated through appropriate configuration:

C<Apache::ASP> does this with the C<CompressGzip> setting;

C<Apache::AxKit> uses the C<AxGzipOutput> setting to do this.

See the documentation for the particular packages for details.
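
A minimal sketch of both settings, assuming the packages are already installed
and otherwise configured for the site (the values shown follow the usual
conventions of each package; verify them against its documentation):

  # Apache::ASP
  PerlSetVar CompressGzip 1

  # AxKit
  AxGzipOutput On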

=head1 Q: Are there any content compression solutions for Apache-2?

=head2 A: Yes.  A core compression module written in C, C<mod_deflate>, is 
available for Apache-2.

C<mod_deflate> for Apache-2 was written by Ian Holsman (USA).

Despite its name, C<mod_deflate> for Apache-2 provides C<gzip>-encoded content.
In accordance with the concept of output filters that was introduced in 
Apache-2,
C<mod_deflate> is capable of gzipping outbound traffic from any content 
generator, including CGI, Java, mod_perl, etc.

=over 4

=item * This module supports flushing.

=item * It is output filter-compatible.

=item * It has its own set of configuration options to maintain control over 
buggy clients.

=back
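
A minimal configuration sketch, assuming C<mod_deflate> is compiled in or
loaded; the list of MIME types is only an example and should be adapted to
your content:

  <IfModule mod_deflate.c>
    # gzip textual responses from any content generator
    AddOutputFilterByType DEFLATE text/html text/plain text/xml
  </IfModule>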

=head1 Q: When is C<Apache::Dynagzip> supposed to be ported to Apache-2?

=head2 A: There are no current plans to port C<Apache::Dynagzip> to Apache-2:

C<mod_deflate> for Apache-2 is capable of providing all basic functionality 
required for effective
dynamic content compression.
The rest can easily be addressed by small, specific accompanying filters.
For instance, C<Apache::Clean>, which is already ported to Apache-2, can be used
to strip unnecessary blank spaces from outbound streams.

=head1 Q: Where can I read the original descriptions of the C<gzip> and 
C<deflate> formats?

=head2 A: The C<gzip> format is published as RFC 1952, and the C<deflate> format is
published as RFC 1951.

You can find many mirrors of RFC archives on the Internet.
Try, for instance, my favorite at L<http://www.ietf.org/rfc.html>

=head1 Q: Are there any known compression problems with specific browsers?

=head2 A: Yes.  Netscape 4 has problems with compressed cascading style sheets 
and JavaScript files.

You can use C<Apache::CompressClientFixup> to disable compression for these 
files dynamically on Apache-1.3.
C<Apache::Dynagzip> is capable of providing so-called C<light compression> for 
these files.

On Apache-2, C<mod_deflate> can be configured to disable compression for these 
files dynamically,
and the C<Apache::Clean> filter can be used to strip unnecessary blank spaces.
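
On Apache-2 this is commonly expressed with C<BrowserMatch> rules like the ones
below, which follow the pattern shown in the C<mod_deflate> documentation.
Treat them as a starting point and test them against the clients you actually
need to support:

  # Netscape 4.x: compress text/html only, so CSS and JavaScript stay uncompressed
  BrowserMatch ^Mozilla/4 gzip-only-text/html
  # Netscape 4.06-4.08: no compression at all
  BrowserMatch ^Mozilla/4\.0[678] no-gzip
  # MSIE also identifies itself as Mozilla/4, but handles gzip
  BrowserMatch \bMSIE !no-gzip !gzip-only-text/html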

=head1 Q: Where can I find more information about the compression features of 
modern browsers?

=head2 A: Michael Schroepl maintains a highly valuable site

See it at L<http://www.schroepl.net/projekte/mod_gzip/browser.htm>

=head1 Acknowledgments

During this work, I received a great deal of real help
from Kevin Kiley, Igor Sysoev, Michael Schroepl, and Henrik Nordstrom.
I'm thankful to all subscribers of the mod_perl users mailing list, the mod_gzip
mailing list, and the squid users mailing list
for their questions and discussions regarding content compression.
I'm especially thankful to Stas Bekman for the initiative to publish this FAQ
on the mod_perl Web site.
I highly value the patient efforts of Dan Hansen in making this text better
English.

=head1 Maintainers

The maintainer is the person you should contact with updates, corrections and 
patches.

=over

=item *

Slava Bizyayev E<lt>slava (at) cpan.orgE<gt>

=back

=head1 Authors

=over

=item *

Slava Bizyayev E<lt>slava (at) cpan.orgE<gt>

=back

Only the major authors are listed above. For contributors see the Changes file.

=cut
