On Tue, 2004-07-20 at 10:38 -0700, jdow wrote:
> From: "Kelson Vibber" <[EMAIL PROTECTED]>
> > At 12:46 AM 7/20/2004, Duncan Hill wrote:

> > HTTP (which RDJ currently uses) already has a mechanism for this.  The
> > client can send an If-Modified-Since header, or compare an Etag from its
> > cache, and the server will check, say "no, this file hasn't changed since
> > you last downloaded it," and issue a "304 Not Modified" response
> > consisting
> > of only an HTTP header and no data.  Just about every browser in existence
> > uses this method to determine that it can load a file from its cache.
> >
> > If RDJ is updated to handle this, as Chris(?) suggested, half the problem
> > is solved.  Even people downloading every two minutes are only grabbing a
> > few dozen bytes of headers instead of several K of data.
> 
> Out of idle addlepated curiosity, why is RDJ using HTTP instead of
> FTP with wget? The little script I put together to grab new rules
> simplifies matters by using '/usr/bin/wget -r -l 1 -nd -N "$source$file"'.
> I never fetch what I don't need. It works like a champ.

Just to make this very clear: RDJ doesn't retrieve a ruleset unless it
has changed.  

RDJ uses 'wget -N', which issues an HTTP HEAD request and examines
the "Last-Modified:" header.  Only if the local file's timestamp is
older than the Last-Modified value does wget retrieve the entire file
with an HTTP GET.  An HTTP HEAD request and response together amount
to perhaps a few hundred bytes; I have pasted an example capture at
the end of this email.
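
The invocation itself is nothing special; a minimal sketch (the URL
is the same one that appears in the capture below):

# Fetch only if the server's copy is newer than the local file; wget
# then stamps the local file with the server's Last-Modified time.
wget -N http://www.rulesemporium.com/rules/70_sare_specific.cf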

Long ago (in a galaxy far far away) I started writing a teeny shell
script to download rulesets automatically.  I chose 'wget' as the tool
to retrieve files via HTTP simply because I was relatively familiar with
it.  Wget is mostly sufficient for RDJ's purposes but there's no reason
curl or rsync couldn't be used instead.  I'll gladly accept patches if
anyone wants to hack that support in!  One caveat: I did some nasty
stuff to retrieve HTTP result codes from wget, and that would need to
be reproduced with curl.
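
For the curious, the "nasty stuff" amounts to scraping the status
code out of wget's server-response output; roughly like this (a
sketch, not the exact RDJ code):

url="http://www.rulesemporium.com/rules/70_sare_specific.cf"
# -S echoes the server's response headers on stderr; pull the code
# from the last HTTP status line seen (redirects print several).
status=$(wget -S -N "$url" 2>&1 \
         | awk '$1 ~ /^HTTP\// {code=$2} END {print code}')
echo "server returned: $status"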


'curl' pros:

-- Uses If-Modified-Since (essentially the same bandwidth usage on a
"cache hit" as wget, but slightly less when a file has been updated,
since wget makes two separate requests, HEAD then GET, where curl
needs only one; see the sketch below)
-- Should work well through caching proxies (an issue was reported
regarding wget's use of HTTP HEAD and stale cached responses)
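
If someone does add curl support, the core of it might look like this
(a sketch; the file and URL are placeholders, the flags are standard
curl):

file="70_sare_specific.cf"
url="http://www.rulesemporium.com/rules/$file"
tmp=$(mktemp)
# --time-cond sends If-Modified-Since based on $file's mtime;
# --write-out prints the HTTP status code on stdout; --remote-time
# stamps the download with the server's Last-Modified.
code=$(curl --silent --remote-time --time-cond "$file" \
            --output "$tmp" --write-out '%{http_code}' "$url")
if [ "$code" = "200" ]; then
    mv "$tmp" "$file"    # ruleset changed: one request, one download
else
    rm -f "$tmp"         # 304 (or error): keep the existing copy
fi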

'rsync' pros:
-- Significantly less bandwidth used when rulesets have been updated
(only the changed portions of a file are transferred)

'rsync' cons:
-- Heavier CPU usage on both client and server
-- Requires special software on the server (i.e. more than just a web
server); it would be difficult to require that of everyone hosting a
ruleset
-- Haven't checked this, but I have a feeling a "cache hit" will use
more bandwidth than an equivalent "cache hit" via wget or curl (the
rsync protocol handshake and file-list exchange alone likely outweigh
a single HTTP header exchange)
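
For completeness: if a ruleset server ran an rsync daemon (none does
today, as far as I know, and the "rules" module name below is made
up), the client side would stay nearly as simple:

# -t carries timestamps, so an unchanged file is skipped by rsync's
# size+mtime quick check; a changed file transfers only its deltas.
rsync -t rsync://www.rulesemporium.com/rules/70_sare_specific.cf .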



Now, here is the example "wget -N" packet capture:

HEAD /rules/70_sare_specific.cf HTTP/1.0
User-Agent: Wget/1.9.1
Host: www.rulesemporium.com
Accept: */*
Connection: Keep-Alive

HTTP/1.1 200 OK
Date: Tue, 20 Jul 2004 20:11:37 GMT
Server: Apache
Accept-Ranges: bytes
X-Powered-By: PHP/4.2.2
Last-Modified: Sat, 05 Jun 2004 03:31:53 GMT
Connection: close
Content-Type: text/plain; charset=ISO-8859-1
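
For comparison, the single-request If-Modified-Since exchange that
curl would use should look roughly like this (reconstructed by hand
from the capture above, not sniffed off the wire):

GET /rules/70_sare_specific.cf HTTP/1.0
If-Modified-Since: Sat, 05 Jun 2004 03:31:53 GMT
Host: www.rulesemporium.com
Accept: */*

HTTP/1.1 304 Not Modified
Date: Tue, 20 Jul 2004 20:11:37 GMT
Server: Apache
Connection: close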




-- 
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases
(0BFU$C/\TED SPA/\/\ P|-|RA$ES): http://www.sandgnat.com/cmos/

Keep up to date with the latest third party SpamAssassin Rulesets:
http://www.exit0.us/index.php/RulesDuJour
