On Tue, 2004-07-20 at 10:38 -0700, jdow wrote:
> From: "Kelson Vibber" <[EMAIL PROTECTED]>
>
> > At 12:46 AM 7/20/2004, Duncan Hill wrote:
> >
> > HTTP (which RDJ currently uses) already has a mechanism for this. The
> > client can send an If-Modified-Since header, or compare an Etag from its
> > cache, and the server will check, say "no, this file hasn't changed since
> > you last downloaded it," and issue a "304 Not Modified" response
> > consisting of only an HTTP header and no data. Just about every browser
> > in existence uses this method to determine that it can load a file from
> > its cache.
> >
> > If RDJ is updated to handle this, as Chris(?) suggested, half the problem
> > is solved. Even people downloading every two minutes are only grabbing a
> > few dozen bytes of headers instead of several K of data.
>
> Out of idle adlepated curiosity, why is RDJ using HTTP instead of
> FTP with wget? The little script I put together to grab new rules
> simplifies matters by using '/usr/bin/wget -r -l 1 -nd -N "$source$file"'.
> I never fetch what I don't need. It works like a champ.

Just to make this very clear: RDJ doesn't retrieve a ruleset unless it has
changed. RDJ uses 'wget -N', which issues an HTTP HEAD request and examines
the "Last-Modified:" header. Only if the timestamp on the local file is older
than the Last-Modified header does wget retrieve the entire file with an HTTP
GET. An HTTP HEAD request and response amount to perhaps a few hundred bytes;
I have pasted an example at the end of this email.

Long ago (in a galaxy far far away) I started writing a teeny shell script to
download rulesets automatically. I chose 'wget' as the tool to retrieve files
via HTTP simply because I was relatively familiar with it. Wget is mostly
sufficient for RDJ's purposes, but there's no reason curl or rsync couldn't
be used instead. I'll gladly accept patches if anyone wants to hack that
support in! I did some nasty stuff to retrieve HTTP result codes from wget
that would need to be reproduced using curl.

'curl' pros:
-- Uses If-Modified-Since (essentially the same bandwidth usage on a "cache
   hit" as wget, but slightly less when a file has been updated, since wget
   makes two separate requests)
-- Should work well through caching proxies (an issue was reported regarding
   wget's use of HTTP HEAD and cached responses)

'rsync' pros:
-- Significantly less bandwidth used when rulesets have been updated.

'rsync' cons:
-- Heavier CPU usage on both client and server
-- Requires special software (e.g. more than just a web server) installed on
   the server (it would be difficult to require this of every ruleset host)
-- Haven't checked this, but I have a feeling a "cache hit" will use more
   bandwidth than an equivalent "cache hit" via wget or curl.

Now, here is the example "wget -N" packet capture:

HEAD /rules/70_sare_specific.cf HTTP/1.0
User-Agent: Wget/1.9.1
Host: www.rulesemporium.com
Accept: */*
Connection: Keep-Alive

HTTP/1.1 200 OK
Date: Tue, 20 Jul 2004 20:11:37 GMT
Server: Apache
Accept-Ranges: bytes
X-Powered-By: PHP/4.2.2
Last-Modified: Sat, 05 Jun 2004 03:31:53 GMT
Connection: close
Content-Type: text/plain; charset=ISO-8859-1
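For anyone curious what the curl route might look like, here's a rough
sketch (this is NOT RDJ's actual code; the URL is just the one from the
capture above, and the filenames are only examples). curl's '-z' option
sends If-Modified-Since, and '-w' can print the HTTP result code, which
would stand in for the nasty wget result-code hack mentioned above:

  RULESET_URL="http://www.rulesemporium.com/rules/70_sare_specific.cf"
  LOCAL_FILE="70_sare_specific.cf"

  # Only send If-Modified-Since if we already have a local copy.
  TIME_COND=""
  [ -f "$LOCAL_FILE" ] && TIME_COND="-z $LOCAL_FILE"

  # -R keeps the server's Last-Modified time on the downloaded file;
  # -w '%{http_code}' prints the HTTP result code so we can tell
  # 200 (file changed, body downloaded) from 304 (not modified).
  HTTP_CODE=$(curl --silent -R $TIME_COND -o "$LOCAL_FILE.tmp" \
      -w '%{http_code}' "$RULESET_URL")

  case "$HTTP_CODE" in
      200) mv "$LOCAL_FILE.tmp" "$LOCAL_FILE" ;;  # ruleset changed
      304) rm -f "$LOCAL_FILE.tmp" ;;             # unchanged, discard
      *)   echo "unexpected HTTP code: $HTTP_CODE" >&2
           rm -f "$LOCAL_FILE.tmp" ;;
  esac

On a "cache hit" that is a single request/response pair, which is where the
small bandwidth win over wget's HEAD-then-GET comes from.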
--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases
(0BFU$C/\TED SPA/\/\ P|-|RA$ES): http://www.sandgnat.com/cmos/

Keep up to date with the latest third party SpamAssassin Rulesets:
http://www.exit0.us/index.php/RulesDuJour
