Dan Jacobson <[EMAIL PROTECTED]> writes:

> To see what files have been fetched repeatedly recently, try
> cd /var/cache/wwwoffle && find *time* -name U\*|xargs sed :|sort|uniq -c|sort -nr
> In particular, some sites you visit often might have broken graphics
> that get refetched each visit, even though unchanged, or something.

I have taken the opportunity to examine the command that you propose
and have extended it.  It now also prints the name of the file that
contains the cached data for the URL.  This is useful for the second
script, which examines all of the cached versions of that file and
prints each header with a count of how often it occurs.

The point of looking at the cached headers is to see why the file was
fetched more than once.  The sort of things that might be seen are:

Different 'Expires' headers for recent dates
        WWWOFFLE is re-fetching the page because the headers say that
        the page has expired.  You can stop WWWOFFLE doing this by
        setting 'request-expired = no' in the OnlineOptions section of
        the wwwoffle.conf file.

A 'Cache-Control: no-cache' or 'Pragma: no-cache' header
        WWWOFFLE is re-fetching the page because it contains a header
        that says that it is not to be cached.  You can stop WWWOFFLE
        doing this by setting 'request-no-cache = no' in the
        OnlineOptions section of the wwwoffle.conf file.

A 'Cache-Control: max-age=...' header and a 'Date' header
        The specified age in seconds needs to be added to the date;
        this sets an expiry time for the page, which is treated the
        same way as an 'Expires' header.

Different 'ETag' headers
        WWWOFFLE is re-fetching the page because when it asks the
        server if the page has changed (a different ETag) the server
        says that it has (even though it may not have changed).
        Disabling this will become an option in the next version of
        WWWOFFLE.
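
As a rough illustration of the 'max-age' case, the expiry time can be
worked out with a few lines of shell.  This is only a sketch: it
assumes GNU date (for the -d option), and the function name
expiry_from_max_age is made up for this example; it is not part of
WWWOFFLE.

```shell
#!/bin/sh
# Sketch only: assumes GNU date (the -d option is not POSIX).
# expiry_from_max_age is a hypothetical helper, not part of WWWOFFLE.
expiry_from_max_age() {
    # $1 = value of the 'Date' header, $2 = max-age in seconds
    secs=$(LC_ALL=C date -u -d "$1" +%s) || return 1
    # Add the age to the date and print it in HTTP date format.
    LC_ALL=C date -u -d "@$((secs + $2))" '+%a, %d %b %Y %H:%M:%S GMT'
}

expiry_from_max_age 'Fri, 27 Jun 2003 05:16:13 GMT' 604800
```

With a 'Date' of Fri, 27 Jun 2003 05:16:13 GMT and a max-age of 604800
seconds (7 days) this prints Fri, 04 Jul 2003 05:16:13 GMT.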


The scripts are below along with examples of their use.

-------------------- find-duplicate-urls.sh --------------------
#!/bin/sh

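cd /var/spool/wwwoffle || exit 1

# Print "path URL" for each U* file (printf rather than the
# non-portable "echo -n"; the path is passed as an argument to sh).
find *time* -name 'U*' -exec sh -c 'printf "%s " "$1"; cat "$1"; echo' sh {} \; |\
# Turn "dir/Uxxx" into "Dxxx"; anchored so that the URL is left alone.
sed 's%^[a-z0-9]*/U%D%' |\
sort -k 2 |\
uniq -c -f 1 |\
sort -n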
-------------------- find-duplicate-urls.sh --------------------

For example:

$ ./find-duplicate-urls.sh
...
      2 DikktDk-E6-RJzjBO0gLYTw http://images.slashdot.org/topics/topicgamesportable.gif
...


-------------------- find-duplicate-headers.sh --------------------
#!/bin/sh

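if [ -z "$1" ]; then
   echo "Specify a filename" >&2
   exit 1
fi

cd /var/spool/wwwoffle || exit 1

# Print the headers (everything before the first blank line) of each
# cached version of the file, then count how often each one occurs.
echo *time*/"$1" |\
xargs -n 1 awk '/^\r$/ {finished=1} {if(!finished) print}' |\
sort |\
uniq -c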
-------------------- find-duplicate-headers.sh --------------------

For example:

$ ./find-duplicate-headers.sh DikktDk-E6-RJzjBO0gLYTw
      2 Accept-Ranges: bytes
      2 Cache-Control: max-age=604800
      2 Connection: close
      2 Content-Length: 1278
      2 Content-Type: image/gif
      1 Date: Fri, 27 Jun 2003 05:16:13 GMT
      1 Date: Sat, 28 Jun 2003 05:22:10 GMT
      1 ETag: "1b408f-4fe-b7d3a080"
      1 ETag: "2de1c-4fe-b7d3a080"
      1 Expires: Fri, 04 Jul 2003 05:16:13 GMT
      1 Expires: Sat, 05 Jul 2003 05:22:10 GMT
      2 HTTP/1.1 200 OK
      2 Last-Modified: Fri, 27 Jun 2003 03:41:38 GMT
      2 Server: Apache/2.0.46 (Unix) mod_ssl/2.0.46 OpenSSL/0.9.6c

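This has probably been refetched because of the ETag header: the two
versions have different ETag values even though the Last-Modified and
Content-Length headers are the same.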
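
When there are many headers it can help to hide the ones that are the
same in every version.  The following is a sketch of a filter for the
output of find-duplicate-headers.sh.  The function name
differing_headers is invented for this example, and it assumes that
the highest count (normally that of the status line) equals the number
of cached versions.

```shell
#!/bin/sh
# Sketch only: differing_headers is a hypothetical helper.
# It reads "uniq -c" output on stdin and prints the lines whose count
# is lower than the highest count seen, i.e. the headers that are not
# identical in every cached version.
differing_headers() {
    awk '{ count[NR] = $1 + 0; line[NR] = $0
           if (count[NR] > max) max = count[NR] }
         END { for (i = 1; i <= NR; i++) { if (count[i] < max) print line[i] } }'
}

# A few lines taken from the example output above.
differing_headers <<'EOF'
      2 Content-Type: image/gif
      1 ETag: "1b408f-4fe-b7d3a080"
      1 ETag: "2de1c-4fe-b7d3a080"
      2 HTTP/1.1 200 OK
EOF
```

Used as "./find-duplicate-headers.sh DikktDk-E6-RJzjBO0gLYTw | differing_headers"
on the example above, this would leave only the Date, ETag and Expires
lines.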

-- 
Andrew.
----------------------------------------------------------------------
Andrew M. Bishop                             [EMAIL PROTECTED]
                                      http://www.gedanken.demon.co.uk/

WWWOFFLE users page:
        http://www.gedanken.demon.co.uk/wwwoffle/version-2.7/user.html
