> check if the URLs no longer exist you would need to check the HTTP
> status value.

Am I right to expect that there is no cache entry (no new file or
modification) for a 404, not even if the domain is already cached?
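
For what it's worth, a minimal sketch of what that status check could
look like, assuming the URL list sits in a plain text file ("urls.txt",
one URL per line -- the name is just a placeholder, and this goes
directly to the origin servers, not through the proxy):

    #!/usr/bin/env python3
    # Minimal sketch: print the HTTP status for every URL in a list,
    # so 404s (pages that no longer exist) can be spotted before
    # deciding what to keep.  "urls.txt" is a placeholder file name.
    import urllib.request
    import urllib.error

    with open("urls.txt") as f:
        for url in (line.strip() for line in f if line.strip()):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    status = resp.status
            except urllib.error.HTTPError as e:
                status = e.code        # e.g. 404 for vanished pages
            except OSError:
                status = "error"       # DNS failure, timeout, ...
            print(status, url)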


> The problem that you would get is that lots of pages would have
> changed and you need to get new images and things for them.  You would
> end up with lots more in the cache than you had before and no way of
> knowing what had changed and what has stayed the same.

My first thought is to feed this list of URLs (only the HTTP pages,
not the sub-content URLs) to a wwwoffle instance building a cache from
scratch; that is, if it worked, I could delete the old cache anyway.
Assuming I exclude some stuff (which I would copy over literally),
we are talking about less than 2 GB, which should be affordable
in one night, over some hours.
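
(Back of the envelope: 2 GB over roughly eight hours is about 70 KB/s
sustained, so one night looks plausible.)  As a rough, untested sketch
of that idea -- the proxy address localhost:8080 and the file name
"urls.txt" are my own assumptions, adjust to your configuration -- one
could replay the page list through a freshly set-up wwwoffle running
online; sub-content would still have to come from browsing or from
wwwoffle's own fetch mechanism:

    #!/usr/bin/env python3
    # Rough sketch: replay a list of top-level pages through a fresh
    # wwwoffle instance so it builds a new cache that contains only the
    # pages which still exist.  Assumptions: "urls.txt" holds one URL
    # per line, wwwoffle runs online as an HTTP proxy on localhost:8080.
    import time
    import urllib.request
    import urllib.error

    proxy = urllib.request.ProxyHandler({"http": "http://localhost:8080"})
    opener = urllib.request.build_opener(proxy)

    ok = gone = failed = 0
    with open("urls.txt") as f:
        for url in (line.strip() for line in f if line.strip()):
            try:
                with opener.open(url, timeout=60) as resp:
                    resp.read()        # pull the body so wwwoffle stores it
                ok += 1
            except urllib.error.HTTPError as e:
                if e.code == 404:
                    gone += 1          # page no longer exists
                else:
                    failed += 1
            except OSError:
                failed += 1
            time.sleep(1)              # be gentle with the remote servers
    print(f"cached: {ok}, gone (404): {gone}, other failures: {failed}")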

To me the problem seems to be more that I would end up with a lot of
main pages whose related links are lost (because they don't exist any
more), so the cache would be much smaller. But that's the goal anyway.


> It would depend a lot on the type of content that the web page has,
> for pages that change a lot in layout or content it might not be much
> use. 

Those are better dealt with by short expiry times, IMHO.

> for a local cache of wikipedia pages (for example) it might work

This is more the kind of content I'm thinking of.



   °
 /\/
