> check if the URLs no longer exist you would need to check the HTTP
> status value.
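(A check like that should be simple enough to script. A rough Python sketch of what I have in mind, where urls.txt is just a placeholder for the exported URL list and the HEAD requests are my own assumption:)

#!/usr/bin/env python3
# Rough sketch: HEAD each URL from a plain list (one per line) and
# print its HTTP status, so dead pages (404 and friends) can be
# filtered out of the list before re-feeding it.
import urllib.request
import urllib.error

def status_of(url):
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code        # 404, 410, ...
    except urllib.error.URLError:
        return None          # DNS failure, connection refused, ...

with open("urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            print(status_of(url), url)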
Am I right to expect that there is no cache entry (new file or
modification) for a 404, not even if the domain is already cached?

> The problem that you would get is that lots of pages would have
> changed and you need to get new images and things for them. You would
> end up with lots more in the cache than you had before and no way of
> knowing what had changed and what has stayed the same.

My first thought is to feed this list of URLs (only the HTTP pages, not
the sub-content URLs) to a wwwoffle building a cache from scratch; that
is, if it worked, I could delete the old cache anyway. Assuming I
exclude some stuff (which I would copy over literally), we are talking
about less than 2 GB, which should be manageable in one night, over a
few hours.

To me the problem seems rather to be that I would end up with a lot of
main pages whose related links are lost (because they don't exist any
more), so the cache would be much smaller. But that's the goal anyway.

> It would depend a lot on the type of content that the web page has,
> for pages that change a lot in layout or content it might not be much
> use.

Those are better dealt with by short expiry times, IMHO.

> for a local cache of wikipedia pages (for example) it might work

This is more the kind of stuff I'm thinking of.

° /\/
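P.S. To make the "rebuild from scratch" idea concrete, here is a rough
Python sketch of what I mean by feeding the list back in: pull each page
once through the new wwwoffle so it lands in the fresh cache. I am
assuming wwwoffle is in online mode and listening as an HTTP proxy on
localhost:8080 (adjust to whatever wwwoffle.conf says), plus the same
placeholder urls.txt as above.

#!/usr/bin/env python3
# Rough sketch: request every page from the old URL list once through
# the wwwoffle proxy, so each page that still exists ends up in the
# freshly built cache; pages that are gone just report 404 and are
# skipped.
import urllib.request
import urllib.error

# Assumed wwwoffle proxy address; change if your wwwoffle.conf differs.
PROXY = {"http": "http://localhost:8080"}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))

with open("urls.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            with opener.open(url, timeout=60) as resp:
                print(resp.status, url)   # 200 -> now in the new cache
        except urllib.error.HTTPError as e:
            print(e.code, url)            # 404 etc.: page is gone
        except urllib.error.URLError as e:
            print("ERR", url, e.reason)

This only pulls the pages themselves, not the images and other
sub-content, which matches the idea of feeding only the page URLs and
letting wwwoffle pick up the rest as the pages are actually used.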
