Has anyone looked at caching content for the case where the next hotsync build fails to retrieve a page? What happens now, as far as I can tell, is that if the page can't be fetched, no pluckerdoc is created and the viewer reports that the link was not downloaded. For collecting our internal web site stuff this isn't ideal, and I'd rather pull the last successfully retrieved copy from a cache if possible. It could carry a tag at the top saying it came from cache (a la Google), but that's not necessary for me.

I think this could be done in the spider::parse code, where the tuple is returned from the retriever. Currently, if an error comes back we just skip to the next link; we could add a fallback there that pulls the page from a cache. I don't think it can be the cache pages saved during the write, since those are already parsed up and numbered (I think). Maybe we could instead build an XML doc with the URL and content after each page is successfully retrieved, save that to disk, and read it back in when needed.

I know there are problems with this that need to be worked out, like purging content from the cache after so many days or when it's no longer needed. Any ideas?
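To make the idea concrete, here's a rough sketch of what the cache side could look like. None of this is existing Plucker code: the function names, the cache directory, and the hashed-filename scheme (used here instead of the XML wrapper I mentioned, just to keep the sketch short) are all made up for illustration.

import hashlib
import os
import time

CACHE_DIR = os.path.expanduser("~/.plucker/retrieval_cache")  # hypothetical location
MAX_AGE_DAYS = 14  # purge entries older than this

def _cache_path(url):
    # Map a URL to a file name inside the cache directory.
    return os.path.join(CACHE_DIR, hashlib.md5(url.encode("utf-8")).hexdigest())

def save_to_cache(url, content):
    # Called right after a successful retrieval, before parsing.
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    with open(_cache_path(url), "wb") as f:
        f.write(content)

def load_from_cache(url):
    # Return the last good copy of the page, or None if absent or stale.
    path = _cache_path(url)
    if not os.path.exists(path):
        return None
    age_days = (time.time() - os.path.getmtime(path)) / 86400.0
    if age_days > MAX_AGE_DAYS:
        os.remove(path)  # simple purge-on-read
        return None
    with open(path, "rb") as f:
        return f.read()

The hookup in spider::parse would then be: on the success branch call save_to_cache(url, content) before parsing, and on the error branch try load_from_cache(url) and, if it returns something, feed that content through the normal parse path (possibly prepending a "served from cache" note). A sweep at the end of a run could delete anything older than MAX_AGE_DAYS to cover pages that are no longer linked at all.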
