Has anyone looked at caching content in case the next hotsync build fails
to get a page?  What happens now, as far as I can tell, is that if the page
can't be retrieved, no pluckerdoc is created and the viewer reports that
the link was not downloaded.  For collecting our internal web site content
this is not ideal; I'd rather pull the last successfully retrieved copy of
the page from a cache if possible.  It could carry a tag at the top saying
it came from the cache (a la Google), but that isn't necessary for me.
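To make the idea concrete, here is a rough, untested sketch of what the
fallback might look like at the point where the retriever's error comes
back.  The names (retrieve, url, cache, error, headers, content) are just
placeholders, not the actual spider.py variables; "cache" is assumed to be
some object with fetch()/store() methods, like the one sketched after the
next paragraph:

    # Placeholder sketch -- not the real spider.py error-handling code.
    error, headers, content = retrieve(url)
    if error:
        cached = cache.fetch(url)       # last good copy, if we have one
        if cached is not None:
            # Optional "from cache" tag so the reader knows it is stale.
            banner = (b'<p><i>[Cached copy; the live page could not be '
                      b'retrieved on this sync.]</i></p>')
            content = banner + cached
            error = None                # carry on and parse as usual
        # else: no cached copy either, so skip the link as we do today
    else:
        cache.store(url, content)       # remember the last good copy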

I think this can be accomplished in the spider::parse code, at the point
where the tuple is returned from the retriever.  Currently, if an error is
returned, we skip to the next link; we could add something there to pull
the page from the cache instead.  I don't think we can reuse the cached
pages saved during the write, since those are already parsed and numbered
(I think).  Maybe we could build an XML doc with the URL and content after
each page is successfully retrieved, save it to disk, and read it back in
when needed?  I know there are some problems with this that need to be
worked out, like purging content from the cache after so many days or when
it's no longer needed.
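For what it's worth, here is the kind of on-disk cache I have in mind.
It's only a sketch and nothing Plucker-specific: whether the format should
really be XML is left open; this version just keeps one file per URL,
named by an MD5 of the URL, and purges by file age:

    import hashlib
    import os
    import time

    class PageCache:
        """One file per URL, named by the MD5 of the URL; the file's
        modification time drives the age-based purge."""

        def __init__(self, cache_dir, max_age_days=14):
            self.cache_dir = cache_dir
            self.max_age = max_age_days * 24 * 3600
            if not os.path.isdir(cache_dir):
                os.makedirs(cache_dir)

        def _path(self, url):
            digest = hashlib.md5(url.encode('utf-8')).hexdigest()
            return os.path.join(self.cache_dir, digest)

        def store(self, url, content):
            # Call after every successful retrieval (content as bytes).
            with open(self._path(url), 'wb') as f:
                f.write(content)

        def fetch(self, url):
            # Return the last successfully retrieved copy, or None.
            path = self._path(url)
            if not os.path.exists(path):
                return None
            with open(path, 'rb') as f:
                return f.read()

        def purge(self):
            # Drop anything older than max_age_days.
            cutoff = time.time() - self.max_age
            for name in os.listdir(self.cache_dir):
                path = os.path.join(self.cache_dir, name)
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)

Purging content "when it's not needed anymore" is the harder part, since
the cache by itself doesn't know which URLs are still reachable from the
home document; age-based expiry seems like a reasonable first cut.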

Any ideas?

