Several of the sites I'm crawling and indexing have fairly transient
content (a lifespan of a few weeks). When that content expires, they
return a 301 permanent redirect rather than a 404. The redirect just
points to a generic "content no longer here" page, which is more
helpful to normal web users. That is not ideal for crawling, and it is
not within my control.
What tactics and strategies can help mitigate this scenario?
In particular:
1) Removing these URLs from the crawl DB (as would happen if they were
404s and db.update.purge.404 = true).
2) Removing them from the Solr index I'm indexing into.
I'm leaning towards writing an additional maintenance script that
queries the crawldb for URLs with db_redir_perm status on given hosts
and removes them from Solr. I just fear it may be overzealous and
remove content from the index in cases of a legitimate redirect...
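To make the idea concrete, here is a rough sketch of the kind of script
I have in mind. It is only an outline under several assumptions that
would need checking against the Nutch version in use: that the crawldb
has been dumped to text (e.g. via bin/nutch readdb <crawldb> -dump
<dir>), that each record starts with "<url><TAB>Version: N" followed by
a "Status: N (name)" line, that the redirect target appears in a _pst_
metadata entry, and that the Solr document id is the page URL. The
GONE_PAGES table and all function names are hypothetical. To limit the
overzealous-removal risk, it only selects URLs whose redirect target is
a known "content no longer here" page for that host.

```python
import json
import re
from urllib.parse import urlsplit

# Hypothetical table: source host -> redirect targets that mean "gone".
GONE_PAGES = {
    "www.example.com": {"http://www.example.com/content-removed"},
}

# Assumed layout of a text crawldb dump; verify against your own dump.
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+$")
STATUS = re.compile(r"^Status: \d+ \((\w+)\)$")
# Assumption: the redirect target is recorded in the _pst_ metadata entry.
MOVED_TO = re.compile(r"_pst_=moved\(\d+\).*?: (\S+)")


def parse_dump(text):
    """Yield (url, status_name, redirect_target_or_None) for each record."""
    url, status, target = None, None, None
    for line in text.splitlines():
        start = RECORD_START.match(line)
        if start:
            if url is not None:
                yield url, status, target
            url, status, target = start.group(1), None, None
            continue
        found = STATUS.match(line)
        if found:
            status = found.group(1)
            continue
        found = MOVED_TO.search(line)
        if found:
            target = found.group(1)
    if url is not None:
        yield url, status, target


def urls_to_purge(records, gone_pages=GONE_PAGES):
    """Keep only permanent redirects that land on a known 'gone' page."""
    purge = []
    for url, status, target in records:
        if status != "db_redir_perm" or not target:
            continue
        # Exact match on the target; legitimate redirects are left alone.
        if target in gone_pages.get(urlsplit(url).hostname, ()):
            purge.append(url)
    return purge


def solr_delete_payload(urls):
    """Build a JSON delete-by-id body for Solr's /update handler."""
    return json.dumps({"delete": urls})
```

The payload from solr_delete_payload() would then be POSTed to the
core's /update handler, followed by a commit. Redirects to any page not
listed in GONE_PAGES are left untouched, so a site that legitimately
moves content should keep its entries in the index.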
Thanks!
--
Arthur Yarwood