A number of the sites I'm crawling and indexing have fairly transient content (a lifespan of a few weeks). When that content expires, they report a 301 permanent redirect rather than a 404. The redirect just goes to a generic "content no longer here" page, to be more helpful to normal web users. That is not ideal for my purposes, and it is not within my control.

What tactics and strategies can help mitigate this scenario?
In particular:
1) Removing these URLs from the crawl DB (as they would be if they were 404s and db.update.purge.404 = true).
2) Removing these from the Solr index I'm indexing into.

I'm leaning towards writing an additional maintenance script that queries the crawldb for URLs with db_redir_perm status on given hosts and removes them from Solr. I just fear it may be overzealous in removing content from the index in cases of a legitimate redirect...
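
To make the idea concrete, here is a rough Python sketch of such a script. It rests on several assumptions that would need checking: that a text dump of the crawldb (e.g. from readdb -dump) starts each record with "<url><TAB>Version: N" followed by a "Status: N (db_xxx)" line, that the redirect target shows up in the record's metadata after "_pst_=moved(...)", and that the Solr document id is the page URL. The names parse_dump, stale_urls and solr_delete_payload are made up for illustration. To limit the overzealous-removal risk, it only selects URLs whose redirect target is one of the known "content gone" pages.

```python
import json
import re
from urllib.parse import urlsplit

# Assumed record layout of the crawldb text dump; verify against a real dump.
RECORD_START = re.compile(r"^(\S+)\tVersion: \d+")
STATUS_LINE = re.compile(r"^Status: \d+ \((\w+)\)")
# Assumed: the redirect target follows "_pst_=moved(N), ...: " in the metadata.
MOVED_LINE = re.compile(r"_pst_=moved\(\d+\).*?: (\S+)")


def parse_dump(lines):
    """Yield (url, status, redirect_target_or_None) for each dump record."""
    url = status = target = None
    for line in lines:
        start = RECORD_START.match(line)
        if start:
            if url is not None:
                yield url, status, target
            url, status, target = start.group(1), None, None
            continue
        status_match = STATUS_LINE.match(line)
        if status_match:
            status = status_match.group(1)
            continue
        moved_match = MOVED_LINE.search(line)
        if moved_match:
            target = moved_match.group(1)
    if url is not None:
        yield url, status, target


def stale_urls(records, hosts, gone_pages):
    """Keep permanent redirects on the given hosts that land on a known
    'content no longer here' page; other redirects are left alone."""
    return [
        url
        for url, status, target in records
        if status == "db_redir_perm"
        and urlsplit(url).hostname in hosts
        and target in gone_pages
    ]


def solr_delete_payload(urls):
    """Build a Solr JSON delete-by-id body (assumes id == URL)."""
    return json.dumps({"delete": urls})
```

The resulting payload could then be POSTed as application/json to the Solr core's update handler and followed by a commit. Redirects whose target is not in the gone_pages set stay in the index, so a legitimate move is not purged.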

Thanks!

--
Arthur Yarwood
