I'm trying to figure out the best way to do a few things with Droids and Solr. Perhaps I've missed the functionality in the API, as I'm just at the planning stage. Any ideas or best practices would be appreciated.

1) We currently use Nutch for everything: crawling, indexing, and searching. It has the notion of searching against the anchor text of incoming links. Is this useful? How would one do this with Droids? I was thinking of keeping a database of outgoing links and, at the end of a crawl, reversing the direction and updating the records that have new links pointing at them. Any other ideas?
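
To make the idea concrete, here is a rough sketch of the "reverse the direction" step as I imagine it. It is plain Java and uses nothing from Droids or Solr; the InlinkIndex and Outlink names are made up for illustration, and a real version would read the links from a database rather than a list in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Droids API.
public class InlinkIndex {

    /** One outgoing link recorded while crawling a page. */
    public static final class Outlink {
        final String fromUrl;
        final String toUrl;
        final String anchorText;

        public Outlink(String fromUrl, String toUrl, String anchorText) {
            this.fromUrl = fromUrl;
            this.toUrl = toUrl;
            this.anchorText = anchorText;
        }
    }

    /**
     * Reverses the recorded outlinks so that each target URL maps to the
     * anchor texts of the links pointing at it, in the order they were seen.
     * Links with empty anchor text are skipped.
     */
    public static Map<String, List<String>> invert(List<Outlink> outlinks) {
        Map<String, List<String>> anchorsByTarget = new HashMap<>();
        for (Outlink link : outlinks) {
            if (link.anchorText == null || link.anchorText.trim().isEmpty()) {
                continue;
            }
            anchorsByTarget
                .computeIfAbsent(link.toUrl, k -> new ArrayList<>())
                .add(link.anchorText.trim());
        }
        return anchorsByTarget;
    }
}
```

The resulting list for each URL could then be written to a separate field on that URL's document, so that anchor text can be searched or boosted independently of the page body.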

2) How best to do an incremental crawl? I'm going to want to send conditional requests (the HTTP If-Modified-Since header) as I crawl, so that unchanged pages are not fetched and indexed again. Is there a way to do this in Droids, and if so, what is the best approach? Furthermore, I'm going to want to get the HTTP status code to determine whether a record should be deleted from the index.
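
For reference, this is roughly the behaviour I am after, sketched with the JDK's HttpURLConnection rather than anything from Droids. The IncrementalFetch class, the Action enum, and the mapping of status codes to actions are my own assumptions about what would be reasonable, not an existing API.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of the Droids API.
public class IncrementalFetch {

    public enum Action { REINDEX, SKIP, DELETE, RETRY_LATER }

    /** Maps an HTTP status code to what should happen to the indexed record. */
    public static Action classify(int status) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {   // 304: unchanged
            return Action.SKIP;
        }
        if (status == HttpURLConnection.HTTP_NOT_FOUND         // 404
                || status == HttpURLConnection.HTTP_GONE) {    // 410
            return Action.DELETE;
        }
        if (status >= 200 && status < 300) {                   // fetched fresh content
            return Action.REINDEX;
        }
        return Action.RETRY_LATER;                             // e.g. 5xx: keep the record
    }

    /** Issues a conditional GET using the time of the previous crawl. */
    public static Action check(String url, long lastCrawlMillis) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setIfModifiedSince(lastCrawlMillis); // sends If-Modified-Since
            return classify(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```

Treating server errors as "retry later" rather than "delete" is deliberate in this sketch, so that a temporary outage does not empty the index.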

3) Updating the regex block rules. In the long term I want to exclude different paths from my crawl dynamically. If I see a certain meta tag in a page's content, I don't want to go any further down that path. Is there a way to add exclusion rules while the crawl is running?
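
As an illustration of what I mean, here is a minimal sketch of a filter whose block rules grow during the crawl. Everything in it is hypothetical: the DynamicExcludeFilter class is not a Droids class, the "crawl-block" meta tag name is a placeholder for whatever tag we end up using, and "that path" is interpreted as the directory containing the page.

```java
import java.net.URI;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Droids API.
public class DynamicExcludeFilter {

    // Placeholder meta tag name, assumed for this example only.
    private static final Pattern BLOCK_META = Pattern.compile(
        "<meta\\s+name=[\"']crawl-block[\"']", Pattern.CASE_INSENSITIVE);

    // Thread-safe so rules can be added while other workers are filtering.
    private final List<Pattern> blocked = new CopyOnWriteArrayList<>();

    /** Adds a block rule for the page's directory if the page carries the meta tag. */
    public void inspect(String pageUrl, String html) {
        if (!BLOCK_META.matcher(html).find()) {
            return;
        }
        URI uri = URI.create(pageUrl);
        String path = uri.getRawPath();
        // Directory part of the path; a URL without a path blocks the whole host.
        String dir = (path == null || path.isEmpty())
            ? "/"
            : path.substring(0, path.lastIndexOf('/') + 1);
        String prefix = uri.getScheme() + "://" + uri.getRawAuthority() + dir;
        blocked.add(Pattern.compile(Pattern.quote(prefix)));
    }

    /** Returns false if the URL starts with any blocked prefix. */
    public boolean accept(String url) {
        for (Pattern p : blocked) {
            if (p.matcher(url).lookingAt()) {
                return false;
            }
        }
        return true;
    }
}
```

The crawler would call inspect() on each fetched page and accept() on each extracted link before queuing it. Links that were queued before the rule was added would still need to be checked again when they are taken off the queue.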

