I'm trying to figure out the best way to do a few things with Droids and
Solr. Perhaps I've missed the functionality in the API, as I'm just at
the planning stage. Any ideas or best practices would be appreciated.
1) We currently use Nutch for everything: crawling, indexing, and
searching. It has the notion of searching against the text of anchor
tags of incoming links. Is this useful? How would one do this with
Droids? I was thinking of keeping a db of outgoing links and, at the end
of a crawl, reversing the direction and updating the records that have
new links pointing at them. Any other ideas?
2) How best to do an incremental crawl? I'm going to want to do
If-Modified-Since checks as I crawl. Is there a way to do that in
Droids, and any ideas on how best to approach it?
Furthermore, I'm going to want to get the HTTP status code to determine
if a record should be deleted.
3) Updating the Regex block rules. I want to dynamically exclude
different paths from my crawl in the long term. If I see a certain meta
tag in page content, I don't want to go any further down that path. Any
way to do this?
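
To make 1) concrete, here is a rough sketch of the "reverse the direction" step I have in mind. It is plain Java, not Droids API: the Outlink class and the invert method are hypothetical names, and a real version would read from whatever link db gets written during the crawl rather than from an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InlinkAnchors {
    /** One outgoing link recorded during the crawl (hypothetical record layout). */
    static final class Outlink {
        final String from;
        final String to;
        final String anchor;

        Outlink(String from, String to, String anchor) {
            this.from = from;
            this.to = to;
            this.anchor = anchor;
        }
    }

    /** Reverse the link direction: target URL -> anchor texts of the links pointing at it. */
    static Map<String, List<String>> invert(List<Outlink> outlinks) {
        Map<String, List<String>> inbound = new TreeMap<>();
        for (Outlink link : outlinks) {
            // Links without visible anchor text add nothing to search against.
            if (link.anchor == null || link.anchor.trim().isEmpty()) {
                continue;
            }
            inbound.computeIfAbsent(link.to, k -> new ArrayList<>()).add(link.anchor.trim());
        }
        return inbound;
    }

    public static void main(String[] args) {
        List<Outlink> crawl = List.of(
                new Outlink("http://example.org/a", "http://example.org/c", "C docs"),
                new Outlink("http://example.org/b", "http://example.org/c", " reference "),
                new Outlink("http://example.org/b", "http://example.org/a", ""));
        // Each entry would become an extra "inbound anchors" field on the target's index record.
        System.out.println(invert(crawl));
    }
}
```

The resulting anchor lists would then be pushed as an update to the records of the pages being linked to.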
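
For 2), this is roughly the check I mean, written against the JDK's HttpURLConnection rather than anything in Droids. The Action enum, the decide and check methods, and the mapping of status codes to actions are my own assumptions; in particular, treating only 404 and 410 as "delete" is a guess at a sensible policy.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class IncrementalCheck {
    /** What to do with the index record for a URL (hypothetical policy). */
    enum Action { REINDEX, SKIP, DELETE, RETRY_LATER }

    /** Map an HTTP status code to an action on the index record. */
    static Action decide(int status) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return Action.SKIP; // 304: unchanged since the last crawl
        }
        if (status == HttpURLConnection.HTTP_NOT_FOUND || status == HttpURLConnection.HTTP_GONE) {
            return Action.DELETE; // 404 or 410: the page no longer exists
        }
        if (status >= 200 && status < 300) {
            return Action.REINDEX; // changed or new: fetch and index again
        }
        return Action.RETRY_LATER; // server errors and anything else: keep the record for now
    }

    /** Send a HEAD request carrying If-Modified-Since set to the time of the last crawl. */
    static Action check(String url, long lastCrawlMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setIfModifiedSince(lastCrawlMillis);
            return decide(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Offline demonstration of the policy; check(...) needs a reachable server.
        for (int status : new int[] {200, 304, 404, 410, 503}) {
            System.out.println(status + " -> " + decide(status));
        }
    }
}
```

A DELETE result would translate into a delete request for that document in Solr, and REINDEX into a normal fetch followed by an update.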
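
For 3), a minimal sketch of block rules that grow during the crawl. Again this is plain Java and not the Droids filter API: the class, its methods, and the "crawl-stop" meta tag name are all made up for illustration, and matching the tag with a regex is a shortcut where a real crawler would use its HTML parser.

```java
import java.net.URI;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.regex.Pattern;

public class DynamicBlockRules {
    // Hypothetical marker tag; substitute whatever meta tag actually signals "stop here".
    private static final Pattern STOP_META = Pattern.compile(
            "<meta\\s+name=[\"']crawl-stop[\"'][^>]*>", Pattern.CASE_INSENSITIVE);

    // Copy-on-write list so crawler threads can read the rules while new ones are added.
    private final List<Pattern> blocked = new CopyOnWriteArrayList<>();

    /** True if no block rule matches the start of the URL. */
    boolean isAllowed(String url) {
        for (Pattern rule : blocked) {
            if (rule.matcher(url).lookingAt()) {
                return false;
            }
        }
        return true;
    }

    /** If the page carries the marker tag, block everything under the page's directory. */
    void inspect(String url, String html) {
        if (!STOP_META.matcher(html).find()) {
            return;
        }
        URI uri = URI.create(url);
        String path = uri.getPath() == null ? "" : uri.getPath();
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        if (dir.isEmpty()) {
            dir = "/"; // URL without a path: the rule covers the whole host
        }
        String prefix = uri.getScheme() + "://" + uri.getAuthority() + dir;
        blocked.add(Pattern.compile(Pattern.quote(prefix)));
    }

    public static void main(String[] args) {
        DynamicBlockRules rules = new DynamicBlockRules();
        rules.inspect("http://example.org/docs/private/index.html",
                "<html><head><meta name=\"crawl-stop\" content=\"1\"></head></html>");
        System.out.println(rules.isAllowed("http://example.org/docs/private/a.html"));
        System.out.println(rules.isAllowed("http://example.org/docs/public/a.html"));
    }
}
```

The idea is that the link filter consults isAllowed before queuing a URL, and the page handler calls inspect on every fetched page, so a rule takes effect for all links discovered afterwards.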
- Couple of Droids questions Richard Frovarp