I'm finally able to spend a little time messing with Droids -- I'm
comparing it to Aperture (http://aperture.sourceforge.net/) to figure
out what the best path is.
What Aperture does nicely is let you define a crawler for
anything: web, file system, iCal, IMAP, etc., then index from
that. The problem is that RDF is deeply baked into the system and I
don't see any *good* ways to extend it / scale it.
Droids looks promising, but like Nutch, it seems to assume web/text
crawling.
With Droids, how would you make a file system crawler?
Extend Protocol with file://?
Currently Parser->Parse->ParseData->Outlink[] defines the next items
to crawl. For non-web crawling, what is the proposed model?
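
To make the question concrete, here is roughly what I have in mind for the
file system case: a directory's "outlinks" are simply its children. This is
only a sketch using java.nio; FileSystemCrawlSketch, outlinks() and crawl()
are names I made up, and none of it is existing Droids API.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not part of Droids.
public class FileSystemCrawlSketch {

    // For a directory, the "outlinks" are its children; a plain file has none.
    public static List<URI> outlinks(URI uri) throws IOException {
        Path path = Paths.get(uri);
        if (!Files.isDirectory(path)) {
            return Collections.emptyList();
        }
        try (Stream<Path> children = Files.list(path)) {
            return children.map(Path::toUri).sorted().collect(Collectors.toList());
        }
    }

    // Breadth-first crawl from a root; returns the regular files reached.
    // Note: no cycle detection, so symlink loops are not handled here.
    public static List<URI> crawl(URI root) throws IOException {
        Deque<URI> queue = new ArrayDeque<>();
        List<URI> files = new ArrayList<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            URI next = queue.poll();
            if (Files.isRegularFile(Paths.get(next))) {
                files.add(next);
            }
            queue.addAll(outlinks(next));
        }
        return files;
    }
}
```

The point is that "what to crawl next" comes from listing the directory, not
from parsing any text.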
Also, it seems that Parse.java assumes you are only working with
text. How would you crawl a directory of images and index the EXIF
tags? Even considering parsing a word document (and extracting links)
-- it seems a shame that the Parse interface has to reduce everything
to setText( txt ).
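
As a strawman, something like the following would cover both cases: text is
optional, and arbitrary fields (EXIF tags, document properties) travel
alongside the outlinks. ParsedItem and all of its methods are hypothetical
names of mine, not the current Parse interface.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not the existing Droids Parse interface.
public class ParsedItem {
    private final URI source;
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private final List<URI> outlinks = new ArrayList<>();
    private String text; // stays null when the item has no text, e.g. an image

    public ParsedItem(URI source) {
        this.source = source;
    }

    public URI getSource() {
        return source;
    }

    // Text is optional rather than the only thing a parse can produce.
    public Optional<String> getText() {
        return Optional.ofNullable(text);
    }

    public void setText(String text) {
        this.text = text;
    }

    // Free-form fields such as EXIF tags or document properties.
    public void putMetadata(String name, String value) {
        metadata.put(name, value);
    }

    public Map<String, String> getMetadata() {
        return Collections.unmodifiableMap(metadata);
    }

    // Next items to crawl, whatever "link" means for this content type.
    public void addOutlink(URI link) {
        outlinks.add(link);
    }

    public List<URI> getOutlinks() {
        return Collections.unmodifiableList(outlinks);
    }
}
```

An image parser would fill only the metadata, while a Word parser could set
text, metadata and outlinks.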
Within the DefaultWorker it looks like each URI is opened twice: first
in getParse() then again in handle( Parse ). Something about that
feels wrong.
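
One way around that, sketched under the assumption that the content fits in
memory, is to read the source once and hand both steps a fresh stream over
the buffered bytes. BufferedContent is a made-up helper, not something in
Droids today.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Droids.
public class BufferedContent {
    private final byte[] bytes;

    private BufferedContent(byte[] bytes) {
        this.bytes = bytes;
    }

    // Drains the stream once; the caller remains responsible for closing it.
    public static BufferedContent readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return new BufferedContent(out.toByteArray());
    }

    // Each call returns an independent stream positioned at the start.
    public InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    public int length() {
        return bytes.length;
    }
}
```

The parse step and the handle step would each call newStream(), so the
underlying URI is only opened once. For very large files a temp file would
probably be a better backing store than a byte array.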
thanks
ryan