I'm finally able to spend a little time messing with Droids -- I'm comparing it to Aperture (http://aperture.sourceforge.net/) to figure out what the best path is.

What Aperture does nicely is let you define a crawler for anything (web, file system, iCal, IMAP, etc.) and then index from that. The problem is that RDF is deeply baked into the system, and I don't see any *good* way to extend or scale it.

Droids looks promising but, like Nutch, it seems to assume web/text crawling.

With Droids, how would you make a file system crawler?
Extend Protocol with file://?
Currently Parser->Parse->ParseData->Outlink[] defines the next items to crawl. For non-web crawling, what is the proposed model?
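
To make the question concrete, here is roughly the model I would expect for a file system: a directory's children play the role of Outlink[]. This is only a sketch using plain java.nio; FileOutlinkSketch, outlinks() and crawl() are names I made up, not anything in Droids.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch, not the Droids API.
class FileOutlinkSketch {

    // "Outlinks" of a file:// URI: the child entries of a directory, nothing for a file.
    // Symlinked directories are not descended into, so link cycles cannot loop forever.
    static List<URI> outlinks(URI uri) throws IOException {
        Path path = Paths.get(uri);
        List<URI> links = new ArrayList<>();
        if (Files.isDirectory(path, LinkOption.NOFOLLOW_LINKS)) {
            try (Stream<Path> children = Files.list(path)) {
                children.forEach(child -> links.add(child.toUri()));
            }
        }
        return links;
    }

    // Breadth-first crawl from a root URI; returns every regular file reached.
    static List<URI> crawl(URI root) throws IOException {
        List<URI> files = new ArrayList<>();
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            URI current = queue.poll();
            if (Files.isRegularFile(Paths.get(current))) {
                files.add(current);
            }
            queue.addAll(outlinks(current));
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (URI file : crawl(root.toUri())) {
            System.out.println(file);
        }
    }
}
```

If that is the intended model, then a file:// Protocol plus a "directory parser" that emits children as outlinks would seem to be enough, but I may be missing something.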

Also, it seems that Parse.java assumes you are only working with text. How would you crawl a directory of images and index the EXIF tags? Even for parsing a Word document (and extracting links), it seems a shame that the Parse interface has to reduce everything to setText( txt ).
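
What I have in mind is a parse result where text is optional and named fields sit next to it. A minimal sketch follows; StructuredParse and its methods are hypothetical names of mine, the field values are made up, and no real EXIF reading happens here.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical alternative to a text-only Parse; none of these names exist in Droids.
class StructuredParse {
    private final Map<String, String> fields = new LinkedHashMap<>();
    private final List<URI> outlinks = new ArrayList<>();
    private String text; // stays null for content with no text, such as images

    void setText(String text) { this.text = text; }
    String getText() { return text; }

    void addField(String name, String value) { fields.put(name, value); }
    Map<String, String> getFields() { return fields; }

    void addOutlink(URI uri) { outlinks.add(uri); }
    List<URI> getOutlinks() { return outlinks; }
}

class StructuredParseDemo {
    public static void main(String[] args) {
        // An image: only fields, no text. The values are invented for illustration.
        StructuredParse image = new StructuredParse();
        image.addField("exif.Model", "ExampleCam");
        image.addField("exif.DateTime", "2008:01:01 12:00:00");

        // A document: text plus the links extracted from it.
        StructuredParse doc = new StructuredParse();
        doc.setText("body of the document");
        doc.addOutlink(URI.create("http://example.org/linked"));

        System.out.println(image.getFields());
        System.out.println(doc.getOutlinks());
    }
}
```

An indexer could then map the fields to index fields directly instead of digging them back out of a text blob.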

Within the DefaultWorker it looks like each URI is opened twice: first in getParse(), then again in handle( Parse ). Something about that feels wrong.
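
One way around it would be to fetch once and hand the same bytes to both steps. The sketch below only illustrates the idea: FetchOnceSketch, open(), fetch(), parse() and handle() are stand-ins I invented, not the DefaultWorker code, and it needs Java 9+ for readAllBytes().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "open once, use twice"; not the Droids DefaultWorker.
class FetchOnceSketch {
    static int opens = 0; // counts how many times a URI is opened

    // Stand-in for opening a URI; a real worker would go through its protocol here.
    static InputStream open(String uri) {
        opens++;
        return new ByteArrayInputStream(("content of " + uri).getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole resource into memory with a single open.
    static byte[] fetch(String uri) throws IOException {
        try (InputStream in = open(uri)) {
            return in.readAllBytes();
        }
    }

    // Both steps consume the already-fetched bytes instead of reopening the URI.
    static String parse(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    static int handle(byte[] content) {
        return content.length;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = fetch("file:///tmp/example.txt");
        System.out.println(parse(content));
        System.out.println(handle(content) + " bytes, opens = " + opens);
    }
}
```

Buffering everything in memory is obviously wrong for large files, so a real version would probably need a temp file or a re-readable stream.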

thanks
ryan


