On Wed, 2008-09-03 at 14:56 +0200, Ryan McKinley wrote:
> I'm finally able to spend a little time messing with Droids -- I'm
> comparing it to Aperture (http://aperture.sourceforge.net/) to figure
> out what the best path is.
>
> The part that Aperture does nicely is that you can define a crawler
> for anything: web, file system, iCal, IMAP, etc., and then index
> based on that. The problem is that RDF is deeply baked into the
> system, and I don't see any *good* ways to extend it or scale it.
>
> Droids looks promising, but like Nutch, it seems to assume web/text
> crawling.
Actually it doesn't; the default implementation happens to be a web
crawler, but you can change the crawling behavior if you do not need
link extraction. The only protocol factory so far is, as you noticed,
the http implementation.
>
> With Droids, how would you make a file system crawler?
That depends on the use case.
If you need link extraction (a crawler droid), then I would create a
file system plugin (implementing Protocol) and define the new protocol
in droids-core-context.xml, like:
<bean name="org.apache.droids.api.Protocol/file"
class="org.apache.droids.protocol.file.File" scope="prototype"/>
Maybe I can add an implementation tonight.
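
A rough sketch of what that plugin could look like; note that the
method names (isAllowed, openStream) are my assumptions here, not
checked against org.apache.droids.api.Protocol:

package org.apache.droids.protocol.file;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/*
 * Hypothetical file system Protocol plugin. The two methods below
 * are illustrative assumptions; check org.apache.droids.api.Protocol
 * for the real signatures.
 */
public class File implements org.apache.droids.api.Protocol {

  public boolean isAllowed(String uri) {
    // A local file system has no robots.txt equivalent.
    return true;
  }

  public InputStream openStream(String uri) throws IOException {
    // Strip the file:// scheme and open the local file.
    return new FileInputStream(uri.replaceFirst("^file://", ""));
  }
}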
> Currently Parser->Parse->ParseData->Outlink[] defines the next items
> to crawl. For non-web crawling, what is the proposed model?
Do you want to extract links from the crawled documents?
If not: in some use cases I know up front all the URLs that I want to
work with in my droid, so I override initQueue() and create the queue
with custom business logic, e.g. as in the sketch below.
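
A minimal sketch of that pattern; the queue type and the setQueue()
call are assumptions about the DefaultCrawler API, made only for
illustration:

import java.util.LinkedList;
import java.util.Queue;

/*
 * Hypothetical droid that knows its work list up front. The
 * initQueue()/setQueue() names follow the description above;
 * the exact signatures are assumptions.
 */
public class FixedListCrawler extends DefaultCrawler {

  @Override
  protected void initQueue() {
    Queue<String> queue = new LinkedList<String>();
    // Custom business logic: seed the queue with known URLs
    // instead of extracting links while crawling.
    queue.add("file:///var/data/reports/report-a.doc");
    queue.add("file:///var/data/reports/report-b.doc");
    setQueue(queue);
  }
}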
> Also, it seems that Parse.java assumes you are only working with
> text.
Yes, with a textual representation of the incoming stream. However,
one top priority on the todo list is to reuse Tika for the parser
implementation.
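
To give an idea of the direction, a Tika-backed parse step could look
roughly like this (a sketch only; the wiring into the Parse interface
is left out):

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

/*
 * Sketch: let Tika detect the content type, extract the plain
 * text, and collect the document metadata for later handlers.
 */
public class TikaParseSketch {

  public static String parseToText(InputStream stream) throws Exception {
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler();
    new AutoDetectParser().parse(stream, handler, metadata);
    return handler.toString();
  }
}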
> How would you crawl a directory of images and index the EXIF
> tags?
I would create the queue by overriding the initQueue() method of the
DefaultCrawler. Then I would override the run() method of the
DefaultWorker and do the extraction of the EXIF tags in a custom
Handler plugin, along the lines of the sketch below.
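
For example (again just a sketch: the handle() signature is an
assumption, and metadata-extractor is only one library that could
read the EXIF tags):

import java.io.InputStream;

import com.drew.imaging.ImageMetadataReader;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.Tag;

/*
 * Hypothetical Handler plugin that extracts EXIF tags from an
 * image stream. The handle() signature is assumed for illustration.
 */
public class ExifHandler {

  public void handle(String uri, InputStream stream) throws Exception {
    Metadata metadata = ImageMetadataReader.readMetadata(stream);
    for (Directory directory : metadata.getDirectories()) {
      for (Tag tag : directory.getTags()) {
        // Push each tag into the index instead of printing it.
        System.out.println(uri + ": " + tag.getTagName()
            + " = " + tag.getDescription());
      }
    }
  }
}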
> Even considering parsing a Word document (and extracting links)
> -- it seems a shame that the Parse interface has to reduce everything
> to setText( txt ).
I am very open to feedback on enhancing the API. Some parts are a bit
historical and need a review. If you have a suggestion to enhance it,
I welcome every bit of help.
>
> Within the DefaultWorker it looks like each URI is opened twice: first
> in getParse(), then again in handle( Parse ). Something about that
> feels wrong.
You mean it would be more efficient to open the stream once and reuse
it later? Yeah, I agree.
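
One possible shape for the fix, purely as a sketch (not tied to the
current DefaultWorker code): read the content once, buffer the bytes,
and give getParse() and handle(Parse) each their own stream.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/*
 * Sketch: read the content once, keep the bytes, and hand each
 * consumer an independent stream instead of re-opening the URI.
 */
public final class BufferedContent {

  private final byte[] bytes;

  public BufferedContent(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    bytes = out.toByteArray();
  }

  // Each call returns a fresh stream over the same bytes.
  public InputStream openStream() {
    return new ByteArrayInputStream(bytes);
  }
}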
salu2
>
> thanks
> ryan
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions