On Wed, 2008-09-03 at 14:56 +0200, Ryan McKinley wrote:
> I'm finally able to spend a little time messing with droids -- I'm
> comparing it to aperture (http://aperture.sourceforge.net/) to figure
> out what the best path is.
> 
> The part that aperture does nicely is that you can define a crawler for
> anything: web, file system, ical, imap, etc., then index based on
> that.  The problem is that RDF is deeply baked into the system and I
> don't see any *good* ways to extend it / scale it.
> 
> Droids looks promising, but like nutch, it seems to assume web/text  
> crawling.

Actually it does not: the default implementation happens to be a web
crawler, but you can change the crawling behavior if you do not need
link extraction. The only protocol factory so far is, as you noticed,
the HTTP implementation.

> 
> With droids how would you make a file system crawler?

That depends on the use case. 

If you need link extraction (a crawler droid), then I would create a
file system plugin (implementing Protocol) and define the new protocol
in droids-core-context.xml like:
<bean name="org.apache.droids.api.Protocol/file"
    class="org.apache.droids.protocol.file.File" scope="prototype"/>

Maybe I can add an implementation tonight. 
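
Just to illustrate the shape of such a plugin, here is a minimal
sketch. Note that the two methods shown (isAllowed/openStream) are my
assumption of what Protocol declares, not the actual droids API --
adjust them to whatever the real interface requires:

package org.apache.droids.protocol.file;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.droids.api.Protocol;

/*
 * Sketch of a file system Protocol plugin. The method signatures are
 * assumptions about the Protocol interface, not the actual droids API.
 */
public class File implements Protocol {

  /* Allow everything; a real implementation could honor include/exclude filters. */
  public boolean isAllowed(String url) {
    return true;
  }

  /* Open the local file behind a file:// url. */
  public InputStream openStream(String url) throws IOException {
    // strip the scheme, e.g. "file:///tmp/foo.txt" -> "/tmp/foo.txt"
    String path = url.replaceFirst("^file:(//)?", "");
    return new FileInputStream(path);
  }
}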

> Currently Parser->Parse->ParseData->Outlink[] defines the next items  
> to crawl. For non web crawling, what is the proposed model?

Do you want to extract links from the crawled documents? 

If not: in some use cases I know all the URLs that I want to work with
in my droid, so I override initQueue() and create the queue with custom
business logic, roughly as in the sketch below.
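
For example (the surrounding class and the Queue type here are
illustrative only; initQueue() is the actual hook):

import java.util.LinkedList;
import java.util.Queue;

/*
 * Sketch: a droid that already knows all its URLs fills the queue up
 * front instead of discovering links. Everything except initQueue()
 * is placeholder code, not the actual droids classes.
 */
public class FixedUrlDroid /* extends the default crawler droid */ {

  private final Queue<String> queue = new LinkedList<String>();

  /* Override point: build the queue with custom business logic. */
  protected void initQueue() {
    queue.add("file:///var/data/reports/report.doc");
    queue.add("file:///var/data/images/");
    // ... whatever else the droid should work on
  }
}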

> Also, it seems that Parse.java assumes you are only working with  
> text.  

With a textual representation of the incoming stream, yes. However, one
top priority on the todo list is to reuse Tika for the parser
implementation.
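
To give an idea of what that could look like, a sketch of the Tika call
(standard AutoDetectParser usage; how this would plug into the droids
Parser/Parse interfaces is left open here, and the wrapper class is
hypothetical):

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParseSketch {

  /* Extract plain text plus metadata from any stream Tika understands. */
  public static String parseToText(InputStream stream, Metadata metadata)
      throws Exception {
    BodyContentHandler handler = new BodyContentHandler();
    new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
    return handler.toString();
  }
}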

> How would you crawl a directory of images and index the EXIF  
> tags?  

I would create the queue by overriding the initQueue() method of the
DefaultCrawler. Then I would override the run() method of the
DefaultWorker and do the extraction of the EXIF tags in a custom
Handler plugin, along the lines of the sketch below.
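
Here is a sketch of the extraction step such a Handler could perform,
using the metadata-extractor library as one option among several (how
the tags then get handed to the indexer is up to the Handler
implementation):

import java.io.File;

import com.drew.imaging.ImageMetadataReader;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.Tag;

public class ExifSketch {

  /* Dump all EXIF directories/tags of an image file. */
  public static void printExifTags(File image) throws Exception {
    Metadata metadata = ImageMetadataReader.readMetadata(image);
    for (Directory directory : metadata.getDirectories()) {
      for (Tag tag : directory.getTags()) {
        // e.g. "Exif IFD0 / Model = Canon EOS 40D"
        System.out.println(directory.getName() + " / "
            + tag.getTagName() + " = " + tag.getDescription());
      }
    }
  }
}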

> Even considering parsing a word document (and extracting links)  
> -- it seems a shame that the Parse interface has to reduce everything  
> to setText( txt ).

I am very open to feedback on enhancing the API. Some parts are a bit
historical and need a review. If you have a suggestion to enhance it, I
welcome every bit of help.

> 
> Within the DefaultWorker it looks like each uri is opened twice: first  
> in getParse() then again in handle( Parse ).  Something about that  
> feels wrong.

You mean it would be more efficient to open the stream once and reuse
it later? Yeah, I agree. Something along the lines of the sketch below
should do.
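
That is, buffer the content once and feed both the parse and the handle
step from the buffer (placeholder code, not the actual DefaultWorker):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public final class FetchOnce {

  /* Read the remote/local stream fully into memory -- exactly once. */
  public static byte[] buffer(InputStream in) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) != -1;) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  /* Both getParse() and handle() can then read from their own copy. */
  public static InputStream reopen(byte[] content) {
    return new ByteArrayInputStream(content);
  }
}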

salu2

> 
> thanks
> ryan
> 
> 
> 
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions

