On Sun, 2008-09-14 at 17:36 -0400, Ryan McKinley wrote:
> On Sep 13, 2008, at 6:06 PM, Thorsten Scherler wrote:
> > On Fri, 2008-09-12 at 10:12 -0400, Grant Ingersoll wrote:
...
> >> 1. IMAP and other mail stores (even things like PST files, etc.)
> >
> > One can add new protocol implementation in no time.
> >
>
> I'm still wrapping my head around what the best API for this kind of
> extension is.  Unless I'm missing something, simply adding a protocol
> does not really help.  The protocol API just lets you convert a url to
> InputStream... what about listing other mail messages on the server?
> Likewise adding FileProtocol does not implement file system
> crawling... it requires special worker/crawler logic to add new File
> based tasks.

Yes, and no. You are right that the main purpose of the protocol API is
to open a stream. However, crawling is crawling. The problem at the
moment with the FileProtocol implementation is that we do not resolve
the MIME type, which makes our current parser useless (this will change
as soon as we have full Tika support).

I guess you are not referring to crawling but to racing, where you add
all tasks to the queue up front and do not extract any further tasks
from a task.

> The "protocol" approach only seems appropriate for web crawling... in
> general it may not always be good to go from URL->InputStream.

How can we enhance this? What methods are missing?

> As I
> have pointed to before, I'm not sure the Parse API is generally
> applicable.  Also see LABS-149

I guess we agree that we should drop our parser API and use the Tika
one.

>
> I started trying to summarize what I think the overall API is/should
> be, but it gets too long so I will start a new thread with that...

Very good.

>
> To make Droids useful and immediately accessible, "out of the box"
> Droids should come with three crawler implementations:
>
> 1. simple web crawler (like HelloCrawler)
>   * point to a website, get all outlinks and process the content at
> each "page"

The HelloCrawler needs some enhancements, but yes, it should be the
reference implementation of a web crawler.

> 2. simple file system crawler ( file.listFiles( FileNameFilter() )
>   * point to a directory on disk, and process all sub directories/
> files

I will create a file system racer that handles all files in a
directory. However, file system crawling should be part of the
HelloCrawler, since it should support various protocols out of the box.

> 3. IMAP directory crawler
>   * point to a IMAP directory and process each directory/message

I have to admit it has been a while since I last developed against an
IMAP server. However, an IMAP racer is indeed a nice example.
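
To make the "racer" idea a bit more concrete, here is a rough sketch in
plain Java. It is only an illustration under my own assumptions, not
Droids code: the FileRacer class, its race method and the handler
callback are hypothetical names, and it uses java.nio directly instead
of the protocol API. The point it shows is that all tasks are queued up
front and that processing a task never adds new tasks to the queue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.stream.Stream;

/** Hypothetical illustration only; not part of the Droids API. */
final class FileRacer {

    /**
     * Queues every regular file below dir up front, then drains the queue.
     * Unlike a crawler, handling a task never adds new tasks to the queue.
     *
     * @return the number of files handed to the handler
     */
    static int race(Path dir, Consumer<Path> handler) throws IOException {
        Queue<Path> queue = new ArrayDeque<>();

        // Step 1: collect the finite set of tasks (all files, recursively).
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile).forEach(queue::add);
        }

        // Step 2: process each task; nothing is extracted from a task.
        int processed = 0;
        Path task;
        while ((task = queue.poll()) != null) {
            handler.accept(task);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        int n = race(dir, p -> System.out.println("processing " + p));
        System.out.println(n + " file(s) processed");
    }
}
```

A crawler would differ only in step 2: its handler would be allowed to
push newly discovered tasks (for example outlinks) back onto the queue.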

As you noticed, I used the word "crawler" only once, since for me what
defines a crawler is that it extracts new tasks from the pages it
visits. An IMAP racer does not extract any tasks from an email; it
processes a finite number of files (all the files in a directory).

> If the Droids API handles these three cases clearly and simply, it
> should be easy to get lots of folks involved.

I agree. :)

salu2
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java consulting, training and solutions
