Hi,

I would suggest two sub-projects:

1. Crawler - retrieving documents, wherever they are.

2. DocumentHandler - extracting text, creating appropriate fields, etc.

The second is a layer on top of Lucene. The first is an autonomous package
which should integrate nicely with Lucene and the DocumentHandler, but should
also be usable in other projects (see the sketch below).
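To make the DocumentHandler side more concrete, here is a minimal sketch of
what such a layer could look like. The interface, its methods and the
PlainTextHandler are my own assumptions for illustration, not an existing
API; only Document, Field.Text and Field.UnIndexed are actual Lucene calls:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Hypothetical interface: turns raw content fetched by the crawler
    // into a Lucene Document. One implementation per content type
    // (plain text, HTML, PDF, ...).
    public interface DocumentHandler {
        // true if this handler understands the given MIME type
        boolean canHandle(String mimeType);

        // extract the text and build a Document with appropriate fields
        Document handle(String url, InputStream content) throws Exception;
    }

    // A trivial handler for plain text, just to show the idea.
    class PlainTextHandler implements DocumentHandler {
        public boolean canHandle(String mimeType) {
            return mimeType.startsWith("text/plain");
        }

        public Document handle(String url, InputStream content)
                throws Exception {
            BufferedReader in =
                new BufferedReader(new InputStreamReader(content));
            StringBuffer text = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
            Document doc = new Document();
            doc.add(Field.UnIndexed("url", url));      // stored, not indexed
            doc.add(Field.Text("contents", text.toString())); // tokenized
            return doc;
        }
    }

The crawler would then only pick a handler by MIME type, which keeps it
decoupled from Lucene and usable in other projects, as suggested above.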

I've included my code to show you what I've done. It isn't too useful on its
own yet, because it is integrated into our product, but you can get the idea.
I've actually written two things:

1: A robot that crawls a remote server via HTTP and writes all the data to the
local filesystem, then imports it into our database while (at the same time)
replacing all links with internal links. So we can emulate a website from
this crawled data! A rough sketch of the core loop follows below, after
point 2.
[com.synformation.script.utilities.importtool]

2: (I've rewritten some of the code from 1 for this, so it is much cleaner.) A
customer needs a tool for importing local mini-websites from the file system
via an applet, sending them to the web server, and importing them as described
in point 1. I've tried to write it in a way that it could also include the
functionality of point 1 (retrieving via HTTP), but that is mostly untested.
[com.synformation.script.utilities.fileimport]
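
For anyone who doesn't want to dig through the zip right away, the
crawl-and-rewrite loop of point 1 boils down to roughly the following. This
is a simplified sketch, not the attached code: the class and helper names are
made up, the link extraction is naive, and the real robot does much more
(database import, error handling, staying on one host):

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of the crawl-and-rewrite loop: fetch a page, queue the links
    // it contains, rewrite links to internal ones, save it locally.
    public class CrawlerSketch {
        public static void main(String[] args) throws Exception {
            List<String> queue = new ArrayList<String>();
            Set<String> seen = new HashSet<String>();
            queue.add(args[0]); // start URL

            while (!queue.isEmpty()) {
                String url = queue.remove(0);
                if (!seen.add(url)) continue; // already fetched

                // fetch the page
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream()));
                StringBuffer page = new StringBuffer();
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
                in.close();
                String html = page.toString();

                // naive link extraction; a real robot would use an HTML
                // parser and restrict itself to the start server
                int pos = 0;
                while ((pos = html.indexOf("href=\"", pos)) != -1) {
                    int end = html.indexOf('"', pos + 6);
                    if (end == -1) break;
                    String link = html.substring(pos + 6, end);
                    if (link.startsWith("http://")) {
                        queue.add(link);
                    }
                    pos = end;
                }

                // crude rewrite of absolute links to internal ones
                String rewritten = html.replace("http://", "/internal/");
                FileWriter out = new FileWriter(localNameFor(url));
                out.write(rewritten);
                out.close();
            }
        }

        // map a URL to a local file name (very crude)
        static String localNameFor(String url) {
            return url.replace("http://", "").replace('/', '_') + ".html";
        }
    }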

I'm not saying that you (we) should use this, but I think it's time to move to
more concrete plans. I'm interested in helping with the crawler part.


Regards,

Manfred




Attachment: utilities.zip
Description: Zip compressed data
