Leo,

Have you started this project? Where is it hosted? It would be nice to see a few alternative implementations of a robust and scalable Java web crawler with the ability to index whatever it fetches.
Thanks,
Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I would like to write $SUBJ (HCDC), because LARM does not offer many
> options which are required by web/http crawling IMHO. Here is my
> list:
>
> 1. I would like to manage the decision what will be gathered first -
>    this would be based on pageRank, number of errors, connection speed
>    etc. etc.
> 2. pure JAVA solution without any DBMS/JDBC
> 3. better configuration in case of an error
> 4. NIO style as it is suggested by LARM specification
> 5. egothor's filters for automatic processing of various data formats
> 6. management of "Expires" HTTP-meta headers, heuristic rules which
>    will describe how fast a page can expire (.php often expires faster
>    than .html)
> 7. reindexing without any data exports from a full-text index
> 8. open protocol between the crawler and a full-text engine
>
> If anyone wants to join (or just extend the wish list), let me know,
> please.
>
> -g-
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
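
To make item 1 of the wish list above (deciding what gets gathered first) a bit more concrete, here is a minimal sketch of how such a crawl queue might look in plain Java with no DBMS behind it. This is only an illustration, not HCDC or LARM code: the names CrawlFrontier, Candidate and score(), the assumed 0.0 to 1.0 pageRank range, and the weights in the scoring formula are all hypothetical and would need tuning against real crawl data.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical crawl frontier: hands out the most promising URL first. */
final class CrawlFrontier {

    /** One queued URL plus the signals named in the wish list. */
    static final class Candidate {
        final String url;
        final double pageRank;     // assumed range 0.0 .. 1.0, higher is better
        final int errorCount;      // failed fetch attempts so far
        final long avgResponseMs;  // rough measure of connection speed

        Candidate(String url, double pageRank, int errorCount, long avgResponseMs) {
            this.url = url;
            this.pageRank = pageRank;
            this.errorCount = errorCount;
            this.avgResponseMs = avgResponseMs;
        }

        /** Illustrative weights only (an assumption, not a tuned formula). */
        double score() {
            // Cap the slowness penalty so one very slow host cannot dominate.
            double speedPenalty = Math.min(avgResponseMs / 1000.0, 5.0) * 0.05;
            return pageRank - 0.2 * errorCount - speedPenalty;
        }
    }

    // Highest score first: reverse the natural ascending order of the scores.
    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>(Comparator.comparingDouble(Candidate::score).reversed());

    void add(Candidate c) {
        queue.add(c);
    }

    /** Returns the highest-scoring candidate, or null when the frontier is empty. */
    Candidate next() {
        return queue.poll();
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        frontier.add(new Candidate("http://example.org/slow.php", 0.9, 3, 4000));
        frontier.add(new Candidate("http://example.org/index.html", 0.6, 0, 200));
        frontier.add(new Candidate("http://example.org/about.html", 0.3, 0, 100));

        // Drain the frontier in priority order.
        while (!frontier.isEmpty()) {
            Candidate c = frontier.next();
            System.out.println(c.url + " score=" + c.score());
        }
    }
}
```

With these made-up weights, index.html is fetched first, then about.html, and the high-pageRank but error-prone and slow slow.php comes last. The in-memory PriorityQueue keeps the sketch free of JDBC, but it is not thread-safe and does not persist anything, so a real crawler would need more than this.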