Hi Otis.

The first beta is done (without NIO). However, it needs further testing. Unfortunately, I could not find enough servers that I am allowed to hit.

I wanted to commit the robot as a part of egothor (which will use it in PULL mode), but the weather here is nice, so I lost all motivation to sit at the PC ;-).

What interface do you need for Lucene? Will you use PUSH mode (the robot modifies Lucene's index) or PULL mode (the engine gets deltas from the robot)? Tell me what you need and I will do my best.
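
To make the PUSH/PULL question concrete, here is a minimal sketch of what the two modes could look like. Every name in it (Delta, IndexSink, DeltaSource, DeltaLog) is hypothetical and comes from neither the robot, egothor, nor Lucene. It only illustrates the difference: in PUSH mode the robot calls into the index, in PULL mode the engine asks the robot for the changes it has not yet seen.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of one change observed by the robot. */
final class Delta {
    enum Kind { ADDED, MODIFIED, DELETED }

    final Kind kind;
    final String url;
    final String content; // null for DELETED

    Delta(Kind kind, String url, String content) {
        this.kind = kind;
        this.url = url;
        this.content = content;
    }
}

/** PUSH mode (assumed shape): the robot calls this to modify the index. */
interface IndexSink {
    void apply(Delta delta);
}

/** PULL mode (assumed shape): the engine asks for deltas it has not consumed yet. */
interface DeltaSource {
    /** Returns all deltas recorded after the first {@code sequence} ones. */
    List<Delta> deltasSince(long sequence);

    /** Number of deltas recorded so far; pass it to the next deltasSince call. */
    long latestSequence();
}

/** In-memory log on the robot side that can serve both modes. */
final class DeltaLog implements DeltaSource {
    private final List<Delta> log = new ArrayList<>();
    private final List<IndexSink> sinks = new ArrayList<>();

    /** Registers a PUSH consumer. */
    void subscribe(IndexSink sink) {
        sinks.add(sink);
    }

    /** Stores the delta for PULL consumers and forwards it to PUSH consumers. */
    void record(Delta delta) {
        log.add(delta);
        for (IndexSink sink : sinks) {
            sink.apply(delta);
        }
    }

    @Override
    public List<Delta> deltasSince(long sequence) {
        // Clamp the requested position into the valid range of the log.
        int from = (int) Math.max(0, Math.min(sequence, log.size()));
        return new ArrayList<>(log.subList(from, log.size()));
    }

    @Override
    public long latestSequence() {
        return log.size();
    }
}
```

With a shape like this, a PUSH consumer would register through subscribe(), while a PULL consumer would remember the last value of latestSequence() and periodically call deltasSince() with it.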

-g-


Otis Gospodnetic wrote:


Leo,

Have you started this project?  Where is it hosted?
It would be nice to see a few alternative implementations of a robust
and scalable java web crawler with the ability to index whatever it
fetches.

Thanks,
Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:


Hi.

I would like to write $SUBJ (HCDC), because, IMHO, LARM lacks many options that web/HTTP crawling requires. Here is my list:


1. control over what is gathered first, based on pageRank, number of errors, connection speed, etc.
2. a pure Java solution without any DBMS/JDBC
3. better configuration in case of an error
4. NIO style, as suggested by the LARM specification
5. egothor's filters for automatic processing of various data formats
6. management of "Expires" HTTP/meta headers, plus heuristic rules that describe how fast a page can expire (.php often expires faster than .html)
7. reindexing without any data exports from a full-text index
8. an open protocol between the crawler and a full-text engine
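
Items 1 and 6 of the list can be illustrated with a small sketch. The class names, the scoring weights, and the revisit intervals below are all assumptions made up for illustration; nothing here is taken from LARM, egothor, or the robot itself. The sketch orders candidate URLs by a score built from pageRank, error count, and connection speed, and picks a revisit delay from an explicit expiry time when one is known, falling back to a file-extension rule otherwise.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical description of one URL waiting to be fetched. */
final class CrawlCandidate {
    final String url;
    final double pageRank;   // higher means more important
    final int errorCount;    // number of past failed fetches
    final long avgLatencyMs; // observed connection speed of the host

    CrawlCandidate(String url, double pageRank, int errorCount, long avgLatencyMs) {
        this.url = url;
        this.pageRank = pageRank;
        this.errorCount = errorCount;
        this.avgLatencyMs = avgLatencyMs;
    }

    /** Illustrative score; the weights are arbitrary assumptions. */
    double score() {
        return pageRank - 0.1 * errorCount - 0.001 * avgLatencyMs;
    }
}

/** Frontier that always hands out the best-scored candidate first. */
final class Frontier {
    private static final Comparator<CrawlCandidate> BEST_FIRST =
            (a, b) -> Double.compare(b.score(), a.score());

    private final PriorityQueue<CrawlCandidate> queue = new PriorityQueue<>(BEST_FIRST);

    void add(CrawlCandidate candidate) {
        queue.add(candidate);
    }

    /** Returns the next URL to fetch, or null if the frontier is empty. */
    CrawlCandidate next() {
        return queue.poll();
    }
}

/** Decides how long to wait before a page is fetched again. */
final class ExpiryHeuristic {
    private static final long HOUR_MS = 60L * 60L * 1000L;
    private static final long DAY_MS = 24L * HOUR_MS;

    /**
     * @param expiresAtMs expiry time taken from an "Expires" header, or null if absent
     * @param nowMs       current time in milliseconds
     */
    static long revisitAfterMs(String url, Long expiresAtMs, long nowMs) {
        // An explicit expiry time in the future wins over any guess.
        if (expiresAtMs != null && expiresAtMs > nowMs) {
            return expiresAtMs - nowMs;
        }
        String path = url.toLowerCase();
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        // Assumed intervals: dynamic pages are revisited sooner than static ones.
        if (path.endsWith(".php")) {
            return HOUR_MS;
        }
        if (path.endsWith(".html") || path.endsWith(".htm")) {
            return 7 * DAY_MS;
        }
        return DAY_MS;
    }
}
```

A real crawler would tune the weights and intervals from observed data; the point is only that both decisions fit in plain Java data structures, in line with item 2.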


If anyone wants to join (or just extend the wish list), let me know,
please.

-g-


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]













