Re: High Capacity (Distributed) Crawler

Otis Gospodnetic Mon, 09 Jun 2003 12:44:53 -0700

Leo,

Have you started this project?  Where is it hosted?
It would be nice to see a few alternative implementations of a robust
and scalable Java web crawler that can index whatever it fetches.


Thanks,
Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> Hi.
> 
> I would like to write $SUBJ (HCDC), because LARM does not offer many 
> options which are required by web/http crawling IMHO. Here is my
> list:
> 
> 1. I would like to control what gets gathered first; this would be
> based on PageRank, number of errors, connection speed, etc.
> 2. pure Java solution without any DBMS/JDBC
> 3. better configuration in case of an error
> 4. NIO style as it is suggested by LARM specification
> 5. egothor's filters for automatic processing of various data formats
> 6. management of "Expires" HTTP-meta headers, heuristic rules which
> will describe how fast a page can expire (.php often expires faster
> than .html)
> 7. reindexing without any data exports from a full-text index
> 8. open protocol between the crawler and a full-text engine
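
For item 1, a minimal sketch of what "deciding what is gathered first" could look like, assuming a single in-memory priority queue. The class names (CrawlFrontier, Candidate), the score() rule and its weights are all hypothetical and only illustrate the idea; they are not taken from LARM or from any existing HCDC code.

```java
import java.util.PriorityQueue;

/** Sketch of a crawl frontier that hands out the most promising URL first. */
final class CrawlFrontier {

    /** A URL waiting to be fetched, with the three signals named in item 1. */
    static final class Candidate {
        final String url;
        final double pageRank;     // link-based importance, higher is better
        final int errorCount;      // failed fetch attempts so far
        final long avgLatencyMs;   // observed connection speed of the host

        Candidate(String url, double pageRank, int errorCount, long avgLatencyMs) {
            this.url = url;
            this.pageRank = pageRank;
            this.errorCount = errorCount;
            this.avgLatencyMs = avgLatencyMs;
        }

        /** Hypothetical scoring rule; the weights are illustrative only. */
        double score() {
            return pageRank - 0.1 * errorCount - 0.001 * avgLatencyMs;
        }
    }

    // Highest score first, so the comparison is reversed (b before a).
    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));

    void add(Candidate c) {
        queue.add(c);
    }

    /** Returns the next URL to fetch, or null when the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }
}
```

A distributed version would presumably need one such queue per host or per node, plus politeness delays; that is outside this sketch.
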
> 
> If anyone wants to join (or just extend the wish list), let me know,
> please.
> 
> -g-
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
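
For item 6, the expiry heuristic might be sketched roughly as below. It assumes the caller has already turned an "Expires" header (if any) into a remaining lifetime; the class name ExpiryPolicy, the method revisitAfter and the three fallback intervals are hypothetical placeholders, not measured values.

```java
import java.time.Duration;
import java.util.Locale;

/** Sketch of a revisit policy: trust a declared lifetime, else guess by extension. */
final class ExpiryPolicy {

    // Illustrative defaults only; real values would need tuning.
    static final Duration DYNAMIC_TTL = Duration.ofHours(1);  // e.g. .php
    static final Duration STATIC_TTL  = Duration.ofDays(7);   // e.g. .html
    static final Duration DEFAULT_TTL = Duration.ofDays(1);   // everything else

    /**
     * @param path        URL path, optionally with a query string
     * @param declaredTtl lifetime derived from an "Expires" header, or null if absent
     * @return how long to wait before fetching the page again
     */
    static Duration revisitAfter(String path, Duration declaredTtl) {
        // A positive declared lifetime wins over any heuristic.
        if (declaredTtl != null && !declaredTtl.isNegative() && !declaredTtl.isZero()) {
            return declaredTtl;
        }
        // Drop the query string so "/index.php?id=3" is still seen as .php.
        int q = path.indexOf('?');
        String p = (q >= 0 ? path.substring(0, q) : path).toLowerCase(Locale.ROOT);
        if (p.endsWith(".php")) {
            return DYNAMIC_TTL;
        }
        if (p.endsWith(".html") || p.endsWith(".htm")) {
            return STATIC_TTL;
        }
        return DEFAULT_TTL;
    }
}
```

For example, revisitAfter("/index.php?id=3", null) falls back to the one-hour guess, while a page that declares a five-minute lifetime is revisited after five minutes regardless of its extension.
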


