Re: High Capacity (Distributed) Crawler

Otis Gospodnetic Tue, 10 Jun 2003 00:09:56 -0700

Leo,

> The first beta is done (without NIO). It needs, however, further 
> testing. Unfortunatelly, I could not find enough servers which I may
> hit.


Nice.  Pretty much any site is a candidate, as long as you are nice to
it.
You could, for instance, hit all dmoz URLs.  Or you could extract a set
of links from Yahoo.  Or you could try finding that small and large set
of URLs that Google provided a while ago for their Google Challenge.

> I wanted to commit the robot as a part of egothor (it will use it in 
> PULL mode), but we have a nice weather here, so I lost any motivation
> to play with PC ;-).

Yes, I hear some places in central Europe having temperatures of 36-38
C.  Hot!
We are not that lucky in NYC this year :(  Lots of rain and cloudy
weather, which is atypical.

> What interface do you need for Lucene? Will you use PUSH (=the robot 
> will modify Lucene's index) or PULL (=the engine will get deltas from
> 
> the robot) mode? Tell me what you need and I will try to do all my
> best.

I'd imagine one would want to use it in the PUSH mode (e.g. the crawler
fetches a web page and adds it to the searchable index).
How does PULL mode work?  I've never heard of web crawlers being used
in the PULL mode.  What exactly does that mean, could you please
describe it?

Thanks,
Otis


> Otis Gospodnetic wrote:
> 
> >Leo,
> >
> >Have you started this project?  Where is it hosted?
> >It would be nice to see a few alternative implementations of a
> robust
> >and scalable java web crawler with the ability to index whatever it
> >fetches.
> >
> >Thanks,
> >Otis
> >
> >--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> >  
> >
> >>Hi.
> >>
> >>I would like to write $SUBJ (HCDC), because LARM does not offer
> many 
> >>options which are required by web/http crawling IMHO. Here is my
> >>list:
> >>
> >>1. I would like to manage the decision what will be gathered first
> - 
> >>this would be based on pageRank, number of errors, connection speed
> >>etc. 
> >>etc.
> >>2. pure JAVA solution without any DBMS/JDBC
> >>3. better configuration in case of an error
> >>4. NIO style as it is suggested by LARM specification
> >>5. egothor's filters for automatic processing of various data
> formats
> >>6. management of "Expires" HTTP-meta headers, heuristic rules which
> >>will 
> >>describe how fast a page can expire (.php often expires faster than
> >>.html)
> >>7. reindexing without any data exports from a full-text index
> >>8. open protocol between the crawler and a full-text engine
> >>
> >>If anyone wants to join (or just extend the wish list), let me
> know,
> >>please.
> >>
> >>-g-
> >>
> >>
>
>>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail:
> [EMAIL PROTECTED]
> >>
> >>    
> >>
> >
> >
> >__________________________________
> >Do you Yahoo!?
> >Yahoo! Calendar - Free online calendar with sync to Outlook(TM).
> >http://calendar.yahoo.com
> >
>
>---------------------------------------------------------------------
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >  
> >
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


__________________________________
Do you Yahoo!?
Yahoo! Calendar - Free online calendar with sync to Outlook(TM).
http://calendar.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: High Capacity (Distributed) Crawler

Reply via email to