Paul Sutter wrote:
> I think that Nutch has to solve the problem: if you leave the problem to the
> websites, they're more likely to cut you off than they are to implement
> their own index storage scheme. Besides, they'd get it wrong, have stale
> data, etc.

Agreed.

> Maybe what is needed is brainstorming on a shared crawling scheme
> implemented in Nutch. Maybe something based on a bittorrent-like protocol?

I am not sure I understand. Can you explain a bit?

What comes to my mind is a server (service) acting as an index pointer/referrer.
Let's say I have indexed the NYT today. I would then notify this server about it
and also tell it where the index can be retrieved from. Somebody else could
first contact this server and check whether somebody has recently indexed the NYT.

Of course, one would still have the problem of whether the index can be trusted.

Michi

> incrediBILL seems to have a pretty good point.
>
> -----Original Message-----
> From: Michael Wechner [mailto:[EMAIL PROTECTED]
> Sent: Thursday, June 15, 2006 12:30 AM
> To: [email protected]
> Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
>
> Doug Cutting wrote:
> > http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html
> >
> well, I think incrediBILL has an argument, that people might really
> start excluding bots from their servers if it's
> becoming too much. What might help is that incrediBILL would offer an
> index of the site, which should be smaller
> than the site itself. I am not sure if there exists a "standard" for
> something like this. Basically the bot would ask the
> server if an index exists and where it is located and what the date it
> is from and then the bot decides to download the index
> or otherwise starts crawling the site.
>
> Michi

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
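
To make the index pointer idea from the message above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: `IndexRecord`, `IndexPointerService`, `plan_fetch`, and the example URL are hypothetical names, nothing here is an existing Nutch API, and the registry lives in memory rather than behind a network service. It also does nothing about the open question of whether an announced index can be trusted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class IndexRecord:
    """One announcement: who was indexed, where the index is, and when."""
    site: str             # site that was indexed, e.g. "nytimes.com"
    index_url: str        # where the finished index can be retrieved from
    indexed_at: datetime  # date of the crawl behind that index


class IndexPointerService:
    """Hypothetical in-memory registry of announced indexes (not part of Nutch)."""

    def __init__(self) -> None:
        self._records: Dict[str, IndexRecord] = {}

    def announce(self, record: IndexRecord) -> None:
        # Keep only the most recent announcement per site.
        current = self._records.get(record.site)
        if current is None or record.indexed_at > current.indexed_at:
            self._records[record.site] = record

    def lookup(self, site: str, max_age: timedelta,
               now: datetime) -> Optional[IndexRecord]:
        # Return the record only if it is recent enough for the caller.
        record = self._records.get(site)
        if record is not None and now - record.indexed_at <= max_age:
            return record
        return None


def plan_fetch(service: IndexPointerService, site: str,
               max_age: timedelta, now: datetime) -> str:
    """Decide between downloading a shared index and crawling the site."""
    record = service.lookup(site, max_age, now)
    if record is None:
        return f"crawl {site}"
    return f"download {record.index_url}"


if __name__ == "__main__":
    service = IndexPointerService()
    now = datetime(2006, 6, 15, 12, 0)
    service.announce(IndexRecord("nytimes.com",
                                 "http://example.org/indexes/nyt",
                                 now - timedelta(hours=3)))
    print(plan_fetch(service, "nytimes.com", timedelta(days=1), now))
    print(plan_fetch(service, "example.com", timedelta(days=1), now))
```

The crawler asks the service first and falls back to crawling only when no sufficiently fresh index has been announced; `max_age` is left to the caller because different crawlers will tolerate different amounts of staleness.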
