Re: webdb - "orphaned" pages?

2005-08-12 Thread Raymond Creel
the page > if page becomes > unavailable for some number of fetch attempts. > Regards > Piotr > > On 8/10/05, Raymond Creel <[EMAIL PROTECTED]> > wrote: > > I have a question about the webdb and fetching. > When > > a page that used to have incoming links i

webdb - "orphaned" pages?

2005-08-09 Thread Raymond Creel
I have a question about the webdb and fetching. When a page that used to have incoming links is found to be "orphaned" (i.e. there are no longer any pages that have links to it), is it deleted from the webdb? Or is it left in the webdb but set not to be refetched? Or will it continue to be refet

Re: [Nutch-general] html parser + relative urls

2005-07-27 Thread Raymond Creel
lp raymond --- [EMAIL PROTECTED] wrote: > I think Nutch is behaving correctly. > Maybe that page has a BASE URL (view source, look at > the HEAD elements) > that throws off one or the other. > > Otis > > > --- Raymond Creel <[EMAIL PROTECTED]> wrote: > > >

RE: fetch bandwidth settings

2005-07-27 Thread Raymond Creel
> What website are you working on? Many different ones, each with their own nutch configurations, which is why I'm trying to figure out how to tweak the fetcher so it maximizes speed while minimizing errors and webmaster annoyance. :) Currently it seems to be working pretty well with just using

html parser + relative urls

2005-07-27 Thread Raymond Creel
Has anyone experienced a problem with the way the standard html parser plugin handles relative urls? There is a site where the home page is something like http://www.x.com/x.cgi and when browsing a link with its href set to '?paramname=paramvalue' a browser will naturally take you to
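
The resolution rule behind this question can be checked outside Nutch. The following is a minimal sketch, assuming Python's standard `urllib.parse.urljoin` (which follows RFC 3986) as a stand-in for what a browser does with a query-only href. It is not the Nutch html parser plugin, and the base URL and href are the placeholders used in the message.

```python
from urllib.parse import urljoin

# Placeholder values taken from the message, not a real site.
base = "http://www.x.com/x.cgi"
href = "?paramname=paramvalue"

# Under RFC 3986 a reference that consists only of a query keeps the
# base path and replaces just the query component.
resolved = urljoin(base, href)
print(resolved)  # http://www.x.com/x.cgi?paramname=paramvalue
```

A resolver that instead drops the last path segment (yielding a URL without `x.cgi`) would be one possible source of the mismatch described, though the message preview does not say which result Nutch produced.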

RE: fetch bandwidth settings

2005-07-26 Thread Raymond Creel
d the > target server will save > on bandwidth in fact ;) > > > -Original Message- > From: Raymond Creel [mailto:[EMAIL PROTECTED] > Sent: Monday, July 25, 2005 4:00 PM > To: nutch-user@lucene.apache.org > Subject: fetch bandwidth settings > > I have read th

fetch bandwidth settings

2005-07-25 Thread Raymond Creel
I have read that you don't want to make more than 1 or 2 requests per second to the same host, or else you will start adversely affecting their bandwidth. Is this a good rule of thumb? Along those lines, what would be the best values to put in the nutch config file to maximize speed of fetching
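
To make the rule of thumb concrete, here is a minimal sketch of a per-host delay, assuming a simple single-threaded fetch loop. The `HostThrottle` class and its parameters are hypothetical names invented for illustration; this is not how the Nutch fetcher is implemented and it does not correspond to any Nutch config property.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self.clock = clock          # injectable so the delay can be tested without waiting
        self.sleep = sleep
        self.last_request = {}      # host -> time of the most recent request

    def wait(self, url):
        """Block until the URL's host may be contacted again; return seconds waited."""
        host = urlsplit(url).hostname
        waited = 0.0
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited


if __name__ == "__main__":
    # One request per second per host, matching the lower bound in the message.
    throttle = HostThrottle(min_delay_seconds=1.0)
    for url in ["http://www.x.com/a", "http://www.x.com/b"]:
        print(url, "waited", round(throttle.wait(url), 2), "seconds")
```

Requests to different hosts are not delayed against each other, so overall fetch speed under this scheme comes from spreading work across many hosts rather than from hitting any single host faster.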

Re: nutch config files

2005-07-11 Thread Raymond Creel
Ah yes, thank you - this will work nicely! --- Howie Wang <[EMAIL PROTECTED]> wrote: > >What I really would like is a way to pass in the > >location of the config files (e.g. > nutch-default.xml, > >regex-urlfilter.txt, etc.) as an argument to the > nutch > >script, so that I can have multiple co

Re: nutch config files

2005-07-08 Thread Raymond Creel
used a mailing list in a while. --- Juho Mäkinen <[EMAIL PROTECTED]> wrote: > Take a look into Nutch Wiki FAQ here: > http://wiki.apache.org/nutch/FAQ > And find the Q/A for "How can I force fetcher to use > custom nutch-config?" > > - Juho Mäkinen, http://w

nutch config files

2005-07-07 Thread Raymond Creel
tions. Thanks, Raymond Creel

nutch config files

2005-07-07 Thread Raymond Creel
tions. Thanks, Raymond Creel