Re: [Nutch-general] RE: new location! nutch user meeting San Francisco

2006-05-16 Thread Stefan Groschupf
Hi, no agenda; see: http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1 Stefan

Re: robot exclusion portional of a document

2006-05-16 Thread Jérôme Charron
As far as I understand, /robots.txt designates which files may and may not be indexed by Nutch and other crawlers. However, is there a method by which a site may exclude only sections of a document? Some methods I've seen include: If there is no such feature and this is deemed useful, I would

query term for searching directories of a site?

2006-05-16 Thread Lance Birtcil
Is there any way to limit searches to a particular directory structure in an index using query terms? For instance, let's say that I have a single index created for my intranet site, a.b.c., and the site contains these directories: http://a.b.c/dir1/some/others http://a.b.c/dir2 Is there

Re: Generate/Fetch/Update - urgent issue?

2006-05-16 Thread Andrzej Bialecki
Lukas Vlcek wrote: Hi, I am using nutch0.8-dev. I have a small shell script for the generate/fetch/update cycle. I used the generate command with -topN 500. After crawling about 2000 pages I changed -topN to 3 (yes, three pages only) to see which pages are crawled. I found that generate/fetch/update cycl

Re: changing ranking

2006-05-16 Thread Andrzej Bialecki
Eugen Kochuev wrote: Hi guys, I have a catalogue of the sites where domains are ranked by human experts. Is it possible to tweak the score of pages belonging to the domains listed in the catalogue according to their catalogue rank? So, I'm interested in the ability to change scores of s

changing ranking

2006-05-16 Thread Eugen Kochuev
Hi guys, I have a catalogue of the sites where domains are ranked by human experts. Is it possible to tweak the score of pages belonging to the domains listed in the catalogue according to their catalogue rank? So, I'm interested in the ability to change scores of some urls. -- Best reg

robot exclusion portional of a document

2006-05-16 Thread Alexander E Genaud
Hello, As far as I understand, /robots.txt designates which files may and may not be indexed by Nutch and other crawlers. However, is there a method by which a site may exclude only sections of a document? The benefit is most evident in the search hit result description (snippets) which will o