Index algorithm

2006-07-07 Thread Lourival Júnior
Could anyone give me a link or document about Nutch's indexing algorithm? I haven't found many... Regards -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: [Nutch-general] Link db (traversal + modification)

2006-07-07 Thread Stefan Groschupf
Hi, the Hadoop IO system is read-only, so you cannot update a file in place. However, I'm sure you can hack the linkdb creation code and add the URL filter that is already used for the crawldb. Maybe this is already in the code; if not, it would be a good addition, since it would keep spam links from taking effect.
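
A hedged sketch of the filter-file format that the crawldb already uses (the exact file, e.g. conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt, and its default rules vary by Nutch version; example.com is a placeholder domain):

    # Rules are applied top to bottom; the first match wins.
    # Skip URLs containing characters that are often crawler traps or session junk
    -[?*!@=]
    # Accept pages on the sites we actually want in the link graph (placeholder domain)
    +^http://([a-z0-9]*\.)*example.com/
    # Reject everything else
    -.

If the same filters were applied during linkdb creation, spam links pointing outside the accepted set would simply never enter the link graph.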

Re: Number of pages different to number of indexed pages

2006-07-07 Thread Lourival Júnior
Yes! It really works! I'm executing the recrawl right now, and it is fetching the pages that it hadn't fetched yet... It takes longer, but the final result is what matters. Thanks a lot! On 7/7/06, Honda-Search Administrator <[EMAIL PROTECTED]> wrote: This is typical if you are crawling only a

Re: Number of pages different to number of indexed pages

2006-07-07 Thread Honda-Search Administrator
This is typical if you are crawling only a few sites. I crawl 7 sites nightly and often get this error. I changed my http.max.delays property from 3 to 50 and it works without a problem. The crawl takes longer, but I get almost all of the pages. - Original Message - From: "Lourival
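
For reference, a minimal sketch of that override as it would normally appear in conf/nutch-site.xml (the value 50 is taken from the message above; the comment paraphrases the property's intent):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.max.delays</name>
        <!-- how many times a fetcher thread will wait for a busy host before giving up on a page -->
        <value>50</value>
      </property>
    </configuration>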

Number of pages different to number of indexed pages

2006-07-07 Thread Lourival Júnior
Hi all! I have a small doubt. My WebDB currently contains 779 pages with 899 links. When I use the segread command it also reports 779 pages in one segment. However, when I run a search, or when I use the Luke tool, the maximum number of documents is 437. I've looked at the recrawl logs and w

Re: [Nutch-general] Link db (traversal + modification)

2006-07-07 Thread Honda-Search Administrator
Otis, check out the purge tool (bin/nutch purge). It's easy to remove URLs individually or based on regular expressions, but you'll need to learn Lucene query syntax to do it. It will remove certain pages from the index, but it won't exclude them from being recrawled the next time around. For that y
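
As an illustration of the Lucene query syntax involved, deletions are typically keyed on fields such as url or site; the exact field names depend on the indexing filters in use, so treat these as hypothetical examples rather than a guaranteed recipe:

    url:"http://www.example.com/old-page.html"    (a single page, matched by exact URL)
    site:forums.example.com                       (every document from one host, if a site field exists)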

Re: why i can't crawl all the linked pages in the specified page to crawl.

2006-07-07 Thread kevin
Hi Stefan, thanks for your reply. I've tried a depth of 20 and it works better; it can crawl almost all the pages. However, it has not crawled all the pages yet. I'll try a bigger depth like 30 later... Stefan Groschupf wrote: Hi, maybe you can try a much higher depth, something like 20? However in
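
For context, the depth is just an argument to the one-step crawl tool; a hedged example invocation (directory names are placeholders, and the available options differ slightly between Nutch versions):

    bin/nutch crawl urls -dir crawl -depth 20

Each additional level of depth adds another generate/fetch/update round, which is why deeper crawls pick up pages that shallower ones miss, at the cost of a longer run.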

Re: [Nutch-general] Link db (traversal + modification)

2006-07-07 Thread ogjunk-nutch
Thanks Stefan. So one has to iterate and re-write the whole graph, and there is no way to just modify it on the fly by, for example, removing specific links/pages? Thanks, Otis - Original Message From: Stefan Groschupf <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Friday, J

Re: Alternatives

2006-07-07 Thread karl wettin
On Wed, 2006-07-05 at 20:32 -0700, Stefan Groschupf wrote: > Crawler & Co. are command line tools. > The servlet container is only used to deliver search results, but you > can use the servlet that just provides XML. Ah, excellent. Thanks for letting me avoid reading the manual ;) > > It would be
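
For what it's worth, in the 0.8-era web application the XML results come from the OpenSearch servlet; assuming a default deployment, a request looks roughly like this (host, port, and context path are placeholders):

    http://localhost:8080/nutch/opensearch?query=nutch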