Could anyone point me to a link or document describing Nutch's indexing algorithm? I
haven't found much so far...
Regards
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
Hi,
the Hadoop IO system is read-only, so you cannot update a file.
However, I'm fairly sure you could hack the linkdb creation code and add
the URL filter that is already used for the crawldb.
Maybe this is already in the code; if not, it would be a good addition, since it
would keep spam links from taking effect.
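For anyone digging into this later, here is a minimal sketch of the kind of hack meant above: running the same URL filter plugins the crawldb already uses before a link is accepted into the linkdb. It assumes the org.apache.nutch.net.URLFilters wrapper; the real LinkDb internals may look different, so treat the class below as illustration only, not as the actual patch.

  // Sketch only: apply the configured urlfilter plugins to outgoing
  // links, mirroring what the crawldb already does for page URLs.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilters;

  public class LinkFilterSketch {
    private final URLFilters filters;

    public LinkFilterSketch(Configuration conf) {
      // Loads whatever urlfilter-* plugins are enabled in the config.
      this.filters = new URLFilters(conf);
    }

    // Returns true if the target URL survives the configured filters.
    public boolean accept(String toUrl) {
      try {
        // filter() returns null when a plugin rejects the URL.
        return filters.filter(toUrl) != null;
      } catch (Exception e) {
        // If filtering itself fails, drop the link to be safe.
        return false;
      }
    }
  }

Wired into the linkdb creation step, something like this would keep filtered-out (e.g. spam) targets from ever entering the link graph.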
Yes! It really works! I'm running the recrawl right now, and it is fetching
the pages that it hadn't fetched yet... It takes longer, but the final
result is what matters.
Thanks a lot!
On 7/7/06, Honda-Search Administrator <[EMAIL PROTECTED]> wrote:
This is typical if you are crawling only a few sites. I crawl 7 sites
nightly and often get this error. I changed my http.max.delays property
from 3 to 50 and it works without a problem. The crawl takes longer, but I
get almost all of the pages.
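For anyone who wants to make the same change, the property is overridden in conf/nutch-site.xml; the value of 50 below is just the number mentioned above, not a general recommendation, and the description text is paraphrased:

  <!-- conf/nutch-site.xml -->
  <property>
    <name>http.max.delays</name>
    <value>50</value>
    <description>How many times a fetcher thread will wait for a busy
    host (fetcher.server.delay seconds each time) before giving up on
    the page for this round.</description>
  </property>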
- Original Message -
From: "Lourival
Hi all!
I have a small doubt. My WebDB currently contains 779 pages with 899
links. When I use the segread command it also reports 779 pages in the one
segment. However, when I run a search, or when I open the index with Luke, the
maximum number of documents is 437. I've looked at the recrawl logs and w
Otis,
Check out the purge tool (bin/nutch purge).
It's easy to remove URLs individually or based on regular expressions, but
you'll need to learn Lucene query syntax to do it.
It will remove certain pages from the index, but it won't exclude them from
being recrawled the next time around. For that y
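To give a rough idea of the Lucene query syntax involved (the field names and URLs below are illustrative assumptions, not taken from the purge tool's documentation):

  url:"http://www.example.com/old-page.html"     (one specific page)
  host:example.com                               (every page indexed from that host)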
Hi Stefan,
thanks for your reply.
I've tried a depth of 20 and it works better; it can crawl almost all of the
pages. However, it has not crawled all of the pages yet.
I'll try a bigger depth, like 30, later...
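For reference, with the one-shot crawl tool the depth is just a command-line argument; the paths and the -topN value below are placeholders:

  bin/nutch crawl urls -dir crawl -depth 30 -topN 1000

Here urls is the directory holding the seed list, -dir is where the crawl data is written, -depth is the number of generate/fetch rounds, and -topN caps how many pages are fetched per round.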
Stefan Groschupf wrote:
Hi,
maybe you can try a much higher depth, something like 20?
However in
Thanks Stefan.
So one has to iterate and re-write the whole graph, and there is no way to just
modify it on the fly by, for example, removing specific links/pages?
Thanks,
Otis
- Original Message
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Friday, J
On Wed, 2006-07-05 at 20:32 -0700, Stefan Groschupf wrote:
> Crawler & Co. are command line tools.
> The servletcontainer is only used to deliver search results but you
> can use the servlet that just provides XML.
Ah, excellent. Thanks for letting me avoid reading the manual ;)
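For anyone reading this in the archives: assuming the stock Nutch web application is deployed, those XML results come from the OpenSearch servlet, along these lines (host, port, context path and parameter name are from memory, so double-check against your deployment):

  http://localhost:8080/opensearch?query=hadoop

This returns the same hits as the JSP search page, but as an OpenSearch (RSS-style) XML feed that other tools can consume.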
> > It would be