Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-04 Thread Andrzej Bialecki
Enzo Michelangeli wrote: [Original Message from Andrzej Bialecki, sent Monday, June 04, 2007 1:31 AM] Enzo Michelangeli wrote: In my case (with Nutch 0.8), it seems not: I set it to 500, and the fetcher still saturates the 1.5 Mbit/s link... Is it supposed to

Re: [Nutch-general] Error with the inject command

2007-06-04 Thread Vishal Shah
Hi Berlin, Nutch needs a file called urls.txt inside the directory that you are passing to the inject command. Try renaming the urls file to urls.txt. Also, are you using the local FS or hadoop dfs? If it's the latter, you'll have to put your dmoz directory on the hadoop fs. -vishal.

[Nutch-general] Number of Pages

2007-06-04 Thread carmmello
As I indexed about 600 sites with Nutch 0.9, I noticed that at least one of them was showing fewer results than expected. This site was www.nrc.gov. As a test I tried to index only the NRC site, allowing only internal links in the site.xml conf. file, using crawl-urlfilter.txt with

Re: [Nutch-general] Nutch 0.9 and Crawl-Delay

2007-06-04 Thread Ken Krugler
Hi Lutz, I have had problems with a Nutch-based robot during the last 12 hours, which I have now solved by banning this particular bot from my server (not Nutch completely for the moment). The ilial bot, which created considerable load on my server, was using the latest Nutch version - v0.9 -

[Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-04 Thread Enzo Michelangeli
I have a question about the loading mechanism of plugin classes. I'm working with a custom URLFilter, and I need a singleton object loaded and initialized by the first instance of the URLFilter, and shared by other instances (e.g., instantiated by other threads). I was assuming that the
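
For readers following this thread, the pattern being asked about (one object initialized by whichever filter instance comes first, then shared by all later instances and threads) can be sketched with the initialization-on-demand holder idiom. This is only an illustration under stated assumptions: `SharedResource` and `MyFilter` are hypothetical names, and `MyFilter` does not implement the real Nutch `URLFilter` interface. Note that static state in the JVM is per class loader, so if the plugin class is loaded by more than one class loader, each loader gets its own instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedResourceDemo {

    /** Hypothetical shared object; stands in for whatever the filter needs. */
    static final class SharedResource {
        // Counts constructor calls so the demo can show initialization happens once.
        static final AtomicInteger INIT_COUNT = new AtomicInteger();

        private SharedResource() {
            INIT_COUNT.incrementAndGet();
        }

        // The JVM initializes Holder lazily and thread-safely on first access.
        private static final class Holder {
            static final SharedResource INSTANCE = new SharedResource();
        }

        static SharedResource get() {
            return Holder.INSTANCE;
        }
    }

    /** Hypothetical filter; a real one would implement Nutch's URLFilter. */
    static final class MyFilter {
        private final SharedResource shared = SharedResource.get();

        SharedResource shared() {
            return shared;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Several threads each create their own filter instance.
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> new MyFilter());
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("initializations: " + SharedResource.INIT_COUNT.get());
    }
}
```

The holder idiom avoids explicit locking; whether it yields a single instance inside Nutch depends on how the plugin's classes are loaded, which this sketch does not model.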

[Nutch-general] field collapsing impl

2007-06-04 Thread Yonik Seeley
Some people over in Solr-land are developing generic field collapsing https://issues.apache.org/jira/browse/SOLR-236 and I thought I should check if you guys have any good ideas about it. How does Nutch implement this for grouping results by site (like google does)? -Yonik
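
As context for the question, the general idea of collapsing results by site can be sketched as a single pass over an already-ranked hit list that keeps at most N hits per site. This is not Nutch's or Solr's actual implementation; `Hit`, `collapse`, and `maxPerSite` are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SiteCollapse {

    /** Hypothetical search hit: the site it belongs to and its URL. */
    static final class Hit {
        final String site;
        final String url;

        Hit(String site, String url) {
            this.site = site;
            this.url = url;
        }
    }

    /**
     * Keeps at most maxPerSite hits for each site, preserving the
     * original ranking order of the hits that survive.
     */
    static List<Hit> collapse(List<Hit> ranked, int maxPerSite) {
        Map<String, Integer> seenPerSite = new HashMap<>();
        List<Hit> out = new ArrayList<>();
        for (Hit h : ranked) {
            // merge returns the updated count for this site.
            int count = seenPerSite.merge(h.site, 1, Integer::sum);
            if (count <= maxPerSite) {
                out.add(h);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Hit> ranked = new ArrayList<>();
        ranked.add(new Hit("a.example", "http://a.example/1"));
        ranked.add(new Hit("a.example", "http://a.example/2"));
        ranked.add(new Hit("b.example", "http://b.example/1"));
        ranked.add(new Hit("a.example", "http://a.example/3"));
        for (Hit h : collapse(ranked, 2)) {
            System.out.println(h.url);
        }
    }
}
```

A real engine also has to decide how to report the hidden hits and how to fetch enough raw hits to fill a page after collapsing; the sketch leaves both out.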

Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-04 Thread Enzo Michelangeli
Additionally, the shutDown() method (that shadows the one in org.apache.nutch.plugin.Plugin) appears never to be called, even if System.runFinalizersOnExit(true) (which is deprecated as dangerous) was previously invoked. The only way of having my shutdown code executed seems to be to place it
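
The snippet is cut off before it names the workaround the author settled on. Independently of that, one standard JVM mechanism for running cleanup code at exit is a shutdown hook registered with `Runtime.addShutdownHook`. The sketch below only assumes plain Java; `ShutdownHookDemo` and `registerCleanup` are hypothetical names, and nothing here is taken from the Nutch plugin API.

```java
public class ShutdownHookDemo {

    /**
     * Registers the given cleanup action to run when the JVM shuts down
     * normally (or on Ctrl-C). Returns the hook thread so the caller can
     * remove it again with Runtime.removeShutdownHook if needed.
     */
    static Thread registerCleanup(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "plugin-cleanup");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerCleanup(() -> System.out.println("releasing shared resources"));
        System.out.println("working");
        // The hook runs after main returns and the JVM begins shutting down.
    }
}
```

Shutdown hooks do not run if the process is killed forcibly, so they suit best-effort cleanup rather than work that must always complete.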

[Nutch-general] Complex problem of recrawling economically

2007-06-04 Thread Manoharam Reddy
Hi, I am trying to solve a problem but I am unable to find any feature in Nutch that lets me solve this problem. Let's say in my intranet there are 1000 sites. Sites 1 to 100 have pages that are never going to change, i.e. they are static. So I don't need to crawl them again and again. But