Enzo Michelangeli wrote:
----- Original Message ----- From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Monday, June 04, 2007 1:31 AM
Enzo Michelangeli wrote:
In my case (with Nutch 0.8), it seems not: I set it to 500, and the
fetcher still saturates the 1.5 Mbit/s link... Is it supposed to
Hi Berlin,
Nutch needs a file called urls.txt inside the directory that you are
passing to the inject command. Try renaming the urls file to urls.txt.
Also, are you using the local FS or hadoop dfs? If it's the latter, you'll
have to put your dmoz directory on the hadoop fs.
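For reference, the usual sequence looks roughly like this (a sketch; "crawl/crawldb" and the seed directory name are just placeholders):

    # local filesystem: point inject at a directory containing urls.txt
    bin/nutch inject crawl/crawldb urls

    # with Hadoop DFS: copy the seed directory into DFS first
    bin/hadoop dfs -put urls urls
    bin/nutch inject crawl/crawldb urls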
-vishal.
While indexing about 600 sites with Nutch 0.9, I noticed that at least one of
them was showing fewer results than expected. That site was www.nrc.gov. As a
test I tried to index only the NRC site, allowing only internal links in the
site.xml conf file, using crawl-urlfilter.txt with
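(For reference, a crawl-urlfilter.txt restricted to the NRC site typically contains rules along these lines; this is a sketch in the standard regex-urlfilter syntax, not the exact rules from the message above:)

    # accept anything on nrc.gov
    +^http://([a-z0-9]*\.)*nrc.gov/
    # reject everything else
    -.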
Hi Lutz,
I have had problems with a Nutch-based robot during the last 12 hours,
which I have now solved by banning this particular bot from my server
(not Nutch as a whole, for the moment). The ilial bot, which created
considerable load on my server, was using the latest Nutch version -
v0.9 -
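(If the bot honors robots.txt, which a well-behaved Nutch-based crawler should, a user-agent ban there is the lightest-weight option. A sketch, assuming the bot identifies itself with the token "ilial", which is a guess on my part:)

    User-agent: ilial
    Disallow: /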
I have a question about the loading mechanism of plugin classes. I'm working
with a custom URLFilter, and I need a singleton object loaded and
initialized by the first instance of the URLFilter, and shared by other
instances (e.g., ones instantiated by other threads). I was assuming that the
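A minimal sketch of what I have in mind: a lazily initialized object held in a static field of the filter class (the class and helper names below are made up for illustration). Whether this survives Nutch's per-plugin classloaders is exactly what I'm unsure about:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    public class MyURLFilter implements URLFilter {

      // Shared state, created once by whichever thread gets here first.
      private static volatile SharedResource shared;

      private Configuration conf;

      private static SharedResource getShared(Configuration conf) {
        if (shared == null) {
          synchronized (MyURLFilter.class) {
            if (shared == null) {
              shared = new SharedResource(conf);  // expensive one-time init
            }
          }
        }
        return shared;
      }

      // Return the URL to keep it, or null to filter it out.
      public String filter(String url) {
        return getShared(conf).accepts(url) ? url : null;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }

      // Hypothetical shared object; stands in for whatever needs one-time loading.
      static class SharedResource {
        SharedResource(Configuration conf) { /* load data once */ }
        boolean accepts(String url) { return true; }
      }
    }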
Some people over in Solr-land are developing generic field collapsing
https://issues.apache.org/jira/browse/SOLR-236
and I thought I should check whether you guys have any good ideas about it.
How does Nutch implement this for grouping results by site (like Google does)?
-Yonik
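(I can't speak to Nutch's exact code, but conceptually the grouping can be done while walking the ranked hits: keep at most N hits per site key and skip the rest. A rough sketch; the class and method names are illustrative, not Nutch's API:)

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Keeps at most maxPerSite hits per site while preserving rank order.
    public class SiteCollapser {

      // Each hit is {url, site}; the input list is assumed sorted by score.
      public static List<String[]> collapse(List<String[]> rankedHits, int maxPerSite) {
        Map<String, Integer> perSite = new HashMap<String, Integer>();
        List<String[]> out = new ArrayList<String[]>();
        for (String[] hit : rankedHits) {
          String site = hit[1];
          Integer seen = perSite.get(site);
          int count = (seen == null) ? 0 : seen.intValue();
          if (count < maxPerSite) {
            out.add(hit);
            perSite.put(site, count + 1);
          }
        }
        return out;
      }
    }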
Additionally, the shutDown() method (which overrides the one in
org.apache.nutch.plugin.Plugin) appears never to be called, even when
System.runFinalizersOnExit(true) (which is deprecated as dangerous) has
previously been invoked.
The only way of having my shutdown code executed seems to be to place it
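One alternative I have not fully tested: registering a JVM shutdown hook from the plugin's startUp(), roughly as below (a sketch; whether the hook fires on every Nutch exit path is an open question):

    public class PluginCleanup {
      // Call once, e.g. from the plugin's startUp(); the hook runs when the JVM exits.
      public static void register(final Runnable cleanup) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
          public void run() {
            cleanup.run();  // e.g. close connections, flush caches
          }
        });
      }
    }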
Hi,
I am trying to solve a problem, but I cannot find any feature in Nutch
that addresses it.
Let's say in my intranet there are 1000 sites.
Sites 1 to 100 have pages that are never going to change, i.e. they
are static. So I don't need to crawl them again and again. But
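(One knob that may help here is the global re-fetch interval, which in Nutch 0.8/0.9 is, if memory serves, db.default.fetch.interval, expressed in days. A sketch of a nutch-site.xml override; the 90-day value is just an example, and note that it applies to every page, not per site:)

    <property>
      <name>db.default.fetch.interval</name>
      <value>90</value>
      <description>Default number of days between re-fetches of a page.</description>
    </property>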