In my experience, disk speed is the limiting factor when indexing, once
your computer has several GHz to work with.
-----Original Message-----
From: Vacuum Joe [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 07, 2005 11:23 PM
To: nutch-user@lucene.apache.org
Subject: Impressive performance
I
I have a rather bad problem with Nutch that I have not been able to get rid
of lately. I have read PLENTY, online and offline, about Tomcat, Nutch, and
Java, but I'm still stuck. So if anyone has an idea how to solve it, please
let me know.
The problem is as follows:
Used software:
- jakarta-tomcat-4.1.31
-
Hello there,
If I'm wrong about anything, somebody please correct me.
Segments are there to store the pages you've downloaded.
You can have the same pages downloaded in two or more segments.
There is a default refresh setting (30 days; see the XML files in the conf
directory).
Let's assume that your crawl sco
If in my regex-urlfilter:
>> # skip URLs containing certain characters as probable queries, etc.
>> [EMAIL PROTECTED]
I leave out '?' and '=', I will have more pages in my database.
Is there any strong reason why this was disabled in the release version?
(My segments have about 100 thousand pages
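
For anyone following along, here is a minimal sketch of how a "skip URLs containing certain characters" rule behaves. It is not Nutch's own filter class. The class name UrlFilterSketch, the rule "-[?=]", and the example.com URLs are made up for illustration, and the sketch assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is dropped.

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Hypothetical helper, not part of Nutch: applies '+'/'-' prefixed regex rules in order.
    public static boolean accepts(String[] rules, String url) {
        for (String line : rules) {
            // Ignore blank lines and '#' comments, as in a typical filter file.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            Pattern pattern = Pattern.compile(line.substring(1));
            // Assumption: a rule applies if its pattern is found anywhere in the URL.
            if (pattern.matcher(url).find()) {
                return line.charAt(0) == '+';
            }
        }
        return false; // assumption: no rule matched, so the URL is dropped
    }

    public static void main(String[] args) {
        String[] rules = {
            "# skip URLs containing certain characters as probable queries, etc.",
            "-[?=]", // illustrative rule: reject any URL containing '?' or '='
            "+."     // accept everything else
        };
        System.out.println(accepts(rules, "http://example.com/page.html")); // prints true
        System.out.println(accepts(rules, "http://example.com/view?id=7")); // prints false
    }
}
```

Taking '?' and '=' out of the character class lets query-style URLs through, which is why the page count goes up.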
Problem solved with an appropriate regex. The cause of the problem is some
strange combination of Java code and URLs.
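
The exact regex is not given here, so the following is only a sketch of one way to catch URLs like the ones quoted in the original message. The class name LoopUrlSketch, the pattern, and the threshold of three consecutive copies of a path segment are all assumptions, not the rule that was actually used.

```java
import java.util.regex.Pattern;

public class LoopUrlSketch {
    // A path segment ("/name") followed directly by at least two more copies of
    // itself, ending at a segment boundary. Three repeats is an arbitrary threshold.
    private static final Pattern REPEATED_SEGMENT =
            Pattern.compile("(/[^/]+)\\1{2,}(?=/|$)");

    // Hypothetical helper: true if the URL contains such a run of repeated segments.
    public static boolean looksRecursive(String url) {
        return REPEATED_SEGMENT.matcher(url).find();
    }

    public static void main(String[] args) {
        String looping = "http://www.idividi.com.mk/vesti/makedonija/Politika/315216"
                + "/mt.net.mk/mt.net.mk/mt.net.mk";
        System.out.println(looksRecursive(looping));                                   // prints true
        System.out.println(looksRecursive("http://example.com/news/news/index.html")); // prints false
    }
}
```

A URL filter could reject anything for which looksRecursive returns true. Legitimate URLs that repeat a segment only twice still pass.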
-----Original Message-----
From: Emilijan Mirceski [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 30, 2005 3:40 PM
To: nutch-user@lucene.apache.org
Subject: recursion: see
Lately, I've been getting thousands of variations of the following:
050630 153456 fetching
http://www.idividi.com.mk/vesti/makedonija/Politika/315216/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk
050630 153457 Response content length is not known
050630 153458