RE: Impressive performance

2005-07-07 Thread Emilijan Mirceski
From my experience, when indexing, disk speed is the limiting factor once your computer has several GHz to work with. -Original Message- From: Vacuum Joe [mailto:[EMAIL PROTECTED] Sent: Thursday, July 07, 2005 11:23 PM To: nutch-user@lucene.apache.org Subject: Impressive performance I

nutch blocking

2005-07-05 Thread Emilijan Mirceski
I have a rather bad problem with nutch that I cannot get rid of lately. I read PLENTY online and offline about tomcat, and nutch, and java, but still I'm just stuck. So, if anyone has an idea how to solve it, please let me know. The problem is as follows: Used software: - Jakarta-tomcat-4.1.31 -

RE: Newbie questions

2005-07-05 Thread Emilijan Mirceski
Hello there, If I'm wrong on anything, somebody correct me please. Segments are there to store the pages you've downloaded. You can have the same pages downloaded in two or more segments. There is a default refresh setting (30 days, see conf directory, xml files). Let's assume that your crawl sco

regex url filter

2005-06-30 Thread Emilijan Mirceski
If in my regex-urlfilter: >> # skip URLs containing certain characters as probable queries, etc. >> [EMAIL PROTECTED] I skip '?' and '=', I will have more pages in my database. Is there any strong reason why this was disabled in the release version? (My segments have about 100 thousand pages
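To illustrate the kind of rule being discussed, here is a minimal Python sketch of a filter that rejects URLs containing '?' or '='. It is not Nutch's own filter code, and the rule in the message above is obscured in the archive, so the pattern, the function name `accept`, and the sample URLs are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a "skip probable queries" rule:
# any URL containing '?' or '=' is rejected.
QUERY_CHARS = re.compile(r"[?=]")


def accept(url: str) -> bool:
    """Return True if the URL contains none of the query-like characters."""
    return QUERY_CHARS.search(url) is None


# Made-up sample URLs, only to show the effect of the rule.
sample_urls = [
    "http://example.com/news/index.html",
    "http://example.com/view?id=3",
    "http://example.com/page;lang=en",
]
kept = [u for u in sample_urls if accept(u)]
```

With such a rule active, dynamic pages that are reachable only through query strings never enter the database, which is why relaxing it would yield more pages.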

RE: recursion: see recursion

2005-06-30 Thread Emilijan Mirceski
Problem solved by an appropriate regex query. The reason for the problem is some strange combination of Java code and URLs. -Original Message- From: Emilijan Mirceski [mailto:[EMAIL PROTECTED] Sent: Thursday, June 30, 2005 3:40 PM To: nutch-user@lucene.apache.org Subject: recursion: see
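The message does not say which regex was used. As a rough sketch of one way to catch this kind of crawl loop, the Python snippet below flags URLs in which the same path segment repeats three or more times in a row. The pattern, the threshold of three, and the name `looks_like_loop` are assumptions, and the test URL is only modeled on the one reported in this thread.

```python
import re

# Assumed heuristic: a slash-delimited segment that occurs three or more
# times consecutively is treated as a sign of a recursive link loop.
# The lookahead makes sure the last repeat ends at a segment boundary.
REPEATED_SEGMENT = re.compile(r"(/[^/]+)(?:\1){2,}(?=[/?#]|$)")


def looks_like_loop(url: str) -> bool:
    """Return True if some path segment repeats 3+ times back to back."""
    return REPEATED_SEGMENT.search(url) is not None


# URL modeled on the one in this thread (cleaned up for the example).
looping = (
    "http://www.idividi.com.mk/vesti/makedonija/Politika/315216/"
    "mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk"
)
normal = "http://www.idividi.com.mk/vesti/makedonija/Politika/315216/"
```

A fetcher could drop any URL for which `looks_like_loop` returns True before queuing it. A legitimate site with deliberately repeated directory names would be skipped as well, so the threshold is a trade-off.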

recursion: see recursion

2005-06-30 Thread Emilijan Mirceski
Lately, I'm receiving thousands of variations of the following: 050630 153456 fetching http://www.idividi.com.mk/vesti/makedonija/Politika/315216/mt.net.mk/mt.net. .k/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.mk/mt.net.m k/mt.net.mk/mt.net.mk 050630 153457 Response content length is not known 050630 153458