Re: nutch prune

2005-08-02 Thread Matthias Jaekle
Hi Jay, I think with the current version you can only prune segments. We once wrote a class to prune the db. Maybe you could use it and add a function to delete pages according to the urlfilter. I have attached our class. Matthias -- http://www.eventax.com - eventax GmbH http://www.um
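
A rough, self-contained sketch of the "delete pages according to the urlfilter" idea: it evaluates URLs against rules written in Nutch's regex-urlfilter syntax (+pattern accepts, -pattern rejects, first match wins, no match rejects) and reports which pages a db-prune pass would drop. This is not Matthias's attached class, and the actual WebDB read/delete calls are deliberately left out.

    // Sketch only: apply regex-urlfilter.txt style rules to a list of URLs
    // and report which ones a db prune would keep or drop. The WebDB
    // read/delete calls themselves are omitted.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class UrlFilterPruneSketch {

      static class Rule {
        boolean accept;
        Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
          this.accept = accept;
          this.pattern = pattern;
        }
      }

      public static void main(String[] args) throws Exception {
        // args[0] = rules file in regex-urlfilter.txt syntax, args[1..] = URLs
        List rules = new ArrayList();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = in.readLine()) != null; ) {
          line = line.trim();
          if (line.length() == 0 || line.startsWith("#")) continue;
          rules.add(new Rule(line.charAt(0) == '+',
                             Pattern.compile(line.substring(1))));
        }
        in.close();

        for (int i = 1; i < args.length; i++) {
          boolean keep = false;                      // no matching rule: reject
          for (int r = 0; r < rules.size(); r++) {
            Rule rule = (Rule) rules.get(r);
            if (rule.pattern.matcher(args[i]).find()) {
              keep = rule.accept;
              break;
            }
          }
          System.out.println((keep ? "keep  " : "prune ") + args[i]);
        }
      }
    }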

Fetcher delays - benchmarks

2005-08-02 Thread Christophe Noel
Hello, Following some discussions, developers' mails, ... I tried to get the best performance (pages/second) for the following case: - 120 web servers to crawl - 10 Mbit/s connection. I reached about 3 Mbit/s average fetching speed with the following parameters (impolite mode): - fetcher
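
The excerpt cuts off before the parameter list; for orientation, the knobs involved would be set in nutch-site.xml along these lines. The property names come from the nutch-default.xml of that era; the values are placeholders for illustration, not the settings Christophe reports, and the nutch-conf root element is an assumption about the 0.7-era config format.

    <!-- Illustrative nutch-site.xml overrides for an "impolite" benchmark
         run; values are placeholders, not the poster's actual settings. -->
    <nutch-conf>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>800</value>
      </property>
      <property>
        <name>fetcher.server.delay</name>
        <!-- politeness delay between requests to the same server, in seconds -->
        <value>0.0</value>
      </property>
    </nutch-conf>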

Re: Fetcher delays - benchmarks

2005-08-02 Thread Jay Pound
I'm able to easily saturate my 10 Mbit connection, but it takes a powerful computer. If your computer is not so powerful, try fetching with the -noParsing flag; it will defer the parsing work until later. Even a quad Pentium III Xeon 700 MHz with 4 GB of RAM can only saturate about 5 Mbit. I've used 3

[jira] Updated: (NUTCH-21) parser plugin for MS PowerPoint slides

2005-08-02 Thread Stephan Strittmatter (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-21?page=all ] Stephan Strittmatter updated NUTCH-21: -- Attachment: parse-mspowerpoint.zip Updated plugin sources with respect to the changed Nutch interface > parser plugin for MS PowerPoint slides > --

[jira] Updated: (NUTCH-20) Extract urls from plain texts

2005-08-02 Thread Stephan Strittmatter (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ] Stephan Strittmatter updated NUTCH-20: -- Description: Some parsers return no Outlinks, e.g. the Word parser. This class is able to extract (absolute) hyperlinks from a plain String (c

[jira] Updated: (NUTCH-20) Extract urls from plain texts

2005-08-02 Thread Stephan Strittmatter (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ] Stephan Strittmatter updated NUTCH-20: -- Attachment: OutlinkExtractor.java An anchor of "null" causes an NPE; changed the anchor to an empty String. > Extract urls from plain texts > ---
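
For readers without the attachment, the gist of such an extractor can be sketched with a plain regular expression; this is an illustrative stand-in, not the attached OutlinkExtractor.java. When the extracted URLs are turned into Nutch Outlink objects, the fix noted above applies: use an empty string rather than null for the anchor.

    // Illustrative sketch: pull absolute http/https/ftp URLs out of plain
    // text with a regular expression. Not the attached OutlinkExtractor.java.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PlainTextOutlinkSketch {

      private static final Pattern URL_PATTERN = Pattern.compile(
          "(?:https?|ftp)://[^\\s\"'<>)]+", Pattern.CASE_INSENSITIVE);

      /** Returns the absolute URLs found in the given plain text. */
      public static List extractUrls(String text) {
        List urls = new ArrayList();
        if (text == null) return urls;
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
          urls.add(m.group());
        }
        return urls;
      }

      public static void main(String[] args) {
        System.out.println(extractUrls(
            "See http://lucene.apache.org/nutch/ and ftp://example.org/pub/a.txt"));
      }
    }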

[jira] Created: (NUTCH-77) Project URL in JIRA

2005-08-02 Thread Stephan Strittmatter (JIRA)
Project URL in JIRA --- Key: NUTCH-77 URL: http://issues.apache.org/jira/browse/NUTCH-77 Project: Nutch Type: Task Reporter: Stephan Strittmatter Priority: Trivial The project URL on JIRA should be updated from http://incubator.apache.or

Re: Fetcher delays - benchmarks

2005-08-02 Thread Christophe Noel
OK, thank you very much. Something strange: I tried with 1600 threads () instead of 800 and it went from 2.5 Mbit/s (average) to 5 ... Isn't this parameter (1600 threads) really too big a number? Jay Pound wrote: I'm able to easily saturate my 10mbit connx, but it takes a powerful com

Re: Fetcher delays - benchmarks

2005-08-02 Thread Jay Pound
Yeah, I can saturate 10 Mbit with 150 threads; any more and webpages will drop. Check your error rate when downloading: if you're getting a lot of error pages, drop your threads back to a lower number. You're going to have to find the sweet spot for your connection! -J - Original Message -

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Gavin Thomas Nicol
On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote: Or you can derive the language from the host URL, if it includes a country code. That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen. It's har

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Ken Krugler
On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote: Or you can derive the language from the host URL, if it includes a country code. That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen. Yes - this i

Memory usage

2005-08-02 Thread Jay Pound
I'm testing an index of 30 million pages; it requires 1.5 GB of RAM to search using Tomcat 5. I plan on having an index with multiple billions of pages, but if this is to scale, then even with 16 GB of RAM I won't be able to have an index larger than 320 million pages? How can I distribute the memory requir

Re: Memory usage

2005-08-02 Thread Andy Liu
How do you figure that it takes 1.5 GB of RAM for 30M pages? I believe that when the Lucene indexes are opened, all the numbered *.f* files and the *.tii files are read into memory. The numbered *.f* files contain the length normalization values for each indexed field (1 byte per doc), and the .tii file
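
To make the arithmetic concrete, a back-of-the-envelope estimate along the lines Andy describes is sketched below. Only the one-byte-per-document-per-indexed-field norms rule comes from his explanation; the field count, unique-term count, and per-entry byte cost are made-up inputs, and the termIndexInterval figure anticipates Doug's reply below.

    // Rough estimate of the memory items Andy names: norms (*.f* files) at
    // 1 byte per document per indexed field, plus the .tii term index,
    // which holds roughly 1/termIndexInterval of all unique terms.
    public class IndexMemoryEstimate {
      public static void main(String[] args) {
        long docs = 30000000L;           // 30M pages, as in the thread
        int indexedFields = 5;           // assumption
        long uniqueTerms = 100000000L;   // assumption
        int termIndexInterval = 128;     // Lucene default of that era
        int bytesPerIndexEntry = 48;     // rough per-term overhead, assumption

        long normsBytes = docs * indexedFields;
        long tiiBytes = (uniqueTerms / termIndexInterval) * bytesPerIndexEntry;

        System.out.println("norms ~ " + normsBytes / (1024 * 1024) + " MB");
        System.out.println(".tii  ~ " + tiiBytes / (1024 * 1024) + " MB");
        // Raising indexer.termIndexInterval from 128 to 1024 (see Doug's
        // reply) shrinks the .tii share by roughly a factor of eight.
      }
    }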

Re: Memory usage

2005-08-02 Thread Doug Cutting
Try the following settings in your nutch-site.xml: io.map.index.skip = 7 and indexer.termIndexInterval = 1024. The first causes data files to use considerably less memory. The second affects index creation, so it must be set before you create the index you search. It's okay if your segment
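
Written out as nutch-site.xml properties, the two overrides Doug names look roughly like this (the nutch-conf root element is assumed from the 0.7-era config format):

    <nutch-conf>
      <property>
        <name>io.map.index.skip</name>
        <value>7</value>
        <!-- data files load only a fraction of their index entries,
             using considerably less memory -->
      </property>
      <property>
        <name>indexer.termIndexInterval</name>
        <value>1024</value>
        <!-- affects index creation: set it before building the index
             you intend to search -->
      </property>
    </nutch-conf>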

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Gavin Thomas Nicol
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote: Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can

Re: Memory usage2

2005-08-02 Thread Jay Pound
What's the bottleneck for the slow searching? I'm monitoring it and it's at about 57% CPU load when I'm searching; it takes about 50 seconds to bring up the results page the first time, then if I search for the same thing again it's much faster. Doug, can I trash my segments after they are indexed, I

Re: Memory usage2

2005-08-02 Thread Fredrik Andersson
Hi Jay! Why not use the "Google approach" and buy lots of cheap workstations/servers to distribute the search on? You can really get away cheaply these days, compared to high-end servers. Even if NDFS isn't fully up to par in 0.7-dev yet, you can still move your indices around to separate comput
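
As a sketch of what moving the indices to separate machines looks like in practice: the distributed-search support of that era reads a list of search back-ends from a search-servers.txt file in the directory named by searcher.dir, one host/port pair per line (file name and format stated here as an assumption; hosts and ports below are placeholders).

    # search-servers.txt -- placeholder hosts/ports, one back-end per line
    node1.example.com 9999
    node2.example.com 9999
    node3.example.com 9999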

Re: Memory usage2

2005-08-02 Thread Andy Liu
I have found that merging indexes does help performance significantly. If you're not using the cached pages for anything, I believe you can delete the /content directory for each segment and the engine should work fine (test before you try for real!) However, if you ever have to reindex the segme

RE: Memory usage2

2005-08-02 Thread EM
Why isn't 'analyze' supported anymore? -Original Message- From: Andy Liu [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 02, 2005 5:44 PM To: nutch-dev@lucene.apache.org Subject: Re: Memory usage2 I have found that merging indexes does help performance significantly. If you're not using

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Ken Krugler
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote: Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can use

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Gavin Thomas Nicol
On Aug 2, 2005, at 6:03 PM, Ken Krugler wrote: Thanks for your work in this area! I assume it's RFC 2070 :) Yes. :-) 1. Server doesn't provide any charset info. Very common in my experience. 2. Server provides incorrect charset info. a. Charset is a subset (e.g. 8859-1 vs. 1252)
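
A minimal sketch of the defensive handling these cases suggest: trust the declared charset when there is one, treat a declared ISO-8859-1 as windows-1252 (the superset such pages frequently really use), and fall back to a default when nothing usable is declared. The method and the UTF-8 fallback are illustrative assumptions, not Nutch's actual code.

    // Illustrative sketch, not Nutch's code: resolve a declared charset
    // defensively. Case 1 (no charset info) falls back to a default; case
    // 2a (declared subset, e.g. 8859-1 vs. 1252) is widened to windows-1252.
    import java.nio.charset.Charset;

    public class CharsetGuess {

      private static final String FALLBACK = "UTF-8";   // assumed default

      public static String resolve(String declared) {
        if (declared == null || declared.trim().length() == 0) {
          return FALLBACK;                               // no charset info at all
        }
        String name = declared.trim();
        if (name.equalsIgnoreCase("ISO-8859-1") || name.equalsIgnoreCase("8859-1")) {
          return "windows-1252";                         // superset of 8859-1
        }
        try {
          return Charset.forName(name).name();           // normalize known aliases
        } catch (Exception e) {
          return FALLBACK;                               // unknown or bogus label
        }
      }

      public static void main(String[] args) {
        System.out.println(resolve(null));               // UTF-8
        System.out.println(resolve("ISO-8859-1"));       // windows-1252
        System.out.println(resolve("Shift_JIS"));        // Shift_JIS
      }
    }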

My wishlist of 12 out of...

2005-08-02 Thread EM
I've been using Nutch for quite a while and reading this list constantly. I'll state some assumptions in this post about the way Nutch operates; if they are wrong, please excuse my ignorance. I'm using the interfaces extensively, so I will assume things behind them. I think I'm the average user:

Strange search results

2005-08-02 Thread Howie Wang
Hi, I've been noticing some strange search results recently. I seem to be running into two issues. 1. The fieldNorm for certain terms is unusually high for certain sites for anchors and titles, and they are usually just whole numbers (4.0, 5.0, etc.). I find this strange since the lengthNorm used to