Re: nutch prune

2005-08-02 Thread Matthias Jaekle
Hi Jay, I think with the current version you can only prune segments. We once wrote a class to prune the db. Maybe you could use it and add a function to delete pages according to the urlfilter. I have attached our class. Matthias -- http://www.eventax.com - eventax GmbH http://www.um
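
A rough, self-contained sketch of the "delete pages according to the urlfilter" idea: it evaluates URLs against rules written in Nutch's regex-urlfilter syntax (+pattern accepts, -pattern rejects, first match wins, no match rejects) and reports which pages a db-prune pass would drop. This is not Matthias's attached class, and the actual WebDB read/delete calls are deliberately left out.

    // Sketch only: apply regex-urlfilter.txt style rules to a list of URLs
    // and report which ones a db prune would keep or drop. The WebDB
    // read/delete calls themselves are omitted.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class UrlFilterPruneSketch {

      static class Rule {
        boolean accept;
        Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
          this.accept = accept;
          this.pattern = pattern;
        }
      }

      public static void main(String[] args) throws Exception {
        // args[0] = rules file in regex-urlfilter.txt syntax, args[1..] = URLs
        List rules = new ArrayList();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = in.readLine()) != null; ) {
          line = line.trim();
          if (line.length() == 0 || line.startsWith("#")) continue;
          rules.add(new Rule(line.charAt(0) == '+',
                             Pattern.compile(line.substring(1))));
        }
        in.close();

        for (int i = 1; i < args.length; i++) {
          boolean keep = false;                      // no matching rule: reject
          for (int r = 0; r < rules.size(); r++) {
            Rule rule = (Rule) rules.get(r);
            if (rule.pattern.matcher(args[i]).find()) {
              keep = rule.accept;
              break;
            }
          }
          System.out.println((keep ? "keep  " : "prune ") + args[i]);
        }
      }
    }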

Fetcher delays - benchmarks

2005-08-02 Thread Christophe Noel
Hello, Following some discussions, developers' mails, ... I tried to get the best performance (pages/second) for the following case: - 120 web servers to crawl - 10 Mbit/s connection. I reached about 3 Mbit/s average fetching speed with the following parameters (impolite mode): - fetcher
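
The excerpt cuts off before the parameter list; for orientation, the knobs involved would be set in nutch-site.xml along these lines. The property names come from the nutch-default.xml of that era; the values are placeholders for illustration, not the settings Christophe reports, and the nutch-conf root element is an assumption about the 0.7-era config format.

    <!-- Illustrative nutch-site.xml overrides for an "impolite" benchmark
         run; values are placeholders, not the poster's actual settings. -->
    <nutch-conf>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>800</value>
      </property>
      <property>
        <name>fetcher.server.delay</name>
        <!-- politeness delay between requests to the same server, in seconds -->
        <value>0.0</value>
      </property>
    </nutch-conf>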

Re: Fetcher delays - benchmarks

2005-08-02 Thread Jay Pound
I'm able to easily saturate my 10 Mbit connection, but it takes a powerful computer. If your computer is not so powerful, try fetching with the -noParsing flag; it will defer the parsing work until later. Even a quad Pentium III Xeon 700 MHz with 4 GB of RAM can only saturate about 5 Mbit. I've used 3

[jira] Updated: (NUTCH-21) parser plugin for MS PowerPoint slides

2005-08-02 Thread Stephan Strittmatter (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-21?page=all ] Stephan Strittmatter updated NUTCH-21: -- Attachment: parse-mspowerpoint.zip Updated plugin sources with respect to the changed Nutch interface > parser plugin for MS PowerPoint slides > --

[jira] Updated: (NUTCH-20) Extract urls from plain texts

2005-08-02 Thread Stephan Strittmatter (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ] Stephan Strittmatter updated NUTCH-20: -- Description: Some parsers return no Outlinks, e.g. the Word parser. This class is able to extract (absolute) hyperlinks from a plain String (c

[jira] Updated: (NUTCH-20) Extract urls from plain texts

2005-08-02 Thread Stephan Strittmatter (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ] Stephan Strittmatter updated NUTCH-20: -- Attachment: OutlinkExtractor.java An anchor of "null" causes an NPE; changed the anchor to an empty String. > Extract urls from plain texts > ---
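
For readers without the attachment, the gist of such an extractor can be sketched with a plain regular expression; this is an illustrative stand-in, not the attached OutlinkExtractor.java. When the extracted URLs are turned into Nutch Outlink objects, the fix noted above applies: use an empty string rather than null for the anchor.

    // Illustrative sketch: pull absolute http/https/ftp URLs out of plain
    // text with a regular expression. Not the attached OutlinkExtractor.java.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PlainTextOutlinkSketch {

      private static final Pattern URL_PATTERN = Pattern.compile(
          "(?:https?|ftp)://[^\\s\"'<>)]+", Pattern.CASE_INSENSITIVE);

      /** Returns the absolute URLs found in the given plain text. */
      public static List extractUrls(String text) {
        List urls = new ArrayList();
        if (text == null) return urls;
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
          urls.add(m.group());
        }
        return urls;
      }

      public static void main(String[] args) {
        System.out.println(extractUrls(
            "See http://lucene.apache.org/nutch/ and ftp://example.org/pub/a.txt"));
      }
    }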

[jira] Created: (NUTCH-77) Project URL in JIRA

2005-08-02 Thread Stephan Strittmatter (JIRA)
Project URL in JIRA --- Key: NUTCH-77 URL: http://issues.apache.org/jira/browse/NUTCH-77 Project: Nutch Type: Task Reporter: Stephan Strittmatter Priority: Trivial The project URL on JIRA should be updated from http://incubator.apache.or

Re: Fetcher delays - benchmarks

2005-08-02 Thread Christophe Noel
OK, thank you very much. Something strange: I tried with 1600 threads () instead of 800 and it went from 2.5 Mbit/s (average) to 5 ... Isn't this parameter (1600 threads) really too big a number? Jay Pound wrote: I'm able to easily saturate my 10mbit connx, but it takes a powerful com

Re: Fetcher delays - benchmarks

2005-08-02 Thread Jay Pound
Yeah, I can saturate 10 Mbit with 150 threads; any more and webpages will drop. Check your error rate when downloading: if you're getting a lot of error pages, drop your threads back to a lower number. You're going to have to find the sweet spot for your connection! -J - Original Message -

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Gavin Thomas Nicol
On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote: Or you can derive the language from the host URL, if it includes a country code. That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen. It's har

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Ken Krugler
On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote: Or you can derive the language from the host URL, if it includes a country code. That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen. Yes - this i

Memory usage

2005-08-02 Thread Jay Pound
I'm testing an index of 30 million pages; it requires 1.5 GB of RAM to search using Tomcat 5. I plan on having an index with multiple billions of pages, but if this is to scale, then even with 16 GB of RAM I won't be able to have an index larger than 320 million pages? How can I distribute the memory requir

Re: Memory usage

2005-08-02 Thread Andy Liu
How do you figure that it takes 1.5 GB of RAM for 30M pages? I believe that when the Lucene indexes are opened, all the numbered *.f* files and the *.tii files are read into memory. The numbered *.f* files contain the length normalization values for each indexed field (1 byte per doc), and the .tii file
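
To make the arithmetic concrete, a back-of-the-envelope estimate along the lines Andy describes is sketched below. Only the one-byte-per-document-per-indexed-field norms rule comes from his explanation; the field count, unique-term count, and per-entry byte cost are made-up inputs, and the termIndexInterval figure anticipates Doug's reply below.

    // Rough estimate of the memory items Andy names: norms (*.f* files) at
    // 1 byte per document per indexed field, plus the .tii term index,
    // which holds roughly 1/termIndexInterval of all unique terms.
    public class IndexMemoryEstimate {
      public static void main(String[] args) {
        long docs = 30000000L;           // 30M pages, as in the thread
        int indexedFields = 5;           // assumption
        long uniqueTerms = 100000000L;   // assumption
        int termIndexInterval = 128;     // Lucene default of that era
        int bytesPerIndexEntry = 48;     // rough per-term overhead, assumption

        long normsBytes = docs * indexedFields;
        long tiiBytes = (uniqueTerms / termIndexInterval) * bytesPerIndexEntry;

        System.out.println("norms ~ " + normsBytes / (1024 * 1024) + " MB");
        System.out.println(".tii  ~ " + tiiBytes / (1024 * 1024) + " MB");
        // Raising indexer.termIndexInterval from 128 to 1024 (see Doug's
        // reply) shrinks the .tii share by roughly a factor of eight.
      }
    }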

Re: Memory usage

2005-08-02 Thread Doug Cutting
Try the following settings in your nutch-site.xml: io.map.index.skip = 7 and indexer.termIndexInterval = 1024. The first causes data files to use considerably less memory. The second affects index creation, so it must be set before you create the index you search. It's okay if your segment
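
Written out as nutch-site.xml properties, the two overrides Doug names look roughly like this (the nutch-conf root element is assumed from the 0.7-era config format):

    <nutch-conf>
      <property>
        <name>io.map.index.skip</name>
        <value>7</value>
        <!-- data files load only a fraction of their index entries,
             using considerably less memory -->
      </property>
      <property>
        <name>indexer.termIndexInterval</name>
        <value>1024</value>
        <!-- affects index creation: set it before building the index
             you intend to search -->
      </property>
    </nutch-conf>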

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Gavin Thomas Nicol
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote: Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can

Re: Memory usage2

2005-08-02 Thread Jay Pound
What's the bottleneck for the slow searching? I'm monitoring it and it's at about 57% CPU load when I'm searching; it takes about 50 seconds to bring up the results page the first time, then if I search for the same thing again it's much faster. Doug, can I trash my segments after they are indexed, I

Re: Memory usage2

2005-08-02 Thread Fredrik Andersson
Hi Jay! Why not use the "Google approach" and buy lots of cheap workstations/servers to distribute the search on? You can really get away cheaply these days, compared to high-end servers. Even if NDFS isn't fully up to par in 0.7-dev yet, you can still move your indices around to separate comput
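
As a sketch of what moving the indices to separate machines looks like in practice: the distributed-search support of that era reads a list of search back-ends from a search-servers.txt file in the directory named by searcher.dir, one host/port pair per line (file name and format stated here as an assumption; hosts and ports below are placeholders).

    # search-servers.txt -- placeholder hosts/ports, one back-end per line
    node1.example.com 9999
    node2.example.com 9999
    node3.example.com 9999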

Re: Memory usage2

2005-08-02 Thread Andy Liu
I have found that merging indexes does help performance significantly. If you're not using the cached pages for anything, I believe you can delete the /content directory for each segment and the engine should work fine (test before you try for real!) However, if you ever have to reindex the segme

RE: Memory usage2

2005-08-02 Thread EM
Why isn't 'analyze' supported anymore? -Original Message- From: Andy Liu [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 02, 2005 5:44 PM To: nutch-dev@lucene.apache.org Subject: Re: Memory usage2 I have found that merging indexes does help performance significantly. If you're not using

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Ken Krugler
On Aug 2, 2005, at 11:55 AM, Ken Krugler wrote: Yes - small chunks of untagged text are going to be a problem, no matter what you do. But if you're referring to query strings from an HTML page, the default is to use the encoding of the page (which from Nutch defaults to UTF-8). And you can use

Re: Detecting CJKV / Asian language pages

2005-08-02 Thread Gavin Thomas Nicol
On Aug 2, 2005, at 6:03 PM, Ken Krugler wrote: Thanks for your work in this area! I assume it's RFC 2070 :) Yes. :-) 1. Server doesn't provide any charset info. Very common in my experience. 2. Server provides incorrect charset info. a. Charset is a subset (e.g. 8859-1 vs. 1252)
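
A minimal sketch of the defensive handling these cases suggest: trust the declared charset when there is one, treat a declared ISO-8859-1 as windows-1252 (the superset such pages frequently really use), and fall back to a default when nothing usable is declared. The method and the UTF-8 fallback are illustrative assumptions, not Nutch's actual code.

    // Illustrative sketch, not Nutch's code: resolve a declared charset
    // defensively. Case 1 (no charset info) falls back to a default; case
    // 2a (declared subset, e.g. 8859-1 vs. 1252) is widened to windows-1252.
    import java.nio.charset.Charset;

    public class CharsetGuess {

      private static final String FALLBACK = "UTF-8";   // assumed default

      public static String resolve(String declared) {
        if (declared == null || declared.trim().length() == 0) {
          return FALLBACK;                               // no charset info at all
        }
        String name = declared.trim();
        if (name.equalsIgnoreCase("ISO-8859-1") || name.equalsIgnoreCase("8859-1")) {
          return "windows-1252";                         // superset of 8859-1
        }
        try {
          return Charset.forName(name).name();           // normalize known aliases
        } catch (Exception e) {
          return FALLBACK;                               // unknown or bogus label
        }
      }

      public static void main(String[] args) {
        System.out.println(resolve(null));               // UTF-8
        System.out.println(resolve("ISO-8859-1"));       // windows-1252
        System.out.println(resolve("Shift_JIS"));        // Shift_JIS
      }
    }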

My wishlist of 12 out of...

2005-08-02 Thread EM
I've been using Nutch for quite a while and reading this list constantly. I'll state some assumptions in this post about the way Nutch operates; if they are wrong, please excuse my ignorance. I'm using the interfaces extensively, so I will assume things behind them. I think I'm the average user:

Strange search results

2005-08-02 Thread Howie Wang
Hi, I've been noticing some strange search results recently. I seem to be running into two issues. 1. The fieldNorm for certain terms is unusually high for certain sites for anchors and titles, and they are usually just whole numbers (4.0, 5.0, etc.). I find this strange since the lengthNorm used to