why is segslice so slow?

2005-10-15 Thread EM
segslice usually processes 200-300 records/sec on my machine (which is quite fast for everything else; top of the line). Is it just copying the segments minus the last part, or is some processing required for each record? Any advice on how it can be optimized?

Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-14 Thread EM
It is possible to configure a Linux box (1Mb RAM) with 6000 client threads in the Worker model. It is limited only by the amount of available RAM. I used such a configuration in production: 6 Apache servers sustained 75000 concurrent users performing 1 request per minute, 4kb HTML pages, load/stress te
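As a back-of-the-envelope check on the figures quoted above (the per-server split assumes load was spread evenly across the 6 Apache servers, which the mail does not state):

```java
// Sanity check of the claimed sustained load: 75,000 users, each issuing
// 1 request per minute, served by 6 Apache servers.
public class LoadMath {
    // Aggregate request rate across the whole cluster, in requests/second.
    static double totalRequestsPerSecond(int users, double reqPerMinute) {
        return users * reqPerMinute / 60.0;
    }

    public static void main(String[] args) {
        double total = totalRequestsPerSecond(75_000, 1.0); // 1250 req/s in aggregate
        double perServer = total / 6;                       // ~208 req/s per server, if spread evenly
        System.out.println(total + " req/s total, " + perServer + " req/s per server");
    }
}
```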

suspicious outlink count

2005-10-12 Thread EM
202443 Pages consumed: 13 (at index 13). Links fetched: 233386. 202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/]. 202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315. If there is maxoutlinks already specified in the xml config, why does nutch bother c

Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-12 Thread EM
We have network equipment limitations too; we can't reach more than 65000 threads over a single LAN card, and the JVM handles it fine (but it is better to have multiple JVMs/processes, 100 threads each...) 65000 threads? What are you trying to fetch? The whole web? Otis 65000/100 = 650 processes.

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread EM
imes, what information is available to you is determined by the decision of whoever designed the page, right? If the page tries to be smart and 'determine' what the user wants to see, well, if you don't own that webpage there isn't anything you can do. Best regards, EM

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread EM
't decreasing ;) I've encountered cases like this, and instead of manually typing regexes to clean them off (which takes time) I'd strongly prefer an automated solution if possible. Regards, EM

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread EM
r crawler will run over any website with 50-500 threads and the default three retries, and the problem will sort itself out. But can something be done for the rest of us, please? A simple 500 would really be appreciated.. Regards, EM Transbuerg Tian wrote: I have the same conditions li

Re: how to deal with large/slow sites

2005-09-12 Thread EM
Doug Cutting wrote: > (2) Will the dropped urls be picked up again in subsequent cycles of fetchlist/segment/fetch/updatedb? They will be retried in the next cycle, up to db.fetch.retry.max. After the next bin/nutch generate... or are you using 'cycle' for something else? EM
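The retry behavior described above can be sketched as simple per-URL accounting (a hypothetical model, not the actual Nutch updatedb code; the cap of 3 is an assumed value for db.fetch.retry.max):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-URL retry budgeting in the spirit of
// db.fetch.retry.max: a failed URL is eligible for regeneration in later
// generate/fetch/updatedb cycles until its failure count exceeds the cap.
public class RetryBudget {
    private final Map<String, Integer> failures = new HashMap<>();
    private final int max;

    RetryBudget(int max) { this.max = max; }

    // Record a failed fetch; returns true if the URL may still be retried
    // in a subsequent cycle, false once the budget is exhausted.
    boolean recordFailure(String url) {
        int n = failures.merge(url, 1, Integer::sum);
        return n <= max;
    }

    public static void main(String[] args) {
        RetryBudget b = new RetryBudget(3);
        String u = "http://example.com/flaky";
        System.out.println(b.recordFailure(u)); // 1st failure: still retried
        System.out.println(b.recordFailure(u)); // 2nd failure: still retried
        System.out.println(b.recordFailure(u)); // 3rd failure: still retried
        System.out.println(b.recordFailure(u)); // 4th failure: dropped
    }
}
```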

RE: fetcher question: why multithreaded?

2005-09-05 Thread EM
I'm currently fetching with 35 threads. The CPU load is about 5-10% (P4 3.0 HT). Parsing obviously isn't using many resources. Removing parsing also would not speed up the fetching process. If parsing (while fetching) is removed (with a command line argument), I'll probably tune the fetcher down t

RE: [jira] Commented: (NUTCH-85) pdf parser caused fetcher hangs.

2005-08-25 Thread EM
So, just replace PDFBox-0.7.2-dev.jar from the plugin directory with the PDFBox-0.7.2-dev-20050825.jar (Renaming the file of course.) ? Regards, EM -Original Message- From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED] Sent: Thursday, August 25, 2005 6:28 AM To: nutch-dev

Field.Text vs Field.UnStored

2005-08-11 Thread EM
as: String content = parse.getText(); content += " "; content += myTranslationFunctionToLatin(content); doc.add(Field.Text("content", content)); Or would the last line be: doc.add(Field.UnStored("content", content)); What's the difference with regard to the Field.* object? Regards, EM

strange url counting in the fetcher

2005-08-09 Thread EM
Should the following be happening? Short description: -fetch bunch of pages. -status: 5400 fetched, 27 errors -fetch 22 more pages -status: 5403 fetched, 27 errors My regex-urlfilter excludes "jpg" and includes "?" Long description: 050809 093001 fetching http:///.jpg?6351 050809 093001 fetchin
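One plausible explanation for a `.jpg?6351` URL slipping through is first-match-wins rule ordering: if the include rule for "?" precedes the exclude rule for "jpg", a URL containing both matches the include first. This is a hypothetical toy model of that ordering behavior, not the actual Nutch regex-urlfilter code, and the two rules are assumptions about the poster's config:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Toy first-match-wins URL filter: rules are checked in insertion order,
// and the first pattern that matches decides accept/reject.
public class FirstMatchFilter {
    private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

    void addRule(String regex, boolean accept) {
        rules.put(Pattern.compile(regex), accept);
    }

    // Unmatched URLs are rejected by default.
    boolean accept(String url) {
        for (Map.Entry<Pattern, Boolean> e : rules.entrySet()) {
            if (e.getKey().matcher(url).find()) return e.getValue();
        }
        return false;
    }

    public static void main(String[] args) {
        FirstMatchFilter f = new FirstMatchFilter();
        f.addRule("\\?", true);     // include URLs with '?'...
        f.addRule("\\.jpg", false); // ...so this exclude never sees ".jpg?6351"
        System.out.println(f.accept("http://example.com/a.jpg?6351")); // the '?' rule wins
        System.out.println(f.accept("http://example.com/a.jpg"));      // only the jpg rule matches
    }
}
```

Swapping the two rules would make the jpg exclude win for such URLs.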

fetching redirect bug?

2005-08-05 Thread EM
Suppose we have to fetch 3 pages. Page A is http://something/login.php Page B is http://yyy/rrr/ which, when fetched, redirects to page A Page C is http://yyy/ttt/ which, when fetched, redirects to page A When fetching A, B, C the fetcher will fetch A B A C A Is there any way to prevent the r
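One way to avoid re-fetching the shared redirect target would be to remember URLs already fetched in the current run and skip redirects that land on them. This is a minimal sketch of that idea under the scenario above, not the actual Nutch fetcher logic:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch: track URLs fetched in this run, and skip a redirect
// target that has already been fetched (so A is fetched only once).
public class RedirectDedup {
    private final Set<String> fetched = new HashSet<>();

    // Returns true if the URL still needs fetching, false if already seen.
    boolean shouldFetch(String url) {
        return fetched.add(url);
    }

    public static void main(String[] args) {
        RedirectDedup d = new RedirectDedup();
        System.out.println(d.shouldFetch("http://something/login.php")); // page A: fetch
        System.out.println(d.shouldFetch("http://yyy/rrr/"));            // page B: fetch
        System.out.println(d.shouldFetch("http://something/login.php")); // B's redirect target: skip
        System.out.println(d.shouldFetch("http://yyy/ttt/"));            // page C: fetch
        System.out.println(d.shouldFetch("http://something/login.php")); // C's redirect target: skip
    }
}
```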

My wishlist of 12 out of...

2005-08-02 Thread EM
ess and implement these 12 and many, many more. Unfortunately, I'm a student, and by an ironic cliche, time goes mainly to school and working for food. A bit time is left however, for playing with things I like, and I'm glad Nutch is one of them. P.S. I don't know how appropriate is to ask, but, anyone offering a paid position for Nutch development? Keep up the good work, EM in Toronto.

RE: Memory usage2

2005-08-02 Thread EM
Why isn't 'analyze' supported anymore? -Original Message- From: Andy Liu [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 02, 2005 5:44 PM To: nutch-dev@lucene.apache.org Subject: Re: Memory usage2 I have found that merging indexes does help performance significantly. If you're not using

recursion: see recursion

2005-07-29 Thread EM
What to do when encountering sites where nutch falls into recursion mode? Currently I'm solving this by removing these sites with the regex filter, but is anything under development currently? By recursion I mean nutch fetching .com// and on and on. Any tricks to limit the folder depth?
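One trick for capping folder depth is simply to count path separators and reject anything nested too deep (the same effect can be had with an exclude rule in the regex URL filter). A minimal sketch, where the threshold of 5 is an arbitrary assumption:

```java
// Hypothetical depth cap: reject URLs whose path nests deeper than a
// fixed number of directories, to break out of recursive URL traps.
public class DepthFilter {
    // Count '/' separators after stripping the scheme.
    static int pathDepth(String url) {
        String rest = url.replaceFirst("^[a-z]+://", "");
        int depth = 0;
        for (char c : rest.toCharArray()) {
            if (c == '/') depth++;
        }
        return depth;
    }

    static boolean accept(String url, int maxDepth) {
        return pathDepth(url) <= maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/a/b/c.html", 5));      // shallow: keep
        System.out.println(accept("http://example.com/a/a/a/a/a/a/a/", 5));  // trap-like: reject
    }
}
```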

ranking algorithm

2005-07-28 Thread EM
Is there a chance that the ranking algorithm in Analyze would give higher value to a subpage than the root domain page? For example: http://abc.com <- 34.432 http://abc.com/something.html <- 50 Is the above scenario possible, or does nutch always rank root pages highest? Regards, EM

whats used from the segments dir when searching

2005-07-25 Thread EM
I'm trying to grasp something here, I need a quick confirmation about the following, a yes/no would suffice: When searching and generating summaries, tomcat uses only: 1. /index 2. /parse_text When retrieving the "cached" copy of the document, tomcat uses: 1. /parse_data Are /fetcher and /contex

fetcher blocked

2005-07-24 Thread EM
I'm using 0.7 from a few weeks ago. I was fetching 204456 pages, and bin/nutch segread reports: "segments\20050723140812 is corrupt, using only 207126 entries." Here's what I have with ctrl-break: Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode): "MultiThreadedHttpC