segslice usually processes 200-300 records/sec on my machine (which is
otherwise quite fast, top of the line).
Is it just copying the segments minus the last part, or is some processing
required for each record?
Any advice on how it can be optimized?
It is possible to configure a Linux box (1Mb RAM) with 6000 client threads in
the Worker model. It is limited only by the amount of available RAM. I used such
a configuration in production: 6 Apache servers sustained 75000 concurrent
users performing 1 request per minute, 4kb HTML pages, load/stress te
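(Back-of-envelope, taking those figures at face value:
75000 users * 1 request/minute = 75000 requests/minute ~ 1250 requests/sec in total,
i.e. roughly 208 requests/sec per Apache server across the 6 boxes.)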
202443 Pages consumed: 13 (at index 13). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
If maxoutlinks is already specified in the XML config, why does
nutch bother c
We have network equipment limitations too: we can't reach more than
65000 threads over a single LAN card, and the JVM handles it fine (but it is
better to have multiple JVMs/processes, 100 threads each...)
65000 threads? What are you trying to fetch? The whole web?
Otis
65000/100 = 650 processes.
imes, what information is available to you is determined by the
decision of whoever designed the page, right?
If the page tries to be smart and 'determine' what the user wants to see,
well, if you don't own that webpage there isn't anything you can do.
Best regards,
EM
't
decreasing ;) I've encountered cases like this, and instead of manually
typing regexes to clean them off (which takes time) I'd strongly prefer an
automated solution, if possible.
Regards,
EM
r
crawler will run over any website with 50-500 threads and the default three
retry times, and the problem will solve itself. But can something
be done for the rest of us, please?
A simple 500 would really be appreciated.
Regards,
EM
Transbuerg Tian wrote:
I have the same conditions li
Doug Cutting wrote:
> (2) Will the dropped urls be picked up again in subsequent cycles of
> fetchlist/segment/fetch/updatedb?
They will be retried in the next cycle, up to db.fetch.retry.max.
After the next bin/nutch generate... or are you using 'cycle' for
something else?
EM
I'm currently fetching with 35 threads. The CPU load is about 5-10% (P4 3.0
HT), so parsing obviously isn't using many resources. Removing parsing would
therefore not speed up the fetching process either. If parsing (while fetching)
is removed (with a command-line argument), I'll probably tune the fetcher down t
So, just replace PDFBox-0.7.2-dev.jar in the plugin directory with
PDFBox-0.7.2-dev-20050825.jar (renaming the file, of course)?
Regards,
EM
-Original Message-
From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 25, 2005 6:28 AM
To: nutch-dev
as:
String content = parse.getText();
content += " ";
content += myTranslationFunctionToLatin(content);
doc.add(Field.Text("content", content));
Or would the last line be:
doc.add(Field.UnStored("content", content));
What's the difference with regard to the Field.* object?
Regards,
EM
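For what it's worth, a minimal sketch of the difference, assuming the Lucene
1.4-era Field factory methods (illustration only, not Nutch's indexing code):
Field.Text tokenizes, indexes AND stores the value, so the raw text can later be
read back out of the index; Field.UnStored tokenizes and indexes it but does not
store it, so it is searchable yet not retrievable from the index itself.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only (Lucene 1.4-era factories), not Nutch's indexing plugin code.
public class FieldSketch {
    public static Document storedVariant(String content) {
        Document doc = new Document();
        // Field.Text: tokenized, indexed AND stored, so the raw text can be
        // read back out of the index later.
        doc.add(Field.Text("content", content));
        return doc;
    }

    public static Document unstoredVariant(String content) {
        Document doc = new Document();
        // Field.UnStored: tokenized and indexed but NOT stored; the terms are
        // searchable, yet the original text is not kept in the index.
        doc.add(Field.UnStored("content", content));
        return doc;
    }
}

As far as I can tell, summaries in Nutch are built from parse_text rather than
from stored index fields, so for an extra searchable copy UnStored should
usually do.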
Should the following be happening?
Short description:
-fetch a bunch of pages.
-status: 5400 fetched, 27 errors
-fetch 22 more pages
-status: 5403 fetched, 27 errors
My regex-urlfilter excludes "jpg" and includes "?"
Long description:
050809 093001 fetching http:///.jpg?6351
050809 093001 fetchin
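One possible explanation for the jpg?6351 fetches, if the "?" include rule is
listed before the jpg exclude rule: as far as I remember, the regex URL filter
stops at the first rule that matches, so the query-string rule accepts the URL
before the jpg rule is ever consulted. A toy illustration of that
first-match-wins evaluation (not the actual RegexURLFilter code; the two rules
and their order here are assumptions):

import java.util.regex.Pattern;

// Toy first-match-wins rule evaluation; the rules and their order are assumed.
public class FirstMatchWins {
    private static final String[][] RULES = {
        { "+", ".*\\?.*" },     // include anything with a query string
        { "-", ".*\\.jpg.*" },  // exclude jpg
    };

    public static boolean accept(String url) {
        for (int i = 0; i < RULES.length; i++) {
            if (Pattern.matches(RULES[i][1], url)) {
                return "+".equals(RULES[i][0]);
            }
        }
        return false; // no rule matched: reject
    }

    public static void main(String[] args) {
        // Accepted, because the "?" rule matches first.
        System.out.println(accept("http://host/img.jpg?6351"));
    }
}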
Suppose we have to fetch 3 pages.
Page A is http://something/login.php
Page B is http://yyy/rrr/ which, when fetched, redirects to page A
Page C is http://yyy/ttt/ which, when fetched, redirects to page A
When fetching A, B, C the fetcher will fetch
A
B
A
C
A
Is there any way to prevent the r
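Not an answer about the fetcher itself, but the idea that would avoid the
repeats is simply remembering redirect targets already fetched in the run; a
minimal sketch of that idea in plain Java (not Nutch code, names are made up):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the idea only: the first fetch of A wins, later redirects to A
// are skipped for the rest of the run.
public class RedirectDeduper {
    private final Set seen = Collections.synchronizedSet(new HashSet());

    // Returns true only the first time a URL is offered.
    public boolean shouldFetch(String url) {
        return seen.add(url);
    }
}

With something like that, the example would fetch A, B, C once each: the
redirects from B and C both resolve to A, which shouldFetch() rejects after the
first time.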
ess and implement
these 12 and many, many more. Unfortunately, I'm a student, and by an ironic
cliche, my time goes mainly to school and working for food. A bit of time is
left, however, for playing with things I like, and I'm glad Nutch is one of them.
P.S. I don't know how appropriate it is to ask, but is anyone offering a paid
position for Nutch development?
Keep up the good work,
EM in Toronto.
Why isn't 'analyze' supported anymore?
-Original Message-
From: Andy Liu [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 02, 2005 5:44 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Memory usage2
I have found that merging indexes does help performance significantly.
If you're not using
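For anyone curious what the merge amounts to, here is a bare-bones sketch using
plain Lucene 1.4-era calls (the output path and analyzer are made up for the
example; this is not the Nutch merge tool itself):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Merge several existing index directories into one index.
public class MergeSketch {
    public static void main(String[] args) throws Exception {
        // args: one or more existing index directories to merge.
        IndexWriter writer =
            new IndexWriter("merged-index", new StandardAnalyzer(), true);
        Directory[] parts = new Directory[args.length];
        for (int i = 0; i < args.length; i++) {
            parts[i] = FSDirectory.getDirectory(args[i], false);
        }
        writer.addIndexes(parts); // copies and merges all part indexes
        writer.optimize();        // collapse to a single segment
        writer.close();
    }
}

If I remember right, Nutch's IndexMerger does this sort of thing for you; the
sketch is only to show what happens underneath.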
What should one do when encountering sites where nutch falls into recursion mode?
Currently I'm solving this by removing these sites with the regex filter,
but is anything under development currently?
By recursion I mean nutch fetching
.com// and on and on.
Any tricks to limit the folder depth?
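As a stop-gap, one way to limit folder depth is a simple check on the number of
path segments before a URL is accepted; a rough standalone sketch (the cutoff
of 10 is arbitrary, and this is not an existing Nutch plugin):

import java.net.MalformedURLException;
import java.net.URL;

// Reject URLs whose path nests deeper than MAX_DEPTH segments, to break
// endless /foo/foo/foo/ style loops. Standalone sketch, not a Nutch plugin.
public class DepthFilter {
    private static final int MAX_DEPTH = 10; // arbitrary cutoff

    public static boolean accept(String url) {
        try {
            String path = new URL(url).getPath();
            int depth = 0;
            for (int i = 0; i < path.length(); i++) {
                if (path.charAt(i) == '/') depth++;
            }
            return depth <= MAX_DEPTH;
        } catch (MalformedURLException e) {
            return false; // unparsable URL: reject
        }
    }
}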
Is there a chance that the ranking algorithm in Analyze would give higher
value to a subpage than the root domain page?
For example:
http://abc.com <- 34.432
http://abc.com/something.html <- 50
Is the above scenario possible, or does nutch always rank root pages
highest?
Regards,
EM
I'm trying to grasp something here and need a quick confirmation about the
following; a yes/no would suffice:
When searching and generating summaries, tomcat uses only:
1. /index
2. /parse_text
When retrieving the "cached" copy of the document, tomcat uses:
1. /parse_data
Are /fetcher and /contex
I'm using 0.7 from a few weeks ago.
I was fetching 204 456 pages, and nutch segread reports:
"segments\20050723140812 is corrupt, using only 207126 entries."
Here's what I have with ctrl-break:
Full thread dump Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode):
"MultiThreadedHttpC