Thank you, Dogacan, for your very prompt reply (I was truly amazed, thanks).
I would like to point out, however, that apart from the very slow behaviour
of the fetcher (it reminds me of when version 0.8 was launched), the fetch
phases seem to end with hung threads, and the fetcher also seems, sometimes,
not to respect the "topN" choice made. It may be the case that the fetcher 2
is optimized for someone using several servers, but for a single server (and
I think this is a very large portion of Nutch users), it does not work very
well. At least, that was the result of a couple of experiments I did. Let me
remind you once again that with the previous version (including some recent
nightly builds), everything worked quite well.
Thanks again for your attention; I really wish Nutch great success.
----- Original Message -----
From: "Doğacan Güney" <[email protected]>
To: <[email protected]>
Sent: Wednesday, April 01, 2009 4:54 PM
Subject: Re: Nutch 1.0 experience
On Wed, Apr 1, 2009 at 22:47, consultas <[email protected]> wrote:
Hi,
I have been using Nutch for some years now. I am using it under Cygwin,
with Windows XP, with 2GB memory, nominal bandwidth 6 Megs, using a single
server, with pages in the range of 300,000 for a vertical semi-production
engine. I use 60 threads, using the crawl method for the initial crawl and
ending up with the whole-web method. Until the last release, in the fetching
phase, I had on my screen a steadily rolling list of the pages being fetched.
Everything worked, almost 100% of the time, quite smoothly.
Then I tried the new version, and on the screen I got some weird
indications, like the ones below, and, unfortunately, at a turtle-like speed:
fetch of http://www.greenpeace.org/brasil/transgenicos/noticias/text/javascript failed with: java.net.SocketTimeoutException: Read timed out
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
fetch of http://www.greenpeace.org/international/press/reports/nuclear-waste-crisis-france failed with: java.net.SocketTimeoutException: Read timed out
-activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
Unable to resolve: www.fishunlimited.org, skipping.
fetching http://www.forests.org/archived_site/today/recent/1997/forfadef_files/filelist.xml
fetching http://www.rpi.edu/news/podcasts.html
fetching http://www.news24.com/Beeld/Gallery/Home/0,,,00.html
fetching http://www.epo.org/
-activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0
fetching http://vcforum.eagle.org/banning.cfm
fetching http://cdn.socialtwist.com/2009022511095/script.js
fetching http://www.lrqa.com.br/treinamento/
-activeThreads=60, spinWaiting=54, fetchQueues.totalSize=0
fetching http://www.processingtalk.com/news/eme/eme416.html
fetching http://www.sciencedaily.com/releases/2009/03/090324111600.htm
fetching http://www.asnt-glas.org/meetings.htm
-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0
fetching http://www.embrapa.gov.br/destaques_imagem/brasil-visto-do-espaco
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
fetching http://www.uscg.mil/comdt/blog/2009/01
fetching http://www1.eere.energy.gov/inventions/energytechnet/includes/opera/5
More than this, very often the fetch is aborted with 60 hung threads and,
when I succeed, it seems (I am not absolutely sure about this, but I have a
very strong feeling, considering the size of the resulting segment) that
sometimes the `topN` option is not respected, with fewer pages fetched than
intended.
So, I am relating my own experience, as a simple user of Nutch, hoping that
the problems I faced can be corrected, so that I can use Nutch-1.0, which
is not feasible now.
This log:
-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0
is no big deal. This is nutch showing you information you probably
don't need :)
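
For anyone who still wants to see what those status lines say over a whole run, here is a minimal Python sketch that summarizes them. It assumes the line format is exactly as quoted in this thread, and it assumes (this is not confirmed here) that `spinWaiting` counts threads sitting idle while waiting for something to fetch. The helper names `parse_status` and `idle_fraction` are made up for this illustration and are not part of Nutch.

```python
import re

# Matches the periodic status line as it appears in the log quoted above.
STATUS_RE = re.compile(
    r"-activeThreads=(\d+),\s*spinWaiting=(\d+),\s*fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return (active, spin_waiting, queued) for a status line, else None."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def idle_fraction(lines):
    """Average share of active threads reported as spin-waiting.

    Assumption: spinWaiting counts idle threads; returns None if no
    status line is found.
    """
    ratios = []
    for line in lines:
        parsed = parse_status(line)
        if parsed and parsed[0] > 0:
            active, spinning, _queued = parsed
            ratios.append(spinning / active)
    return sum(ratios) / len(ratios) if ratios else None


sample = [
    "-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0",
    "fetching http://www.epo.org/",
    "-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0",
]
print(idle_fraction(sample))
```

If that assumption holds, a value close to 1 would mean that almost all threads are idle most of the time, which would be consistent with the slow fetch reported above, though it does not by itself explain the cause.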
During nutch 1.0 development, a new fetcher was developed and
it replaced the old fetcher, because the new fetcher has a better, more
flexible code base. However, you are not the first person who reported
problems with it. You may find tracking this issue useful while this
is sorted out:
https://issues.apache.org/jira/browse/NUTCH-721
Thank you
--
Doğacan Güney