Thank you, Dogacan, for your very prompt reply (I was truly amazed, thanks).

I would like to point out, however, that apart from the very slow behaviour of the fetcher (it reminds me of when version 0.8 was launched), the fetch phases seem to end with hung threads, and the fetcher also seems, sometimes, not to respect the "topN" setting. It may be that the fetcher 2 is optimized for someone using several servers, but for a single server (and I think this is a very large portion of Nutch users), it does not work very well. At least that was the case in the couple of experiments I ran; I stress once again that with the previous version (including some recent nightly builds), everything worked quite well.

Thanks again for your attention; I really wish Nutch a very big success.





----- Original Message ----- From: "Doğacan Güney" <[email protected]>
To: <[email protected]>
Sent: Wednesday, April 01, 2009 4:54 PM
Subject: Re: Nutch 1.0 experience


On Wed, Apr 1, 2009 at 22:47, consultas <[email protected]> wrote:

Hi,

I have been using Nutch for some years now. I am using it under Cygwin,
with Windows XP, 2GB of memory and a nominal bandwidth of 6 Megs, on a single
server, with around 300,000 pages, for a vertical semi-production
engine. I use 60 threads, using the crawl method for the initial crawl and then switching to the whole-web method. Until the last release, during the fetching phase, I had on my screen a steady rolling list of the pages being fetched.
Everything worked quite smoothly, almost 100% of the time.

Then I tried the new version and, on the screen, I got some weird
indications, like those below, and, unfortunately, at a turtle-like speed:

fetch of http://www.greenpeace.org/brasil/transgenicos/noticias/text/javascript failed with: java.net.SocketTimeoutException: Read timed out
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
fetch of http://www.greenpeace.org/international/press/reports/nuclear-waste-crisis-france failed with: java.net.SocketTimeoutException: Read timed out
-activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
Unable to resolve: www.fishunlimited.org, skipping.
fetching http://www.forests.org/archived_site/today/recent/1997/forfadef_files/filelist.xml
fetching http://www.rpi.edu/news/podcasts.html
fetching http://www.news24.com/Beeld/Gallery/Home/0,,,00.html
fetching http://www.epo.org/
-activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0
fetching http://vcforum.eagle.org/banning.cfm
fetching http://cdn.socialtwist.com/2009022511095/script.js
fetching http://www.lrqa.com.br/treinamento/
-activeThreads=60, spinWaiting=54, fetchQueues.totalSize=0
fetching http://www.processingtalk.com/news/eme/eme416.html
fetching http://www.sciencedaily.com/releases/2009/03/090324111600.htm
fetching http://www.asnt-glas.org/meetings.htm
-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0
fetching http://www.embrapa.gov.br/destaques_imagem/brasil-visto-do-espaco
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
fetching http://www.uscg.mil/comdt/blog/2009/01
fetching http://www1.eere.energy.gov/inventions/energytechnet/includes/opera/5

More than this, very often the fetch is aborted with 60 hung threads and,
when I succeed, it seems (I am not absolutely sure about this, but I have a
very strong feeling, considering the size of the resulting segment) that sometimes the `topN` option is not respected, with fewer pages fetched than
intended.

So, I am relating my own experience, as a simple user of Nutch, hoping that
the problems I faced can be corrected, so that I can use Nutch-1.0, which
is not feasible now.


This log:

-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0

is no big deal. This is nutch showing you information you probably
don't need :)
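
If you do want to read these status lines, here is a minimal sketch in Python (not part of Nutch) that pulls the three counters out of a line in the format shown above. The function names `parse_status` and `busy_threads` are made up for illustration, and treating "activeThreads minus spinWaiting" as the number of threads actually doing work is an assumption about what the counters mean, not something the log itself states.

```python
import re

# Matches status lines in the format printed in the log above.
STATUS = re.compile(
    r"-activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)

def parse_status(line):
    """Return the three counters as a dict, or None if the line is not a status line."""
    m = STATUS.search(line)
    if m is None:
        return None
    active, spinning, queued = map(int, m.groups())
    return {"active": active, "spinning": spinning, "queued": queued}

def busy_threads(status):
    # Assumption: a thread that is not spin-waiting is busy with a fetch.
    return status["active"] - status["spinning"]

sample = "-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0"
status = parse_status(sample)
print(busy_threads(status))  # 60 - 53 = 7
```

Run over a whole fetch log, something like this would show how many threads are waiting at each status line.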

During nutch 1.0 development, a new fetcher was developed and
it replaced the old fetcher, because the new fetcher has a better, more
flexible code base. However, you are not the first person to report
problems with it. You may find tracking this issue useful while this
is sorted out:

https://issues.apache.org/jira/browse/NUTCH-721



Thank you




--
Doğacan Güney


