On Wed, Apr 1, 2009 at 22:47, consultas <[email protected]> wrote:
> Hi,
>
> I have been using Nutch for some years now. I am running it under Cygwin
> on Windows XP, with 2 GB of memory and a nominal bandwidth of 6 Megs, on a
> single server, with about 300,000 pages, for a vertical semi-production
> engine. I use 60 threads, starting with the crawl method for the initial
> crawl and then switching to the whole-web method. Until the last release,
> during the fetching phase, I had on my screen a steadily rolling list of
> the pages being indexed. Everything worked quite smoothly, almost 100% of
> the time.
>
> Then I tried the new version and, on the screen, I got some weird
> output like the lines below and, unfortunately, a turtle-like speed:
>
> fetch of http://www.greenpeace.org/brasil/transgenicos/noticias/text/javascript failed with: java.net.SocketTimeoutException: Read timed out
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> fetch of http://www.greenpeace.org/international/press/reports/nuclear-waste-crisis-france failed with: java.net.SocketTimeoutException: Read timed out
> -activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=59, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=58, fetchQueues.totalSize=0
> Unable to resolve: www.fishunlimited.org, skipping.
> fetching http://www.forests.org/archived_site/today/recent/1997/forfadef_files/filelist.xml
> fetching http://www.rpi.edu/news/podcasts.html
> fetching http://www.news24.com/Beeld/Gallery/Home/0,,,00.html
> fetching http://www.epo.org/
> -activeThreads=60, spinWaiting=55, fetchQueues.totalSize=0
> fetching http://vcforum.eagle.org/banning.cfm
> fetching http://cdn.socialtwist.com/2009022511095/script.js
> fetching http://www.lrqa.com.br/treinamento/
> -activeThreads=60, spinWaiting=54, fetchQueues.totalSize=0
> fetching http://www.processingtalk.com/news/eme/eme416.html
> fetching http://www.sciencedaily.com/releases/2009/03/090324111600.htm
> fetching http://www.asnt-glas.org/meetings.htm
> -activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0
> fetching http://www.embrapa.gov.br/destaques_imagem/brasil-visto-do-espaco
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> -activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0
> fetching http://www.uscg.mil/comdt/blog/2009/01
> fetching http://www1.eere.energy.gov/inventions/energytechnet/includes/opera/5
>
> Moreover, very often the fetch is aborted with 60 hung threads and,
> when I succeed, it seems (I am not absolutely sure about this, but I have
> a very strong feeling, considering the size of the resulting segment) that
> sometimes the `topN` option is not respected, with fewer pages fetched
> than intended.
>
> So, I am relating my own experience, as a simple user of Nutch, hoping that
> the problems I faced can be corrected so that I can use Nutch-1.0, which
> is not feasible now.
>

This log:

-activeThreads=60, spinWaiting=53, fetchQueues.totalSize=0

is no big deal.
This is Nutch showing you information you probably don't need :)

During Nutch 1.0 development, a new fetcher was developed, and it replaced
the old fetcher because it has a better, more flexible code base. However,
you are not the first person to report problems with it. You may find it
useful to track this issue while this is sorted out:

https://issues.apache.org/jira/browse/NUTCH-721

>
> Thank you
>

-- 
Doğacan Güney
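If you want to quantify how idle the fetcher is rather than eyeball the status lines, a minimal sketch along these lines might help. It is not part of Nutch: `parse_status` and `idle_fraction` are hypothetical helper names, and it assumes `spinWaiting` counts threads that are waiting for work and that the status lines look exactly like the ones quoted in this thread.

```python
import re

# Matches the fetcher status lines quoted in this thread, e.g.
# "-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0"
STATUS_RE = re.compile(
    r"-activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return (active, spin_waiting, queued), or None for non-status lines."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def idle_fraction(lines):
    """Average share of threads reported as spin-waiting over all status lines.

    Returns None when no usable status line is present.
    """
    ratios = []
    for line in lines:
        status = parse_status(line)
        if status is not None and status[0] > 0:
            ratios.append(status[1] / status[0])
    return sum(ratios) / len(ratios) if ratios else None


# Sample lines copied from the log above.
sample = [
    "-activeThreads=60, spinWaiting=57, fetchQueues.totalSize=0",
    "fetching http://www.epo.org/",
    "-activeThreads=60, spinWaiting=54, fetchQueues.totalSize=0",
]
print(idle_fraction(sample))
```

A value close to 1 together with `fetchQueues.totalSize=0` would suggest that most threads have nothing to fetch, which is consistent with the slow crawl described above.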
