Hi Issam,
I got pretty fast performance yesterday with 10-15 threads and 2
threads per queue. I also set the server delay to 0, but via a
different property; I don't remember its name right now. You can find it
among the fetcher properties in nutch-default.xml.

Looking at your configuration, it seems that property isn't set, so you
should make sure Nutch picks up your value instead of the default. Also,
some parameters carried over from 1.7 are absent in 1.9 or behave
differently. I'd scrap them and set the new ones according to the default file.
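As a rough sketch, the overrides I mean would look something like this in
nutch-site.xml. Double-check the property names against your nutch-default.xml:
fetcher.server.min.delay is my guess for the zero-delay property (it applies
when fetcher.threads.per.queue is greater than 1), not something I've verified:

```xml
<!-- Sketch only; verify names and semantics against nutch-default.xml -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>15</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>
<property>
  <!-- my guess: the delay used when threads.per.queue > 1 -->
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
</property>
```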

Cheers,
Vitaly
On 28 Nov 2014 14:36, "Issam Maamria" <[email protected]> wrote:

> Hi all,
>
> I am running the crawl command with depth 2 using a seed file containing
> 120 urls (about 2000 documents). Halfway through, the following output is
> logged:
>
> *-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
> fetchQueues.getQueueCount=24*
>
> And after a while:
>
> *Aborting with 50 hung threads.*
>
> I am trying exactly the same thing using 1.8, and it is *working fine*.
> Please note that I am not applying any customisations apart from the
> following nutch-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>http.agent.name</name>
>   <value>MyAgent</value>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value> MyAgent,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> <!-- HTTP properties -->
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, the fetcher won't immediately
>   follow redirected URLs; instead it will record them for later fetching.
>   </description>
> </property>
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> <!-- web db properties -->
>
> <!-- fetcher properties -->
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>4.0</value>
>   <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>20</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are
>   made at once (each FetcherThread handles one connection). The total
>   number of threads running in distributed mode will be the number of
>   fetcher threads * number of nodes, as the fetcher has one map task per node.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.queue</name>
>   <value>10</value>
>   <description>This number is the maximum number of threads that
>   should be allowed to access a queue at one time.
>   </description>
> </property>
>
> <!-- plugin properties -->
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
>   <!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need to at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> <!-- parser properties -->
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>utf-8</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
>
> <property>
>   <name>parser.timeout</name>
>   <value>-1</value>
>   <description>Timeout in seconds for the parsing of a document;
>   otherwise treats it as an exception and moves on to the following
>   documents. This parameter is applied to any Parser implementation.
>   Set to -1 to deactivate, bearing in mind that this could cause
>   the parsing to crash because of a very long or corrupted document.
>   </description>
> </property>
>
> </configuration>
>
> ----
>
> Help is greatly appreciated.
>
> Kind regards,
>
> Issam
>
