Hi Roseline,
It looks like you are ignoring external URLs (db.ignore.external.links is
set to true)… that could be the problem right there.
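For illustration only (a rough sketch, not Nutch's actual implementation): with db.ignore.external.links=true and db.ignore.external.links.mode=byDomain, any outlink, and any redirect target when redirects are followed, whose domain differs from the seed's is silently dropped, roughly like:

```python
from urllib.parse import urlparse

def registered_domain(url):
    # Naive "domain" extraction: last two host labels. Nutch has its own
    # domain logic; this is only an approximation for illustration.
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def keep_outlink(seed_url, outlink):
    # Mimics db.ignore.external.links=true with mode=byDomain:
    # drop any link that leaves the seed's domain.
    return registered_domain(seed_url) == registered_domain(outlink)

seed = "http://example.com/"
print(keep_outlink(seed, "http://www.example.com/about"))  # True: same domain
print(keep_outlink(seed, "http://other-site.org/page"))    # False: external, dropped
```

So a seed that simply redirects to a different domain can disappear without any obvious error.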
I encourage you to track the counters from the inject, generate and fetch
phases to understand where records are being dropped.
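For example (paths are assumptions, adjust to your crawl directory), the crawldb reader will tell you how many URLs are in each status after a crawl cycle:

```shell
# Assumed layout: Nutch 1.x install with crawl data under ./crawl

# How many URLs are in each state (fetched, unfetched, gone, ...)?
bin/nutch readdb crawl/crawldb -stats

# Inspect one specific seed that never showed up in Solr:
bin/nutch readdb crawl/crawldb -url https://example.com/
```

Comparing the total against the fetched count after each cycle narrows down which phase is losing your seeds.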
Are the seeds you are using public? If so, please post your seed file so we
can try to reproduce the problem.
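One more thing worth double-checking: the stock regex-urlfilter rule `-[?*!@=]` rejects any URL containing one of those characters, so until you commented it out it would have silently dropped every seed with a query string. A quick way to see which URLs a rule like that rejects (plain Python, just to illustrate the pattern, not the Nutch filter itself):

```python
import re

# The stock regex-urlfilter rule "-[?*!@=]" rejects URLs containing
# any of the characters ? * ! @ =
reject = re.compile(r"[?*!@=]")

seeds = [
    "https://example.com/page",
    "https://example.com/search?q=nutch",  # contains '?' and '='
]
for url in seeds:
    # regex-urlfilter convention: "-" means rejected, "+" means accepted
    print(("-" if reject.search(url) else "+"), url)
```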
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02 <user-digest-h...@nutch.apache.org> wrote:

>
> user Digest 13 Dec 2021 12:02:41 -0000 Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
>         34682 by: Roseline Antai
>
>
>
>
>
> ---------- Forwarded message ----------
> From: Roseline Antai <roseline.an...@strath.ac.uk>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +0000
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache Nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now having the problem that Nutch is refusing to
> crawl all the URLs: it fetches only about half of the URLs in the seed.txt
> file. I am now at a loss as to what I should do to correct this.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on suggestions I saw on the Nutch forum, as well
> as on Stack Overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>http.agent.name</name>
>   <value>Nutch Crawler</value>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>datalake.ng at gmail d</value>
> </property>
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
>   truncated documents. By default this property is activated due to
>   extremely high levels of CPU which parsing can sometimes take.
>   </description>
> </property>
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>   outlinks will be processed for a page; otherwise, all outlinks will be
>   processed.
>   </description>
> </property>
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byDomain</value>
> </property>
>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>http.timeout</name>
>   <value>50000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> </configuration>
>
>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
>
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
>
> ****************************************************************
>
>
>
> <property>
>   <name>ftp.timeout</name>
>   <value>100000</value>
> </property>
>
> <property>
>   <name>ftp.server.timeout</name>
>   <value>150000</value>
> </property>
>
>
>
> ***************************************************************
>
>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>65.0</value>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>25.0</value>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>70</value>
> </property>
>
>
>
> I also commented out the line below in the regex-urlfilter file:
>
>
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I’m not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> Dr Roseline Antai
>
> Research Fellow
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
>
>
>
>
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
