Hi,
Following on from my previous enquiry, I was told to send the URLs I was trying
to crawl so that they could be tested from your end. I sent these, but did not
receive any confirmation of receipt. Can you please confirm whether they have
been received, and when I can expect some feedback?
I re-crawled the 20 URLs after resetting these values to the default values
from the nutch-default.xml file:
<property>
<name>fetcher.server.delay</name>
<value>65.0</value>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>25.0</value>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>70</value>
</property>
I then set db.ignore.external.links to false, as below:
<property>
<name>db.ignore.external.links</name>
<value>false</value>
</property>
I left the following property set to 'true':
<property>
<name>db.ignore.also.redirects</name>
<value>true</value>
<description>If true, the fetcher checks redirects the same way as
links when ignoring internal or external links. Set to false to
follow redirects despite the values for db.ignore.external.links and
db.ignore.internal.links.
</description>
</property>
This time 13 URLs were fetched, but among these, the URLs that were originally
not fetched returned very few pages related to their own domains, which makes
me question the results of the crawl.
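To sanity-check which hosts the fetched pages actually came from, I assume the
CrawlDb statistics can be broken down per host, along these lines
(crawl/crawldb is the path in my setup):

bin/nutch readdb crawl/crawldb -stats -sort

If I understand the -sort option correctly, this lists the status counts per
host, which should make it clear how many fetched pages actually stayed on the
seed domains.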
Also, when external links are not ignored, the crawler wanders off onto
unrelated sites such as Wikipedia and news sites. This is hardly efficient, as
it spends so long on the crawl fetching irrelevant pages. How can this be
controlled in Nutch? When we scale up to around 900 seed URLs, as we plan to,
will we have to write a regular expression for each URL in regex-urlfilter.txt
in order to stay within the seed domains?
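If per-domain filters are indeed the only way, then, as I understand it,
regex-urlfilter.txt is evaluated top to bottom with the first matching rule
winning, so the file would need one accept rule per seed domain plus a
catch-all reject at the end, something like this (example.com and example.org
stand in for our real seed domains):

# accept the seed domains and their subdomains (placeholder domains)
+^https?://([a-z0-9-]+\.)*example\.com/
+^https?://([a-z0-9-]+\.)*example\.org/
# reject everything else
-.

Maintaining around 900 such rules by hand is exactly what I was hoping to
avoid.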
There seems to be no explicit documentation on how to do this in Nutch, unless
I have missed it?
Is there something I should be doing that I'm not, or is Nutch simply
incapable of efficient crawling?
Regards,
Roseline
Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK
The University of Strathclyde is a charitable body, registered in Scotland,
number SC015263.
From: Roseline Antai
Sent: 13 December 2021 12:02
To: '[email protected]' <[email protected]>
Subject: Nutch not crawling all URLs
Hi,
I am working with Apache Nutch 1.18 and Solr. I have set up the system
successfully, but Nutch is now refusing to crawl all the URLs, and I am at a
loss as to how to correct this problem. It fetches only about half of the URLs
in the seed.txt file.
For instance, when I inject 20 URLs, only 9 are fetched. I have made a number
of changes based on the suggestions I saw on the Nutch forum, as well as on
Stack Overflow, but nothing seems to work.
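In case it helps to narrow things down, I assume the status of the missing
URLs can be read back from the CrawlDb with something like the following
(crawl/crawldb is my local path; the URL is a placeholder for one of the
unfetched seeds):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -url https://www.example.com/

This should show whether the missing seeds are recorded as db_unfetched,
db_gone, or as redirects.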
This is what my nutch-site.xml file looks like:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Nutch Crawler</value>
</property>
<property>
<name>http.agent.email</name>
<value>datalake.ng at gmail dot com</value>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing for truncated
documents. By default this
property is activated due to extremely high levels of CPU which parsing
can sometimes take.
</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>db.ignore.external.links.mode</name>
<value>byDomain</value>
</property>
<property>
<name>db.injector.overwrite</name>
<value>true</value>
</property>
<property>
<name>http.timeout</name>
<value>50000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
</configuration>
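In case the invocation matters, I launch the crawl with the standard crawl
script, i.e. something like the following (the seed directory, crawl directory
and number of rounds are from my local run):

bin/crawl -i -s urls crawl 3

where urls/ contains seed.txt and -i indexes the results into Solr.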
Other changes I have made include changing the following in nutch-default.xml:
<property>
<name>http.redirect.max</name>
<value>2</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
**************************************************************
<property>
<name>ftp.timeout</name>
<value>100000</value>
</property>
<property>
<name>ftp.server.timeout</name>
<value>150000</value>
</property>
*************************************************************
<property>
<name>fetcher.server.delay</name>
<value>65.0</value>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>25.0</value>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>70</value>
</property>
I also commented out the line below in the regex-urlfilter file:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
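To rule out the filters themselves, I assume the seed list can be piped
through the URL filter checker, something like this (urls/seed.txt is my seed
file path):

bin/nutch filterchecker -stdin < urls/seed.txt

URLs printed with a leading '-' would be the ones the active filters reject.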
Nothing seems to work.
What is it that I'm not doing, or doing wrongly here?
Regards,
Roseline
Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK
The University of Strathclyde is a charitable body, registered in Scotland,
number SC015263.