RE: Nutch not crawling all URLs

Roseline Antai Mon, 13 Dec 2021 13:33:42 -0800

Hi Lewis,

I got a really weird reply back from what I sent, so I thought it better to 
resend the URLs. I'm unsure if you got them the first time.


I've sent them as a text file attachment as well.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org
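
As a quick sanity check on the seed list above, here is a minimal Python sketch (not part of Nutch) that tests each seed against the character class from the `-[?*!@=]` regex-urlfilter rule quoted further down, and counts how many seeds use https. It assumes the rule is applied as a "find anywhere in the URL" match; `SEEDS`, `QUERY_CHARS` and `rule_would_skip` are names made up for this illustration.

```python
import re
from urllib.parse import urlsplit

# The 20 seed URLs exactly as listed in this message.
SEEDS = """
http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org
""".split()

# Character class taken from the "-[?*!@=]" rule (assumption: substring match).
QUERY_CHARS = re.compile(r"[?*!@=]")


def rule_would_skip(url):
    """Return True if the URL contains any character the rule rejects."""
    return QUERY_CHARS.search(url) is not None


skipped = [u for u in SEEDS if rule_would_skip(u)]
https_seeds = [u for u in SEEDS if urlsplit(u).scheme == "https"]

print(len(SEEDS), "seeds,", len(skipped), "match the rule,",
      len(https_seeds), "use https")
```

None of the seeds as written contain those characters, so that rule on its own would not reject any of them at inject time. URLs reached through redirects are a separate matter that this sketch does not check.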

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK


The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


-----Original Message-----
From: lewis john mcgibbney <[email protected]> 
Sent: 13 December 2021 17:18
To: [email protected]
Subject: Re: Nutch not crawling all URLs

Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc
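
One way to follow the counter advice above is to run `bin/nutch readdb <crawldb> -stats` after each phase and compare the per-status counts. The Python sketch below only parses such output; it assumes the status lines look like `status 2 (db_fetched):<tab>9`, and the sample text is hypothetical, made up to mirror the 20-injected / 9-fetched numbers in this thread. `status_counts` is a helper name invented for the illustration.

```python
import re

# Assumed line shape: "status <n> (<name>):<whitespace><count>".
# Adjust the pattern if your readdb output differs.
STATUS_LINE = re.compile(r"status\s+\d+\s+\((\w+)\):\s*(\d+)")


def status_counts(stats_text):
    """Return {status_name: count} for every status line found in the text."""
    return {name: int(count) for name, count in STATUS_LINE.findall(stats_text)}


# Hypothetical sample, not real output from this crawl.
sample = """\
TOTAL urls:\t20
status 1 (db_unfetched):\t11
status 2 (db_fetched):\t9
"""

counts = status_counts(sample)
for name, count in sorted(counts.items()):
    print(f"{name}: {count}")
```

Comparing these counts between the inject, generate and fetch steps should show at which phase the missing records stop progressing.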

On Mon, Dec 13, 2021 at 04:02 <[email protected]> wrote:

>
> user Digest 13 Dec 2021 12:02:41 -0000 Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
>         34682 by: Roseline Antai
>
> ---------- Forwarded message ----------
> From: Roseline Antai <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +0000
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache Nutch 1.18 and Solr. I have set up the system 
> successfully, but I'm now having the problem that Nutch is refusing to 
> crawl all the URLs. I am now at a loss as to what I should do to 
> correct this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
> number of changes based on the suggestions I saw on the Nutch forum, 
> as well as on Stack Overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> *<?xml version="1.0"?>*
>
> *<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>*
>
>
>
> *<!-- Put site-specific property overrides in this file. -->*
>
>
>
> *<configuration>*
>
> *<property>*
>
> *<name>http.agent.name</name>*
>
> *<value>Nutch Crawler</value>*
>
> *</property>*
>
> *<property>*
>
> *<name>http.agent.email</name>                         *
>
> *<value>datalake.ng at gmail d</value> *
>
> *</property>*
>
> *<property>*
>
> *    <name>db.ignore.internal.links</name>*
>
> *    <value>false</value>*
>
> *</property>*
>
> *<property>*
>
> *    <name>db.ignore.external.links</name>*
>
> *    <value>true</value>*
>
> *</property>*
>
> *<property>*
>
> *  <name>plugin.includes</name>*
>
> *  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>*
>
> *</property>*
>
> *<property>*
>
> *    <name>parser.skip.truncated</name>*
>
> *    <value>false</value>*
>
> *    <description>Boolean value for whether we should skip parsing for
> truncated documents. By default this*
>
> *        property is activated due to extremely high levels of CPU which
> parsing can sometimes take.*
>
> *    </description>*
>
> *</property>*
>
> *<property>*
>
> *   <name>db.max.outlinks.per.page</name>*
>
> *   <value>-1</value>*
>
> *   <description>The maximum number of outlinks that we'll process for a
> page.*
>
> *   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> outlinks*
>
> *   will be processed for a page; otherwise, all outlinks will be
> processed.*
>
> *   </description>*
>
> *</property>*
>
> *<property>*
>
> *  <name>http.content.limit</name>*
>
> *  <value>-1</value>*
>
> *  <description>The length limit for downloaded content using the 
> http://*
>
> *  protocol, in bytes. If this value is nonnegative (>=0), content 
> longer*
>
> *  than it will be truncated; otherwise, no truncation at all. Do not*
>
> *  confuse this setting with the file.content.limit setting.*
>
> *  </description>*
>
> *</property>*
>
> *<property>*
>
> *  <name>db.ignore.external.links.mode</name>*
>
> *  <value>byDomain</value>*
>
> *</property>*
>
> *<property>*
>
> *  <name>db.injector.overwrite</name>*
>
> *  <value>true</value>*
>
> *</property>*
>
> *<property>*
>
> *  <name>http.timeout</name>*
>
> *  <value>50000</value>*
>
> *  <description>The default network timeout, in
> milliseconds.</description>*
>
> *</property>*
>
> *</configuration>*
>
>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
>
>
> *<property>*
>
> *  <name>http.redirect.max</name>*
>
> *  <value>2</value>*
>
> *  <description>The maximum number of redirects the fetcher will follow
> when*
>
> *  trying to fetch a page. If set to negative or 0, fetcher won't
> immediately*
>
> *  follow redirected URLs, instead it will record them for later 
> fetching.*
>
> *  </description>*
>
> *</property>*
>
> ****************************************************************
>
>
>
> *<property>*
>
> *  <name>ftp.timeout</name>*
>
> *  <value>100000</value>*
>
> *</property>*
>
>
>
> *<property>*
>
> *  <name>ftp.server.timeout</name>*
>
> *  <value>150000</value>*
>
> *</property>*
>
>
>
> ***************************************************************
>
>
>
> *<property>*
>
> *  <name>fetcher.server.delay</name>*
>
> *  <value>65.0</value>*
>
> *</property>*
>
>
>
> *<property>*
>
> *  <name>fetcher.server.min.delay</name>*
>
> *  <value>25.0</value>*
>
> *</property>*
>
>
>
> *<property>*
>
> * <name>fetcher.max.crawl.delay</name>*
>
> * <value>70</value>*
>
> *</property> *
>
>
>
> I also commented out the line below in the regex-urlfilter file:
>
>
>
> *# skip URLs containing certain characters as probable queries, etc.*
>
> *-[?*!@=]*
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I'm not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> *Dr Roseline Antai*
>
> *Research Fellow*
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
> The University of Strathclyde is a charitable body, registered in 
> Scotland, number SC015263.
>
>
>
>
>
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
