Hi, (looping back to user@nutch - sorry, pressed the wrong reply button)
> Some URLs were denied by robots.txt, while a few failed with:
> Http code=403

Those are two ways of signalling that these pages shouldn't be crawled;
HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb also contains URLs which failed for various reasons. That's
important in order to avoid retrying 404s, 403s, etc. again and again.

> I also ran some of the URLs that were not crawled through this -
> bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.

The "HTTP 403 Forbidden" could come from "anti-bot protection" software.
If you run parsechecker at a different time or from a different machine,
and not repeatedly or too often, it may succeed.
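Either way, the reason is kept per URL in the CrawlDb and can be looked
up directly. A minimal sketch, assuming the crawl data lives under
./crawl (adjust the path and pick one of the unfetched URLs):

  # overview: counts per status (db_unfetched, db_fetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats

  # details for a single URL: status, fetch time, retries and metadata
  bin/nutch readdb crawl/crawldb -url https://www.example.org/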
Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote:
> Hi Sebastian,
>
> Thank you for your reply.
>
> 1. All URLs were injected, so 20 in total. None was rejected.
>
> 2. I've had a look at the log files and I can see that some of the URLs
> could not be fetched because the robots.txt file could not be found.
> Would this be a reason why the fetch failed? Is there a way to work
> around it?
>
> Some URLs were denied by robots.txt, while a few failed with: Http code=403
>
> 3. I looked in CrawlDB and most URLs are in there, but were not crawled,
> so this is something that I find very confusing.
>
> I also ran some of the URLs that were not crawled through this - bin/nutch
> parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.
>
> Do you have any suggestions on what I should try?
>
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
> The University of Strathclyde is a charitable body, registered in Scotland,
> number SC015263.
>
>
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 13 December 2021 12:19
> To: Roseline Antai <roseline.an...@strath.ac.uk>
> Subject: Re: Nutch not crawling all URLs
>
> CAUTION: This email originated outside the University. Check before clicking
> links or attachments.
>
> Hi Roseline,
>
>> For instance, when I inject 20 URLs, only 9 are fetched.
>
> Are there any log messages about the 11 unfetched URLs in the log files?
> Try to look for a file "hadoop.log" (usually in $NUTCH_HOME/logs/) and look:
>
> 1. how many URLs have been injected.
>    There should be a log message
>      ... Total new urls injected: ...
>
> 2. If all 20 URLs are injected, there should be log messages about these
>    URLs from the fetcher:
>      FetcherThread ... fetching ...
>    If the fetch fails, there might be a message about this.
>
> 3. Look into the CrawlDb for the missing URLs.
>      bin/nutch readdb .../crawldb -url <url>
>    or
>      bin/nutch readdb .../crawldb -dump ...
>    You get the command-line options by calling
>      bin/nutch readdb
>    without any arguments.
>
> Alternatively, verify fetching and parsing the URLs by
>   bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>
> Perhaps you want to follow redirects anyway? See
>
> <property>
>   <name>db.ignore.also.redirects</name>
>   <value>true</value>
>   <description>If true, the fetcher checks redirects the same way as
>   links when ignoring internal or external links. Set to false to
>   follow redirects despite the values for db.ignore.external.links and
>   db.ignore.internal.links.
>   </description>
> </property>
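In other words, if redirects should be followed even though
db.ignore.external.links is true, the override goes into nutch-site.xml;
a minimal sketch based on the description above:

  <property>
    <name>db.ignore.also.redirects</name>
    <!-- false: follow redirects despite db.ignore.external.links=true -->
    <value>false</value>
  </property>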
> Best,
> Sebastian
>
> On 12/13/21 13:02, Roseline Antai wrote:
>> Hi,
>>
>> I am working with Apache Nutch 1.18 and Solr. I have set up the system
>> successfully, but I'm now having the problem that Nutch is refusing to
>> crawl all the URLs. I am now at a loss as to what I should do to correct
>> this problem. It fetches about half of the URLs in the seed.txt file.
>>
>> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
>> number of changes based on the suggestions I saw on the Nutch forum,
>> as well as on Stack Overflow, but nothing seems to work.
>>
>> This is what my nutch-site.xml file looks like:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>Nutch Crawler</value>
>> </property>
>>
>> <property>
>>   <name>http.agent.email</name>
>>   <value>datalake.ng at gmail d</value>
>> </property>
>>
>> <property>
>>   <name>db.ignore.internal.links</name>
>>   <value>false</value>
>> </property>
>>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>> </property>
>>
>> <property>
>>   <name>parser.skip.truncated</name>
>>   <value>false</value>
>>   <description>Boolean value for whether we should skip parsing for
>>   truncated documents. By default this property is activated due to
>>   extremely high levels of CPU which parsing can sometimes take.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>-1</value>
>>   <description>The maximum number of outlinks that we'll process for a
>>   page. If this value is nonnegative (>=0), at most
>>   db.max.outlinks.per.page outlinks will be processed for a page;
>>   otherwise, all outlinks will be processed.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http://
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>db.ignore.external.links.mode</name>
>>   <value>byDomain</value>
>> </property>
>>
>> <property>
>>   <name>db.injector.overwrite</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>50000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>>
>> </configuration>
>>
>> Other changes I have made include changing the following in
>> nutch-default.xml:
>>
>> <property>
>>   <name>http.redirect.max</name>
>>   <value>2</value>
>>   <description>The maximum number of redirects the fetcher will follow
>>   when trying to fetch a page. If set to negative or 0, fetcher won't
>>   immediately follow redirected URLs, instead it will record them for
>>   later fetching.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>ftp.timeout</name>
>>   <value>100000</value>
>> </property>
>>
>> <property>
>>   <name>ftp.server.timeout</name>
>>   <value>150000</value>
>> </property>
>>
>> <property>
>>   <name>fetcher.server.delay</name>
>>   <value>65.0</value>
>> </property>
>>
>> <property>
>>   <name>fetcher.server.min.delay</name>
>>   <value>25.0</value>
>> </property>
>>
>> <property>
>>   <name>fetcher.max.crawl.delay</name>
>>   <value>70</value>
>> </property>
>>
>> I also commented out the line below in the regex-urlfilter file:
>>
>>   # skip URLs containing certain characters as probable queries, etc.
>>   -[?*!@=]
>>
>> Nothing seems to work.
>>
>> What is it that I'm not doing, or doing wrongly here?
>>
>> Regards,
>>
>> Roseline
>>
>> Dr Roseline Antai
>> Research Fellow
>> Hunter Centre for Entrepreneurship
>> Strathclyde Business School
>> University of Strathclyde, Glasgow, UK
>>
>> The University of Strathclyde is a charitable body, registered in
>> Scotland, number SC015263.