Hi, (looping back to user@nutch - sorry, pressed the wrong reply button)
> Some URLs were denied by robots.txt, while a few failed with:
> Http code=403

Those are two ways of signalling that these pages shouldn't be crawled;
HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb also contains URLs which failed for various reasons. That's
important in order to avoid retrying 404s, 403s, etc. again and again.

> I also ran some of the URLs that were not crawled through this -
> bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.

The "HTTP 403 Forbidden" could come from "anti-bot protection" software.
If you run parsechecker at a different time or from a different machine,
and not repeatedly or too often, it may succeed.
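Either way, the reason is kept per URL in the CrawlDb and can be looked
up directly. A minimal sketch, assuming the crawl data lives under
./crawl (adjust the path and pick one of the unfetched URLs):

  # overview: counts per status (db_unfetched, db_fetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats

  # details for a single URL: status, fetch time, retries and metadata
  bin/nutch readdb crawl/crawldb -url https://www.example.org/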
Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote:
> Hi Sebastian,
>
> Thank you for your reply.
>
> 1. All URLs were injected, so 20 in total. None was rejected.
>
> 2. I've had a look at the log files and I can see that some of the URLs
> could not be fetched because the robots.txt file could not be found.
> Would this be a reason why the fetch failed? Is there a way to work
> around it?
>
> Some URLs were denied by robots.txt, while a few failed with: Http code=403
>
> 3. I looked in CrawlDB and most URLs are in there, but were not crawled,
> so this is something that I find very confusing.
>
> I also ran some of the URLs that were not crawled through this - bin/nutch
> parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.
>
> Do you have any suggestions on what I should try?
>
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
>
> The University of Strathclyde is a charitable body, registered in Scotland,
> number SC015263.
>
>
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 13 December 2021 12:19
> To: Roseline Antai <roseline.an...@strath.ac.uk>
> Subject: Re: Nutch not crawling all URLs
>
> CAUTION: This email originated outside the University. Check before clicking
> links or attachments.
>
> Hi Roseline,
>
>> For instance, when I inject 20 URLs, only 9 are fetched.
>
> Are there any log messages about the 11 unfetched URLs in the log files?
> Try to look for a file "hadoop.log" (usually in $NUTCH_HOME/logs/) and look:
>
> 1. how many URLs have been injected.
>    There should be a log message
>      ... Total new urls injected: ...
>
> 2. If all 20 URLs are injected, there should be log messages about these
>    URLs from the fetcher:
>      FetcherThread ... fetching ...
>    If the fetch fails, there might be a message about this.
>
> 3. Look into the CrawlDb for the missing URLs.
>      bin/nutch readdb .../crawldb -url <url>
>    or
>      bin/nutch readdb .../crawldb -dump ...
>    You get the command-line options by calling
>      bin/nutch readdb
>    without any arguments.
>
> Alternatively, verify fetching and parsing the URLs by
>   bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>
> Perhaps you want to follow redirects anyway? See
>
> <property>
>   <name>db.ignore.also.redirects</name>
>   <value>true</value>
>   <description>If true, the fetcher checks redirects the same way as
>   links when ignoring internal or external links. Set to false to
>   follow redirects despite the values for db.ignore.external.links and
>   db.ignore.internal.links.
>   </description>
> </property>
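In other words, if redirects should be followed even though
db.ignore.external.links is true, the override goes into nutch-site.xml;
a minimal sketch based on the description above:

  <property>
    <name>db.ignore.also.redirects</name>
    <!-- false: follow redirects despite db.ignore.external.links=true -->
    <value>false</value>
  </property>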
> Best,
> Sebastian
>
> On 12/13/21 13:02, Roseline Antai wrote:
>> Hi,
>>
>> I am working with Apache Nutch 1.18 and Solr. I have set up the system
>> successfully, but I'm now having the problem that Nutch is refusing to
>> crawl all the URLs. I am now at a loss as to what I should do to correct
>> this problem. It fetches about half of the URLs in the seed.txt file.
>>
>> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
>> number of changes based on the suggestions I saw on the Nutch forum,
>> as well as on Stack Overflow, but nothing seems to work.
>>
>> This is what my nutch-site.xml file looks like:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>Nutch Crawler</value>
>> </property>
>>
>> <property>
>>   <name>http.agent.email</name>
>>   <value>datalake.ng at gmail d</value>
>> </property>
>>
>> <property>
>>   <name>db.ignore.internal.links</name>
>>   <value>false</value>
>> </property>
>>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>> </property>
>>
>> <property>
>>   <name>parser.skip.truncated</name>
>>   <value>false</value>
>>   <description>Boolean value for whether we should skip parsing for
>>   truncated documents. By default this property is activated due to
>>   extremely high levels of CPU which parsing can sometimes take.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>-1</value>
>>   <description>The maximum number of outlinks that we'll process for a
>>   page. If this value is nonnegative (>=0), at most
>>   db.max.outlinks.per.page outlinks will be processed for a page;
>>   otherwise, all outlinks will be processed.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http://
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>db.ignore.external.links.mode</name>
>>   <value>byDomain</value>
>> </property>
>>
>> <property>
>>   <name>db.injector.overwrite</name>
>>   <value>true</value>
>> </property>
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>50000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>>
>> </configuration>
>>
>> Other changes I have made include changing the following in
>> nutch-default.xml:
>>
>> <property>
>>   <name>http.redirect.max</name>
>>   <value>2</value>
>>   <description>The maximum number of redirects the fetcher will follow
>>   when trying to fetch a page. If set to negative or 0, fetcher won't
>>   immediately follow redirected URLs, instead it will record them for
>>   later fetching.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>ftp.timeout</name>
>>   <value>100000</value>
>> </property>
>>
>> <property>
>>   <name>ftp.server.timeout</name>
>>   <value>150000</value>
>> </property>
>>
>> <property>
>>   <name>fetcher.server.delay</name>
>>   <value>65.0</value>
>> </property>
>>
>> <property>
>>   <name>fetcher.server.min.delay</name>
>>   <value>25.0</value>
>> </property>
>>
>> <property>
>>   <name>fetcher.max.crawl.delay</name>
>>   <value>70</value>
>> </property>
>>
>> I also commented out the line below in the regex-urlfilter file:
>>
>>   # skip URLs containing certain characters as probable queries, etc.
>>   -[?*!@=]
>>
>> Nothing seems to work.
>>
>> What is it that I'm not doing, or doing wrongly here?
>>
>> Regards,
>>
>> Roseline
>>
>> Dr Roseline Antai
>> Research Fellow
>> Hunter Centre for Entrepreneurship
>> Strathclyde Business School
>> University of Strathclyde, Glasgow, UK
>>
>> The University of Strathclyde is a charitable body, registered in
>> Scotland, number SC015263.