Hi Sebastian,

For some reason, the mail below went to my junk folder and I didn't see it.
The notco page (https://notco.com/) was not indexed, no. When I enabled redirects, I was able to get a few pages, but they don't seem valid. Could you confirm whether you received all the URLs I sent?

Another question I have about Nutch: does it have problems crawling JavaScript pages? Ideally, I would much rather make the crawler work for my URLs than start looking at other crawlers and waste all the work so far.

Just adding again, this is what my nutch-site.xml looks like:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch Crawler</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>datalake.ng at gmail d</value>
  </property>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for
    truncated documents. By default this property is activated due to
    extremely high levels of CPU which parsing can sometimes take.
    </description>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.
    </description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content using the http://
    protocol, in bytes. If this value is nonnegative (>=0), content longer
    than it will be truncated; otherwise, no truncation at all. Do not
    confuse this setting with the file.content.limit setting.
    </description>
  </property>
  <property>
    <name>db.ignore.external.links.mode</name>
    <value>byHost</value>
  </property>
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value>
  </property>
  <property>
    <name>http.timeout</name>
    <value>50000</value>
    <description>The default network timeout, in milliseconds.</description>
  </property>
</configuration>

Regards,
Roseline

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
Sent: 13 December 2021 17:35
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> 5,36405,0,http://www.notco.com/

What is the status for https://notco.com/ which is the final redirect target?
Is the target page indexed?

~Sebastian