RE: Nutch not crawling all URLs

Roseline Antai Wed, 12 Jan 2022 03:13:50 -0800

Hi Sebastian,

For some reason, the mail below went to my junk folder and I didn't see it.


The notco page - https://notco.com/ - was not indexed, no. When I enabled 
redirects, I was able to get a few pages, but they don't seem valid.

Could you confirm if you received all the urls I sent?

Another question I have about Nutch is whether it has problems crawling 
JavaScript pages.

I would ideally love to make the crawler work for my URLs rather than start 
looking at other crawlers and waste all the work so far.

Just adding again, this is what my nutch-site.xml looks like:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>
<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>
<property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
</property>
<property>
    <name>db.ignore.external.links</name>
    <value>true</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for truncated
        documents. By default this property is activated due to extremely high
        levels of CPU which parsing can sometimes take.
    </description>
</property>
 <property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description>The maximum number of outlinks that we'll process for a page.
   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
   will be processed for a page; otherwise, all outlinks will be processed.
   </description>
 </property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>50000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
</configuration>
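
One point that may be relevant to the notco case, assuming the seed 
http://www.notco.com/ redirects to https://notco.com/: with 
db.ignore.external.links set to true and the mode set to byHost, the redirect 
target is on a different host name and may be dropped as an external link. 
A minimal sketch of two overrides that could be tried inside <configuration>; 
the values here are untested assumptions, not a confirmed fix:

```xml
<!-- Assumption: compare links by registered domain rather than by host name,
     so that www.notco.com and notco.com count as the same site. -->
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<!-- Assumption: let the fetcher follow a small number of redirects
     immediately instead of recording them for a later fetch round.
     The value 2 is illustrative only. -->
<property>
  <name>http.redirect.max</name>
  <value>2</value>
</property>
```

After changing these, the status of the redirect target in the CrawlDb would 
still need to be checked to see whether the page was actually fetched.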

Regards,
Roseline

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
Sent: 13 December 2021 17:35
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> 5,36405,0,http://www.notco.com/

What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?

~Sebastian
