Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays, but please keep in mind that Nutch is
a community project, so it can sometimes take a few days until
somebody finds the time to respond.

> Could you confirm if you received all the urls I sent?

I've tried a few of the URLs you sent, but not all of them. Figuring out
why a particular site isn't crawled can take some time.
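
If you want to check a single URL yourself, the parsechecker and
indexchecker tools are useful. For example (using the notco.com URL
from your list; please adapt to the URLs you care about):

  bin/nutch parsechecker https://notco.com/
  bin/nutch indexchecker https://notco.com/

They fetch and parse (resp. index) a single page and print the result,
which usually shows whether the problem is on the protocol, parsing
or indexing side.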

> Another question I have about Nutch is if it has problems with crawling
> javascript pages?

By default, Nutch does not execute JavaScript.

There is a protocol plugin (protocol-selenium) which puts a real web
browser between Nutch and the crawled sites. This way JavaScript-rendered
pages can be crawled, at the price of some overhead for setting up the
crawler and extra network traffic to fetch the page dependencies
(CSS, JavaScript, images).
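
An untested sketch of how enabling it could look in your nutch-site.xml
(swap protocol-http for protocol-selenium in plugin.includes; the
selenium.* properties and their defaults vary between Nutch versions,
so please check the plugin documentation and its plugin.xml /
nutch-default.xml for your version):

  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  </property>
  <property>
    <!-- which WebDriver to use; "firefox" needs a local Firefox plus geckodriver -->
    <name>selenium.driver</name>
    <value>firefox</value>
  </property>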

> I would ideally love to make the crawler work for my URLs than start checking
> for other crawlers and waste all the work so far.

Well, Nutch is certainly a good crawler. But as always, there are many
other crawlers which might be better suited to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - https://notco.com/  was not indexed, no. When I enabled 
> redirects, I was able to get a few pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> <property>
>   <name>http.agent.name</name>
>   <value>Nutch Crawler</value>
> </property>
> <property>
>   <name>http.agent.email</name>
>   <value>datalake.ng at gmail d</value>
> </property>
> <property>
>     <name>db.ignore.internal.links</name>
>     <value>false</value>
> </property>
> <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
>     <name>parser.skip.truncated</name>
>     <value>false</value>
>     <description>Boolean value for whether we should skip parsing for
>     truncated documents. By default this property is activated due to
>     extremely high levels of CPU which parsing can sometimes take.
>     </description>
> </property>
>  <property>
>    <name>db.max.outlinks.per.page</name>
>    <value>-1</value>
>    <description>The maximum number of outlinks that we'll process for a page.
>    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>    outlinks will be processed for a page; otherwise, all outlinks will
>    be processed.
>    </description>
>  </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byHost</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>50000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
> </configuration>
> 
> Regards,
> Roseline
> 
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
> Sent: 13 December 2021 17:35
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> 5,36405,0,http://www.notco.com/
> 
> What is the status for https://notco.com/ which is the final redirect
> target?
> Is the target page indexed?
> 
> ~Sebastian
> 
