RE: Nutch not crawling all URLs

Roseline Antai Wed, 12 Jan 2022 08:53:52 -0800

Hi Sebastian,

Thank you. I did enjoy the holiday. Hope you did too.


I have had a look at the protocol-selenium plugin, but it was a bit difficult 
to understand. It appears it only works with Firefox. Does it work at all with 
Chrome? I was also not sure of what values to set for the properties. It seems 
you need to have some form of GUI to run it?

Is there some documentation or tutorial on this? My guess is that some of the 
pages might not be crawling because of JavaScript. I might be wrong, but would 
want to test that.

I think it would be quite good for my use case because I am trying to implement 
broad crawling. 

My use case is text mining and machine learning classification. I'm indexing 
into Solr and then transferring the indexed data to MongoDB for further 
processing.

Kind regards,
Roseline





-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
Sent: 12 January 2022 16:12
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays, but I want to emphasize that Nutch is a community 
project, and it may take a few days until somebody finds the time to 
respond.

> Could you confirm if you received all the urls I sent?

I've tried a few of the URLs you sent but not all of them. Figuring out the 
reason why a site isn't crawled may take some time.

> Another question I have about Nutch is if it has problems with 
> crawling javascript pages?

By default Nutch does not execute Javascript.

There is a protocol plugin (protocol-selenium) which fetches pages through a 
web browser placed between Nutch and the crawled sites. This way Javascript 
pages can be crawled, at the price of some overhead in setting up the crawler 
and extra network traffic to fetch the page dependencies (CSS, Javascript, 
images).
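
As a rough sketch, switching the fetcher to protocol-selenium in 
nutch-site.xml could look like the fragment below. The property names 
(selenium.driver, selenium.enable.headless) and their values are assumptions 
based on nutch-default.xml of recent 1.x releases, so please check them 
against the nutch-default.xml shipped with your version. The rest of 
plugin.includes is copied from the configuration quoted further down.

```xml
<!-- Sketch only: verify every property name against your nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-selenium replaces protocol-http; other plugins unchanged -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <!-- Assumed property: which WebDriver flavour to start -->
  <name>selenium.driver</name>
  <value>firefox</value>
</property>
<property>
  <!-- Assumed property: run the browser without a GUI, if supported -->
  <name>selenium.enable.headless</name>
  <value>true</value>
</property>
```

The matching browser and its WebDriver binary still have to be installed on 
every machine that runs the fetcher.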

> I would ideally love to make the crawler work for my URLs than start 
> checking for other crawlers and waste all the work so far.

Well, Nutch is for sure a good crawler. But as always: there are many other 
crawlers which might be better adapted to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - https://notco.com/ was not indexed, no. When I enabled 
> redirects, I was able to get a few pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>Nutch Crawler</value>
> </property>
> <property>
>   <name>http.agent.email</name>
>   <value>datalake.ng at gmail d</value>
> </property>
> <property>
>     <name>db.ignore.internal.links</name>
>     <value>false</value>
> </property>
> <property>
>     <name>db.ignore.external.links</name>
>     <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
>     <name>parser.skip.truncated</name>
>     <value>false</value>
>     <description>Boolean value for whether we should skip parsing for
>         truncated documents. By default this property is activated due to
>         extremely high levels of CPU which parsing can sometimes take.
>     </description>
> </property>
>  <property>
>    <name>db.max.outlinks.per.page</name>
>    <value>-1</value>
>    <description>The maximum number of outlinks that we'll process for a page.
>    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>    outlinks will be processed for a page; otherwise, all outlinks will be
>    processed.
>    </description>
>  </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byHost</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>50000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
> </configuration>
> 
> Regards,
> Roseline
> 
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
> Sent: 13 December 2021 17:35
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> 5,36405,0,http://www.notco.com/
> 
> What is the status for https://notco.com/ which is the final redirect
> target?
> Is the target page indexed?
> 
> ~Sebastian
>
