Hi,

Continuing this thread: I tried the Selenium plugin as suggested below. I have copied my nutch-site.xml file below to show the parameters set for the Selenium plugin; I have removed most of the descriptions for brevity:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>Esid Crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>roselineantai at gmail dot com</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://esid.shinyapps.io/ESID/</value>
</property>
<property>
  <name>db.ignore.also.redirects</name>
  <value>false</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>30</value>
  <description>The default number of seconds between re-fetches of a page (30 days).</description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes take.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>100000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>selenium.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.take.screenshot</name>
  <value>false</value>
</property>
<property>
  <name>selenium.screenshot.location</name>
  <value></value>
</property>
<property>
  <name>selenium.hub.port</name>
  <value>4444</value>
  <description>Selenium Hub Location connection port</description>
</property>
<property>
  <name>selenium.hub.path</name>
  <value>/wd/hub</value>
  <description>Selenium Hub Location connection path</description>
</property>
<property>
  <name>selenium.hub.host</name>
  <value>localhost</value>
  <description>Selenium Hub Location connection host</description>
</property>
<property>
  <name>selenium.hub.protocol</name>
  <value>http</value>
  <description>Selenium Hub Location connection protocol</description>
</property>
<property>
  <name>selenium.grid.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.grid.binary</name>
  <value>/usr/bin/chromedriver</value>
</property>
<!-- lib-selenium configuration -->
<property>
  <name>libselenium.page.load.delay</name>
  <value>3</value>
</property>
<property>
  <name>webdriver.chrome.driver</name>
  <value>/root/chromedriver</value>
  <description>The path to the ChromeDriver binary</description>
</property>
<!-- headless options for Firefox and Chrome -->
<property>
  <name>selenium.enable.headless</name>
  <value>true</value>
  <description>A Boolean value representing the headless option for the Firefox and Chrome drivers</description>
</property>
</configuration>

When I tested the setup using this:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText URL

with some of the problematic URLs, they all came out well on the console. There were, however, quite a number of URLs identified as outlinks. But when I run the full crawl with this plugin, it appears to show some data in Solr, yet I have been unable to extract any data: it gives '0' as the count of what has been crawled, for all the URLs. This is quite worrying, because without the plugin I did manage to get data from about half of the URLs, so the performance is far worse than it should be. I'm also confused because testing some of the sites with the example I was given above works. Below is a sample of the errors I got from the log files.
Please have a look at them and let me know if there is a parameter I'm not setting properly:

2022-02-15 01:49:02,093 ERROR tika.TikaParser - Problem loading custom Tika configuration from tika-config.xml
java.lang.NumberFormatException: For input string: ""

2022-02-15 13:29:21,331 ERROR selenium.Http - Failed to get protocol output
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=96.0.4664.110)
Caused by: org.openqa.selenium.WebDriverException: unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=96.0.4664.110)
*** Element info: {Using=tag name, value=body}

2022-02-15 13:29:23,971 ERROR selenium.Http - Failed to get protocol output
java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":"body"}
  (Session info: headless chrome=96.0.4664.110)
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html

2022-02-15 13:29:23,972 INFO fetcher.FetcherThread - FetcherThread 71 fetch of http://ialab.com.ar/ failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchEle>
  (Session info: headless chrome=96.0.4664.110)
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html

2022-02-15 13:29:27,648 ERROR selenium.HttpWebClient - Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...

2022-02-15 13:32:42,713 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2022-02-15 13:33:23,664 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2022-02-15 13:36:23,347 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and >
2022-02-15 13:36:23,479 ERROR tika.TikaParser - Problem loading custom Tika configuration from tika-config.xml
java.lang.NumberFormatException: For input string: ""
2022-02-15 13:36:25,540 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and >
2022-02-15 13:36:25,540 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-bibtex-text-file
2022-02-15 13:36:25,542 WARN parse.ParseSegment - Error parsing: http://www.saiph.org/docs/loco.bibtex: failed(2,0): Can't retrieve Tika parser for mime-type application/>
2022-02-15 13:36:26,374 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/javascript

Regards,
Roseline

The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.

-----Original Message-----
From: Roseline Antai <roseline.an...@strath.ac.uk>
Sent: 13 January 2022 17:02
To: user@nutch.apache.org; Sebastian Nagel <wastl.na...@googlemail.com>
Subject: RE: Nutch not crawling all URLs

Thank you Sebastian. I will try these.

Kind regards,
Roseline

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
Sent: 13 January 2022 12:33
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system. Normally, you run the browser in headless mode without a graphical device (monitor) attached.
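A quick way to sanity-check the headless browser setup outside of Nutch is to confirm that the browser and chromedriver major versions match, since a mismatch is a common cause of WebDriver session errors. This is only a sketch; the `chromium` binary name and the `/usr/bin/chromedriver` path are assumptions, so adjust them to your installation:

```shell
# Sketch: check that the Chromium and chromedriver major versions match.
# Binary names and paths are assumptions -- adjust to your installation.
chrome_ver=$(chromium --product-version 2>/dev/null | cut -d. -f1)
driver_ver=$(/usr/bin/chromedriver --version 2>/dev/null | awk '{print $2}' | cut -d. -f1)
echo "chrome major: ${chrome_ver:-not found}, chromedriver major: ${driver_ver:-not found}"
if [ -n "$chrome_ver" ] && [ "$chrome_ver" = "$driver_ver" ]; then
  echo "versions match"
else
  echo "version mismatch (or binary missing) -- WebDriver sessions may fail"
fi
```

If this reports a mismatch, installing a chromedriver release that matches the installed browser is usually the fix before touching any Nutch configuration.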
> Is there some documentation or tutorial on this?

The README is probably the best documentation:
src/plugin/protocol-selenium/README.md
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether it works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText URL

Caveat: because browsers are updated frequently, you may need to use a recent driver version and possibly also upgrade the Selenium dependencies in Nutch. Let us know if you need help here.

> My use case is Text mining and Machine Learning classification. I'm
> indexing into Solr and then transferring the indexed data to MongoDB
> for further processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline: fetching, HTML parsing, extracting content fields, indexing. Nutch is able to perform all steps. But I'd agree that browser-based crawling isn't that easy to set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
>
> Thank you. I did enjoy the holiday. Hope you did too.
>
> I have had a look at the protocol-selenium plugin, but it was a bit difficult
> to understand. It appears it only works with Firefox. Does it work at all
> with Chrome? I was also not sure of what values to set for the properties. It
> seems you need to have some form of GUI to run it?
>
> Is there some documentation or tutorial on this?
> My guess is that some of the pages might not be crawling because of
> JavaScript. I might be wrong, but would want to test that.
>
> I think it would be quite good for my use case because I am trying to
> implement broad crawling.
>
> My use case is Text mining and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.
>
> Kind regards,
> Roseline
>
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
> Hi Roseline,
>
>> the mail below went to my junk folder and I didn't see it.
>
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays, but I want to emphasize that Nutch is a community
> project and in doubt it might take a few days until somebody finds the time
> to respond.
>
>> Could you confirm if you received all the urls I sent?
>
> I've tried a few URLs you sent but not all of them. And to figure out the
> reason why a site isn't crawled may take some time.
>
>> Another question I have about Nutch is if it has problems with
>> crawling javascript pages?
>
> By default Nutch does not execute Javascript.
>
> There is a protocol plugin (protocol-selenium) to fetch pages with a web
> browser between Nutch and the crawled sites. This way Javascript pages can be
> crawled for the price of some overhead in setting up the crawler and network
> traffic to fetch the page dependencies (CSS, Javascript, images).
>
>> I would ideally love to make the crawler work for my URLs rather than start
>> checking for other crawlers and waste all the work so far.
>
> Well, Nutch is for sure a good crawler. But as always: there are many other
> crawlers which might be better adapted to a specific use case.
>
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
>
> Best,
> Sebastian
>
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - https://notco.com/ - was not indexed, no. When I enabled
>> redirects, I was able to get a few pages, but they don't seem valid.
>>
>> Could you confirm if you received all the urls I sent?
>>
>> Another question I have about Nutch is if it has problems with crawling
>> javascript pages?
>>
>> I would ideally love to make the crawler work for my URLs rather than start
>> checking for other crawlers and waste all the work so far.
>>
>> Just adding again, this is what my nutch-site.xml looks like:
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file.
>> -->
>>
>> <configuration>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>Nutch Crawler</value>
>> </property>
>> <property>
>>   <name>http.agent.email</name>
>>   <value>datalake.ng at gmail d</value>
>> </property>
>> <property>
>>   <name>db.ignore.internal.links</name>
>>   <value>false</value>
>> </property>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>> </property>
>> <property>
>>   <name>parser.skip.truncated</name>
>>   <value>false</value>
>>   <description>Boolean value for whether we should skip parsing for
>>   truncated documents. By default this property is activated due to
>>   extremely high levels of CPU which parsing can sometimes take.
>>   </description>
>> </property>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>-1</value>
>>   <description>The maximum number of outlinks that we'll process for a page.
>>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>>   outlinks will be processed for a page; otherwise, all outlinks will be
>>   processed.
>>   </description>
>> </property>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http://
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>> <property>
>>   <name>db.ignore.external.links.mode</name>
>>   <value>byHost</value>
>> </property>
>> <property>
>>   <name>db.injector.overwrite</name>
>>   <value>true</value>
>> </property>
>> <property>
>>   <name>http.timeout</name>
>>   <value>50000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>> </configuration>
>>
>> Regards,
>> Roseline
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
>> Sent: 13 December 2021 17:35
>> To: user@nutch.apache.org
>> Subject: Re: Nutch not crawling all URLs
>>
>> CAUTION: This email originated outside the University. Check before clicking
>> links or attachments.
>>
>> Hi Roseline,
>>
>>> 5,36405,0,http://www.notco.com/
>>
>> What is the status for https://notco.com/ which is the final redirect
>> target?
>> Is the target page indexed?
>>
>> ~Sebastian
>>