Hi,

Continuing this thread: I tried the Selenium plugin as suggested below. I have copied my nutch-site.xml file below to show the parameters set for the Selenium plugin; I have removed most of the descriptions for brevity:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>Esid Crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>roselineantai at gmail dot com</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://esid.shinyapps.io/ESID/</value>
</property>
<property>
  <name>db.ignore.also.redirects</name>
  <value>false</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>30</value>
  <description>The default number of seconds between re-fetches of a page (30 days).</description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes take.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>100000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>selenium.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.take.screenshot</name>
  <value>false</value>
</property>
<property>
  <name>selenium.screenshot.location</name>
  <value></value>
</property>
<property>
  <name>selenium.hub.port</name>
  <value>4444</value>
  <description>Selenium Hub Location connection port</description>
</property>
<property>
  <name>selenium.hub.path</name>
  <value>/wd/hub</value>
  <description>Selenium Hub Location connection path</description>
</property>
<property>
  <name>selenium.hub.host</name>
  <value>localhost</value>
  <description>Selenium Hub Location connection host</description>
</property>
<property>
  <name>selenium.hub.protocol</name>
  <value>http</value>
  <description>Selenium Hub Location connection protocol</description>
</property>
<property>
  <name>selenium.grid.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.grid.binary</name>
  <value>/usr/bin/chromedriver</value>
</property>
<!-- lib-selenium configuration -->
<property>
  <name>libselenium.page.load.delay</name>
  <value>3</value>
</property>
<property>
  <name>webdriver.chrome.driver</name>
  <value>/root/chromedriver</value>
  <description>The path to the ChromeDriver binary</description>
</property>
<!-- headless options for Firefox and Chrome -->
<property>
  <name>selenium.enable.headless</name>
  <value>true</value>
  <description>A Boolean value representing the headless option for the Firefox and Chrome drivers</description>
</property>
</configuration>

When I tested the setup using this:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText URL

with some of the problematic URLs, they all came out well on the console. There were, however, quite a number of URLs identified as outlinks. But when I run the full crawl with this plugin, it appears to show some data in Solr, yet I have been unable to extract any data: it gives '0' as the count of what has been crawled, for all the URLs. This is quite worrying, because without the plugin I did manage to get data from about half of the URLs, so the performance is far worse than it should be. I'm also confused because testing some of the sites with the example I was given above works. Below is a sample of the errors I got from the log files.
Please have a look at them and let me know if there is a parameter I'm not setting properly:

2022-02-15 01:49:02,093 ERROR tika.TikaParser - Problem loading custom Tika configuration from tika-config.xml
java.lang.NumberFormatException: For input string: ""

2022-02-15 13:29:21,331 ERROR selenium.Http - Failed to get protocol output
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=96.0.4664.110)
Caused by: org.openqa.selenium.WebDriverException: unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=96.0.4664.110)
*** Element info: {Using=tag name, value=body}

2022-02-15 13:29:23,971 ERROR selenium.Http - Failed to get protocol output
java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":"body"}
  (Session info: headless chrome=96.0.4664.110)
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html

2022-02-15 13:29:23,972 INFO fetcher.FetcherThread - FetcherThread 71 fetch of http://ialab.com.ar/ failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchEle>
  (Session info: headless chrome=96.0.4664.110)
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html

2022-02-15 13:29:27,648 ERROR selenium.HttpWebClient - Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...

2022-02-15 13:32:42,713 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2022-02-15 13:33:23,664 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2022-02-15 13:36:23,347 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and >
2022-02-15 13:36:23,479 ERROR tika.TikaParser - Problem loading custom Tika configuration from tika-config.xml
java.lang.NumberFormatException: For input string: ""
2022-02-15 13:36:25,540 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and >
2022-02-15 13:36:25,540 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-bibtex-text-file
2022-02-15 13:36:25,542 WARN parse.ParseSegment - Error parsing: http://www.saiph.org/docs/loco.bibtex: failed(2,0): Can't retrieve Tika parser for mime-type application/>
2022-02-15 13:36:26,374 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/javascript

Regards,
Roseline

The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.

-----Original Message-----
From: Roseline Antai <roseline.an...@strath.ac.uk>
Sent: 13 January 2022 17:02
To: user@nutch.apache.org; Sebastian Nagel <wastl.na...@googlemail.com>
Subject: RE: Nutch not crawling all URLs

Thank you Sebastian. I will try these.

Kind regards,
Roseline

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
Sent: 13 January 2022 12:33
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system. Normally, you run the browser in headless mode without a graphical device (monitor) attached.
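A quick way to sanity-check the headless browser setup outside of Nutch is to confirm that the browser and chromedriver major versions match, since a mismatch is a common cause of WebDriver session errors. This is only a sketch; the `chromium` binary name and the `/usr/bin/chromedriver` path are assumptions, so adjust them to your installation:

```shell
# Sketch: check that the Chromium and chromedriver major versions match.
# Binary names and paths are assumptions -- adjust to your installation.
chrome_ver=$(chromium --product-version 2>/dev/null | cut -d. -f1)
driver_ver=$(/usr/bin/chromedriver --version 2>/dev/null | awk '{print $2}' | cut -d. -f1)
echo "chrome major: ${chrome_ver:-not found}, chromedriver major: ${driver_ver:-not found}"
if [ -n "$chrome_ver" ] && [ "$chrome_ver" = "$driver_ver" ]; then
  echo "versions match"
else
  echo "version mismatch (or binary missing) -- WebDriver sessions may fail"
fi
```

If this reports a mismatch, installing a chromedriver release that matches the installed browser is usually the fix before touching any Nutch configuration.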
> Is there some documentation or tutorial on this?

The README is probably the best documentation:
src/plugin/protocol-selenium/README.md
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether it works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText URL

Caveat: because browsers are updated frequently, you may need to use a recent driver version and possibly also upgrade the Selenium dependencies in Nutch. Let us know if you need help here.

> My use case is Text mining and Machine Learning classification. I'm
> indexing into Solr and then transferring the indexed data to MongoDB
> for further processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline: fetching, HTML parsing, extracting content fields, indexing. Nutch is able to perform all steps. But I'd agree that browser-based crawling isn't that easy to set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
>
> Thank you. I did enjoy the holiday. Hope you did too.
>
> I have had a look at the protocol-selenium plugin, but it was a bit difficult
> to understand. It appears it only works with Firefox. Does it work at all
> with Chrome? I was also not sure of what values to set for the properties. It
> seems you need to have some form of GUI to run it?
>
> Is there some documentation or tutorial on this?
> My guess is that some of the pages might not be crawling because of
> JavaScript. I might be wrong, but would want to test that.
>
> I think it would be quite good for my use case because I am trying to
> implement broad crawling.
>
> My use case is Text mining and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.
>
> Kind regards,
> Roseline
>
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
>
> Hi Roseline,
>
>> the mail below went to my junk folder and I didn't see it.
>
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays, but I want to emphasize that Nutch is a community
> project and in doubt it might take a few days until somebody finds the time
> to respond.
>
>> Could you confirm if you received all the urls I sent?
>
> I've tried a few URLs you sent but not all of them. And to figure out the
> reason why a site isn't crawled may take some time.
>
>> Another question I have about Nutch is if it has problems with
>> crawling javascript pages?
>
> By default Nutch does not execute Javascript.
>
> There is a protocol plugin (protocol-selenium) to fetch pages with a web
> browser between Nutch and the crawled sites. This way Javascript pages can be
> crawled for the price of some overhead in setting up the crawler and network
> traffic to fetch the page dependencies (CSS, Javascript, images).
>
>> I would ideally love to make the crawler work for my URLs rather than start
>> checking for other crawlers and waste all the work so far.
>
> Well, Nutch is for sure a good crawler. But as always: there are many other
> crawlers which might be better adapted to a specific use case.
>
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
>
> Best,
> Sebastian
>
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - https://notco.com/ - was not indexed, no. When I enabled
>> redirects, I was able to get a few pages, but they don't seem valid.
>>
>> Could you confirm if you received all the urls I sent?
>>
>> Another question I have about Nutch is if it has problems with crawling
>> javascript pages?
>>
>> I would ideally love to make the crawler work for my URLs rather than start
>> checking for other crawlers and waste all the work so far.
>>
>> Just adding again, this is what my nutch-site.xml looks like:
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file.
>> -->
>>
>> <configuration>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>Nutch Crawler</value>
>> </property>
>> <property>
>>   <name>http.agent.email</name>
>>   <value>datalake.ng at gmail d</value>
>> </property>
>> <property>
>>   <name>db.ignore.internal.links</name>
>>   <value>false</value>
>> </property>
>> <property>
>>   <name>db.ignore.external.links</name>
>>   <value>true</value>
>> </property>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
>> </property>
>> <property>
>>   <name>parser.skip.truncated</name>
>>   <value>false</value>
>>   <description>Boolean value for whether we should skip parsing for
>>   truncated documents. By default this property is activated due to
>>   extremely high levels of CPU which parsing can sometimes take.
>>   </description>
>> </property>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>-1</value>
>>   <description>The maximum number of outlinks that we'll process for a page.
>>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>>   outlinks will be processed for a page; otherwise, all outlinks will be
>>   processed.
>>   </description>
>> </property>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http://
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>> <property>
>>   <name>db.ignore.external.links.mode</name>
>>   <value>byHost</value>
>> </property>
>> <property>
>>   <name>db.injector.overwrite</name>
>>   <value>true</value>
>> </property>
>> <property>
>>   <name>http.timeout</name>
>>   <value>50000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>> </configuration>
>>
>> Regards,
>> Roseline
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
>> Sent: 13 December 2021 17:35
>> To: user@nutch.apache.org
>> Subject: Re: Nutch not crawling all URLs
>>
>> CAUTION: This email originated outside the University. Check before clicking
>> links or attachments.
>>
>> Hi Roseline,
>>
>>> 5,36405,0,http://www.notco.com/
>>
>> What is the status for https://notco.com/ which is the final redirect
>> target?
>> Is the target page indexed?
>>
>> ~Sebastian
>>