What I ended up doing is:

- Developed a service to fetch pages (used Node.js with Google Puppeteer, https://pptr.dev/, for fetching).
- Used browserless (https://www.browserless.io/) so that the fetch service uses live Chromium browser instances.
- Scaled all of this in a Kubernetes cluster so we can fetch many pages simultaneously.
- Developed a plugin for Nutch which uses the fetch service to fetch pages.
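For anyone wanting to try the same approach, here is a minimal sketch of what such a fetch service could look like in Node.js. It is not the actual service. The `/fetch?url=...` endpoint and the `BROWSERLESS_WS` and `FETCH_SERVICE_PORT` environment variables are hypothetical names, and it assumes the `puppeteer` package is installed and a browserless instance is reachable over WebSocket.

```javascript
// Minimal sketch of a page-fetch service; NOT the actual service described above.
// Hypothetical names: BROWSERLESS_WS, FETCH_SERVICE_PORT, and the /fetch endpoint.
const http = require('http');

// Assumed WebSocket endpoint of a browserless (remote Chromium) instance.
const BROWSERLESS_WS = process.env.BROWSERLESS_WS || 'ws://localhost:3000';

// Extract the target URL from a request path such as /fetch?url=https%3A%2F%2F...
// Returns the normalized URL, or null if it is missing or not http(s).
function parseTargetUrl(requestUrl) {
  const parsed = new URL(requestUrl, 'http://localhost');
  const target = parsed.searchParams.get('url');
  if (!target) return null;
  try {
    const u = new URL(target);
    return u.protocol === 'http:' || u.protocol === 'https:' ? u.href : null;
  } catch (err) {
    return null; // not a parseable absolute URL
  }
}

// Render the page in the remote browser and return the HTML after scripts ran.
async function fetchRendered(targetUrl) {
  const puppeteer = require('puppeteer'); // loaded lazily; must be installed
  const browser = await puppeteer.connect({ browserWSEndpoint: BROWSERLESS_WS });
  try {
    const page = await browser.newPage();
    try {
      // Wait until the network is idle so Javascript-loaded content is present.
      await page.goto(targetUrl, { waitUntil: 'networkidle0', timeout: 30000 });
      return await page.content();
    } finally {
      await page.close();
    }
  } finally {
    await browser.disconnect(); // leave the remote browser running for others
  }
}

const server = http.createServer(async (req, res) => {
  const target = parseTargetUrl(req.url);
  if (!target) {
    res.writeHead(400, { 'Content-Type': 'text/plain' });
    res.end('missing or invalid url parameter');
    return;
  }
  try {
    const html = await fetchRendered(target);
    res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
    res.end(html);
  } catch (err) {
    res.writeHead(502, { 'Content-Type': 'text/plain' });
    res.end('fetch failed: ' + err.message);
  }
});

// Start listening only when a port is configured.
if (process.env.FETCH_SERVICE_PORT) {
  server.listen(Number(process.env.FETCH_SERVICE_PORT));
}
```

Started with, say, `FETCH_SERVICE_PORT=8080`, a request to `http://localhost:8080/fetch?url=https%3A%2F%2Fwww.ich.org%2F` should return the rendered HTML. A Nutch protocol plugin would make the equivalent HTTP call instead of fetching the page itself.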
This is a better solution than using HTMLUnit or Selenium; compared to them, Puppeteer works great.

On Sun, Aug 13, 2023 at 2:53 PM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Raj,
>
> I see. Unfortunately turning on Javascript supporting protocol plugins such
> as Htmlunit or Selenium does not always solve the problem
>
> Maybe you can ask at the Selenium project about this. They are the experts
> on that particular problem.
>
> Regards,
> Markus
>
> On Tue, 1 Aug 2023 at 19:38, Raj Chidara <raj.chid...@ddismart.com> wrote:
>
> > Hello Markus
> > Now, I have removed all other protocol-* and given only
> > protocol-selenium. Now it crawled few pages. However, there is no content
> > read from pages. All pages are shown as only with text *Home*
> >
> > Thanks and Regards
> > Raj Chidara
> >
> > ---- On Mon, 30 Jan 2023 18:35:06 +0530 *Markus Jelsma
> > <markus.jel...@openindex.io>* wrote ---
> >
> > Yes, remove the other protocol-* plugins from the configuration. With all
> > three active it is not always determined which one is going to do the
> > work.
> >
> > On Mon, 30 Jan 2023 at 12:50, Raj Chidara <raj.chid...@ddismart.com> wrote:
> >
> > > Hello Markus
> > > Sorry for duplicate question. I added selenium plugin in
> > > conf/nutch-default.xml and included following
> > >
> > > <name>plugin.includes</name>
> > > <value>protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >
> > > Still the site is not crawling. Are there any additional steps to be
> > > followed for installation of selenium.
> > > Please suggest
> > >
> > > Thanks and Regards
> > >
> > > Raj Chidara
> > >
> > > ----- Original Message -----
> > > From: Markus Jelsma (markus.jel...@openindex.io)
> > > Date: 30-01-2023 16:26
> > > To: user@nutch.apache.org
> > > Subject: Re: Siet is not crawling
> > >
> > > Hello Raj,
> > >
> > > I think the same question about the same site was asked here some time
> > > ago.
> > > Anyway, this site loads its content via Javascript. You will need a
> > > protocol plugin that supports it, either protocol-htmlunit, or
> > > protocol-selenium, instead of protocol-http or any other.
> > >
> > > Change the configuration for plugin.includes, and it should work.
> > >
> > > Markus
> > >
> > > On Mon, 30 Jan 2023 at 10:39, Raj Chidara <raj.chid...@ddismart.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > Nutch is not able crawl this site. Are there any nutch configuration
> > > > changes required for this site?
> > > >
> > > > https://www.ich.org/
> > > >
> > > > Thanks and Regards
> > > >
> > > > Raj Chidara