What I ended up doing:
- Developed a service to fetch pages (Node.js with Google Puppeteer,
https://pptr.dev/, for fetching).
- Used browserless (https://www.browserless.io/) so the fetch service drives
live Chromium browser instances.
- Scaled all of this in a Kubernetes cluster so we can fetch many pages
simultaneously.
- Developed a plugin for Nutch that uses the fetch service to fetch pages.

This is a better solution than using HtmlUnit or Selenium; Puppeteer, by
comparison, works great.
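
For illustration only (this is not the actual plugin), the Nutch side of a
setup like the one described above could call the fetch service over plain
HTTP. The sketch below uses only the JDK's java.net.http client. The class
name RenderedFetchClient, the "/render" path, the "url" query parameter and
the service address are all assumptions made for the example; wiring this into
Nutch's protocol plugin interface is left out.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical client for a page-rendering fetch service; not part of Nutch. */
class RenderedFetchClient {
    private final URI serviceBase;   // assumed base address of the fetch service
    private final HttpClient http;

    RenderedFetchClient(String serviceBase) {
        this.serviceBase = URI.create(serviceBase);
        this.http = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    /** Builds the request URI; "/render" and "url" are assumed names. */
    public URI requestUri(String pageUrl) {
        // Percent-encode the target URL so it survives as a query parameter.
        String encoded = URLEncoder.encode(pageUrl, StandardCharsets.UTF_8);
        return serviceBase.resolve("/render?url=" + encoded);
    }

    /** Asks the service to render the page and returns the resulting HTML. */
    public String fetchRenderedHtml(String pageUrl)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(requestUri(pageUrl))
                .timeout(Duration.ofSeconds(60)) // rendering can be slow
                .GET()
                .build();
        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException(
                    "Fetch service returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A plugin would create one such client, for example
new RenderedFetchClient("http://localhost:3000"), and pass the HTML returned
by fetchRenderedHtml("https://www.ich.org/") on to the normal parsing step.
Keeping the browser work behind an HTTP service is what lets the rendering
side scale independently of the crawler.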


On Sun, Aug 13, 2023 at 2:53 PM Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello Raj,
>
> I see. Unfortunately, turning on JavaScript-supporting protocol plugins such
> as HtmlUnit or Selenium does not always solve the problem.
>
> Maybe you can ask at the Selenium project about this. They are the experts
> on that particular problem.
>
> Regards,
> Markus
>
> On Tue, Aug 1, 2023 at 19:38, Raj Chidara <raj.chid...@ddismart.com> wrote:
>
> > Hello Markus
> >   Now I have removed all other protocol-* plugins and kept only
> > protocol-selenium.  It now crawled a few pages.  However, no content is
> > read from the pages.  All pages are shown with only the text *Home*
> >
> > Thanks and Regards
> > Raj Chidara
> >
> >
> >
> > ---- On Mon, 30 Jan 2023 18:35:06 +0530 *Markus Jelsma
> > <markus.jel...@openindex.io>* wrote ---
> >
> > Yes, remove the other protocol-* plugins from the configuration. With all
> > three active it is not always determined which one is going to do the
> > work.
> >
> > On Mon, Jan 30, 2023 at 12:50, Raj Chidara <raj.chid...@ddismart.com> wrote:
> >
> >
> > >
> > > Hello Markus
> > > Sorry for the duplicate question. I added the selenium plugin in
> > > conf/nutch-default.xml and included the following:
> > >
> > > <name>plugin.includes</name>
> > > <value>protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >
> > > Still the site is not being crawled. Are there any additional steps to
> > > follow for installing selenium? Please suggest.
> > >
> > >
> > > Thanks and Regards
> > >
> > > Raj Chidara
> > >
> > > ----- Original Message -----
> > > From: Markus Jelsma (markus.jel...@openindex.io)
> > > Date: 30-01-2023 16:26
> > > To: user@nutch.apache.org
> > > Subject: Re: Siet is not crawling
> > >
> > > Hello Raj,
> > >
> > > I think the same question about the same site was asked here some time
> > > ago.
> > > Anyway, this site loads its content via Javascript. You will need a
> > > protocol plugin that supports it, either protocol-htmlunit, or
> > > protocol-selenium, instead of protocol-http or any other.
> > >
> > > Change the configuration for plugin.includes, and it should work.
> > >
> > > Markus
> > >
> > > On Mon, Jan 30, 2023 at 10:39, Raj Chidara <raj.chid...@ddismart.com> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > Nutch is not able to crawl this site. Are there any Nutch configuration
> > > > changes required for this site?
> > > >
> > > > https://www.ich.org/
> > > >
> > > >
> > > > Thanks and Regards
> > > >
> > > > Raj Chidara
> > > >
