Hi Sebastian,
I used the Firefox driver (headed and headless, with the same output).
I have now tried Chrome, but the Selenium driver does not match the browser's version:
Dec 21, 2024 11:29:50 AM org.openqa.selenium.devtools.CdpVersionFinder findNearestMatch
WARNING: Unable to find CDP implementation matching 131
Dec 21, 2024 11:29:50 AM org.openqa.selenium.chromium.ChromiumDriver lambda$new$5
WARNING: Unable to find version of CDP to use for 131.0.6778.204. You may need to include a dependency on a specific version of the CDP using something similar to `org.seleniumhq.selenium:selenium-devtools-v86:4.18.1` where the version ("v86") matches the version of the chromium-based browser you're using and the version number of the artifact is the same as Selenium's.
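For reference, here is a minimal sketch of how the artifact name from that warning would be derived for this browser. It only applies the naming pattern quoted in the warning; the class and method names are made up for illustration, and it does not check whether such an artifact is actually published for Selenium 4.18.1.

```java
public class CdpArtifact {
    // Extracts the major version from a full browser version string
    // such as "131.0.6778.204".
    static int majorVersion(String browserVersion) {
        int dot = browserVersion.indexOf('.');
        String major = dot < 0 ? browserVersion : browserVersion.substring(0, dot);
        return Integer.parseInt(major.trim());
    }

    // Builds Maven coordinates following the pattern quoted in the warning.
    // Assumption: this only formats the name; it does not verify that the
    // artifact exists for the given Selenium release.
    static String coordinates(String browserVersion, String seleniumVersion) {
        return "org.seleniumhq.selenium:selenium-devtools-v"
                + majorVersion(browserVersion) + ":" + seleniumVersion;
    }

    public static void main(String[] args) {
        // Values taken from the log output and the jar listing in this mail.
        System.out.println(coordinates("131.0.6778.204", "4.18.1"));
    }
}
```

If no artifact with that name is available for the Selenium version in use, the pattern alone will not help, and the Selenium version itself would probably have to change.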
The environment:
- Debian 12
crawler@debian:~/apache-nutch-1.20$ dpkg -l|awk /'openjdk|chromium|firefox/'
ii chromium 131.0.6778.204-1~deb12u1 amd64 web browser
ii chromium-common 131.0.6778.204-1~deb12u1 amd64 web browser - common resources used by the chromium packages
ii chromium-sandbox 131.0.6778.204-1~deb12u1 amd64 web browser - setuid security sandbox for chromium
ii firefox-esr 128.5.0esr-1~deb12u1 amd64 Mozilla Firefox web browser - Extended Support Release (ESR)
ii openjdk-17-jdk:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Development Kit (JDK)
ii openjdk-17-jdk-headless:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Development Kit (JDK) (headless)
ii openjdk-17-jre:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Java runtime, using Hotspot JIT
ii openjdk-17-jre-headless:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Java runtime, using Hotspot JIT (headless)
- Nutch 1.20
crawler@debian:~/apache-nutch-1.20$ ls -la plugins/lib-selenium/|awk '/java|fire|chrom/'
-rw-rw-r-- 1 crawler crawler 15248 Apr 9 2024 selenium-chrome-driver-4.18.1.jar
-rw-rw-r-- 1 crawler crawler 36726 Apr 9 2024 selenium-chromium-driver-4.18.1.jar
-rw-rw-r-- 1 crawler crawler 83279 Apr 9 2024 selenium-firefox-driver-4.18.1.jar
-rw-rw-r-- 1 crawler crawler 545 Apr 9 2024 selenium-java-4.18.1.jar
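Note that the awk filter above only matches "java", "fire" and "chrom", so any selenium-devtools jars in that directory would not show up in the listing. A rough sketch of how a list of jar names could be checked against the browser's major version follows. The file-name pattern is assumed from the usual Maven jar naming (artifact name, dash, version), and the devtools jar name in the example is hypothetical, not taken from my installation.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DevtoolsJarCheck {
    // Assumed file-name pattern: selenium-devtools-v<CDP major>-<Selenium version>.jar
    private static final Pattern DEVTOOLS_JAR =
            Pattern.compile("selenium-devtools-v(\\d+)-(.+)\\.jar");

    // Returns the CDP major version encoded in a devtools jar name, if any.
    static Optional<Integer> cdpVersion(String fileName) {
        Matcher m = DEVTOOLS_JAR.matcher(fileName);
        return m.matches()
                ? Optional.of(Integer.parseInt(m.group(1)))
                : Optional.empty();
    }

    // True if at least one jar name carries the given browser major version.
    static boolean supports(List<String> jarNames, int browserMajor) {
        return jarNames.stream()
                .map(DevtoolsJarCheck::cdpVersion)
                .anyMatch(v -> v.isPresent() && v.get() == browserMajor);
    }

    public static void main(String[] args) {
        List<String> jars = List.of(
                "selenium-chrome-driver-4.18.1.jar",   // from the listing above
                "selenium-devtools-v120-4.18.1.jar");  // hypothetical example name
        System.out.println(supports(jars, 131));
    }
}
```

With these example names the check reports no match for Chromium 131, which would be consistent with the warning.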
On Thu, Dec 19, 2024 at 10:53 PM Sebastian Nagel <[email protected]> wrote:
> Hi Peter,
>
> the best description for the Selenium plugin is the README.md [1].
>
> Otherwise, could you share which Selenium driver is used?
>
> Thanks,
> Sebastian
>
> [1] https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md
>
> On 12/17/24 21:07, Peter Viskup wrote:
> > I just can't get it working...
> > At first I got a Selenium timeout exception even
> > with libselenium.page.load.delay set. The solution was to increase the
> > value of page.load.delay, which defaults to 3.
> >
> > Then I got stuck with Selenium output that shows "You need to enable
> > JavaScript".
> >
> > I am running Nutch with the command:
> > ./bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
> > -Dselenium.enable.headless=true \
> > -Dlibselenium.page.load.delay=120 \
> > -Dpage.load.delay=120 \
> > -followRedirects -dumpText https://metais.slovensko.sk
> >
> > I went through the source code of the lib-selenium and protocol-selenium
> > plugins with no success.
> >
> > What else can I try to get such a page crawled?
> >
> > Peter
> >
>
>