After reading howto_upgrade_selenium.md <https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/howto_upgrade_selenium.md>, it seems the upgrade might not be straightforward. I tried unpacking the selenium.dev driver build into plugins/lib-selenium and updating plugin.xml with the newer versions, but now I am getting the error below. It seems that the Selenium build is looking for a different path. Any thoughts on how to upgrade Selenium using their binary builds?

crawler@debian:~/apache-nutch-1.20$ ./bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
    -Dselenium.enable.headless=true \
    -Dlibselenium.page.load.delay=120 \
    -Dpage.load.delay=120 \
    -Dselenium.driver=chrome \
    -Dwebdriver.chrome.driver=plugins/lib-selenium/selenium-chrome-driver-4.27.0.jar \
    -followRedirects -dumpText https://metais.slovensko.sk
........
Fetch failed with protocol status: exception(16), lastModified=0: java.lang.RuntimeException: org.openqa.selenium.remote.NoSuchDriverException: Unable to obtain: chromedriver, error Command failed with code: 65, executed: [--browser, chrome, --browser-path, /root/chromedriver, --language-binding, java, --output, json]
Browser path does not exist: /root/chromedriver
Build info: version: '4.27.0', revision: 'd6e718d'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.0-28-amd64', java.version: '17.0.13'
Driver info: driver.version: ChromeDriver

On Sat, Dec 21, 2024 at 11:39 AM Peter Viskup <[email protected]> wrote:

> Hi Sebastian,
> I used the firefox driver (head and headless with the same output).
> Now tried chrome, but the Selenium Driver didn't match the browser's one.
>
> Dec 21, 2024 11:29:50 AM org.openqa.selenium.devtools.CdpVersionFinder findNearestMatch
> WARNING: Unable to find CDP implementation matching 131
> Dec 21, 2024 11:29:50 AM org.openqa.selenium.chromium.ChromiumDriver lambda$new$5
> WARNING: Unable to find version of CDP to use for 131.0.6778.204. You may
> need to include a dependency on a specific version of the CDP using
> something similar to `org.seleniumhq.selenium:selenium-devtools-v86:4.18.1`
> where the version ("v86") matches the version of the chromium-based browser
> you're using and the version number of the artifact is the same as
> Selenium's.
>
> The environment:
> - Debian 12
> crawler@debian:~/apache-nutch-1.20$ dpkg -l|awk /'openjdk|chromium|firefox/'
> ii  chromium                       131.0.6778.204-1~deb12u1  amd64  web browser
> ii  chromium-common                131.0.6778.204-1~deb12u1  amd64  web browser - common resources used by the chromium packages
> ii  chromium-sandbox               131.0.6778.204-1~deb12u1  amd64  web browser - setuid security sandbox for chromium
> ii  firefox-esr                    128.5.0esr-1~deb12u1      amd64  Mozilla Firefox web browser - Extended Support Release (ESR)
> ii  openjdk-17-jdk:amd64           17.0.13+11-2~deb12u1      amd64  OpenJDK Development Kit (JDK)
> ii  openjdk-17-jdk-headless:amd64  17.0.13+11-2~deb12u1      amd64  OpenJDK Development Kit (JDK) (headless)
> ii  openjdk-17-jre:amd64           17.0.13+11-2~deb12u1      amd64  OpenJDK Java runtime, using Hotspot JIT
> ii  openjdk-17-jre-headless:amd64  17.0.13+11-2~deb12u1      amd64  OpenJDK Java runtime, using Hotspot JIT (headless)
>
> Nutch 1.20
> crawler@debian:~/apache-nutch-1.20$ ls -la plugins/lib-selenium/|awk '/java|fire|chrom/'
> -rw-rw-r-- 1 crawler crawler 15248 Apr  9  2024 selenium-chrome-driver-4.18.1.jar
> -rw-rw-r-- 1 crawler crawler 36726 Apr  9  2024 selenium-chromium-driver-4.18.1.jar
> -rw-rw-r-- 1 crawler crawler 83279 Apr  9  2024 selenium-firefox-driver-4.18.1.jar
> -rw-rw-r-- 1 crawler crawler   545 Apr  9  2024 selenium-java-4.18.1.jar
>
> On Thu, Dec 19, 2024 at 10:53 PM Sebastian Nagel <[email protected]> wrote:
>
>> Hi Peter,
>>
>> the best description for the Selenium plugin is the README.md [1].
>>
>> Otherwise, could you share which Selenium driver is used?
>>
>> Thanks,
>> Sebastian
>>
>> [1] https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md
>>
>> On 12/17/24 21:07, Peter Viskup wrote:
>> > I am just not able to get it working...
>> > At first I got a Selenium timeout exception even
>> > with libselenium.page.load.delay set. The solution was to increase the
>> > value of page.load.delay, which defaults to 3.
>> >
>> > Then I got stuck on the Selenium output, which shows "You need to enable
>> > JavaScript".
>> >
>> > I am running Nutch with the command:
>> > ./bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
>> >     -Dselenium.enable.headless=true \
>> >     -Dlibselenium.page.load.delay=120 \
>> >     -Dpage.load.delay=120 \
>> >     -followRedirects -dumpText https://metais.slovensko.sk
>> >
>> > I went through the source code of the lib-selenium and protocol-selenium
>> > plugins with no success.
>> >
>> > What else can I try to get such a page crawled?
>> >
>> > Peter

