Reading nutch/src/plugin/lib-selenium/howto_upgrade_selenium.md (apache/nutch, master)
<https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/howto_upgrade_selenium.md>,
it seems the upgrade might not be straightforward.
I tried unpacking the selenium.dev driver build into plugins/lib-selenium and
updating plugins.xml with the newer versions, but now I am getting the error below.
It seems the Selenium build is looking for a different path.
Any thoughts on how to upgrade Selenium using their binary builds?

crawler@debian:~/apache-nutch-1.20$ ./bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.enable.headless=true \
  -Dlibselenium.page.load.delay=120 \
  -Dpage.load.delay=120 \
  -Dselenium.driver=chrome \
  -Dwebdriver.chrome.driver=plugins/lib-selenium/selenium-chrome-driver-4.27.0.jar \
  -followRedirects -dumpText https://metais.slovensko.sk
........
Fetch failed with protocol status: exception(16), lastModified=0:
java.lang.RuntimeException:
org.openqa.selenium.remote.NoSuchDriverException: Unable to obtain:
chromedriver, error Command failed with code: 65, executed: [--browser,
chrome, --browser-path, /root/chromedriver, --language-binding, java,
--output, json]
Browser path does not exist: /root/chromedriver
Build info: version: '4.27.0', revision: 'd6e718d'
System info: os.name: 'Linux', os.arch: 'amd64', os.version:
'6.1.0-28-amd64', java.version: '17.0.13'
Driver info: driver.version: ChromeDriver
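
The command above passes a Selenium client jar as webdriver.chrome.driver. In Selenium itself that system property is meant to hold the path to the chromedriver executable, so a quick sanity check of the value may help narrow things down. A minimal sketch, assuming a POSIX shell; check_driver_path is a hypothetical helper and not part of Nutch or Selenium:

```shell
#!/bin/sh
# Sketch: classify a candidate value for -Dwebdriver.chrome.driver.
# check_driver_path is a hypothetical helper, not part of Nutch or Selenium.
check_driver_path() {
  path=$1
  case "$path" in
    *.jar)
      # A Selenium client jar is a Java library, not the chromedriver binary.
      echo "jar"
      return 1
      ;;
  esac
  if [ ! -e "$path" ]; then
    # Nothing exists at this location.
    echo "missing"
    return 1
  fi
  if [ -d "$path" ] || [ ! -x "$path" ]; then
    # Exists, but is a directory or lacks the execute bit.
    echo "not-executable"
    return 1
  fi
  echo "ok"
}

# The value used in the command above is rejected because it is a jar.
check_driver_path plugins/lib-selenium/selenium-chrome-driver-4.27.0.jar || true
```

This only validates the path itself; it does not explain why the log still reports /root/chromedriver instead of the value given on the command line.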




On Sat, Dec 21, 2024 at 11:39 AM Peter Viskup <[email protected]>
wrote:

> Hi Sebastian,
> I used the Firefox driver (headed and headless, with the same output).
> Now I tried Chrome, but the Selenium driver version didn't match the browser's.
>
> Dec 21, 2024 11:29:50 AM org.openqa.selenium.devtools.CdpVersionFinder
> findNearestMatch
> WARNING: Unable to find CDP implementation matching 131
> Dec 21, 2024 11:29:50 AM org.openqa.selenium.chromium.ChromiumDriver
> lambda$new$5
> WARNING: Unable to find version of CDP to use for 131.0.6778.204. You may
> need to include a dependency on a specific version of the CDP using
> something similar to `org.seleniumhq.selenium:selenium-devtools-v86:4.18.1`
> where the version ("v86") matches the version of the chromium-based browser
> you're using and the version number of the artifact is the same as
> Selenium's.
>
>
>
> The environment:
>  - Debian 12
> crawler@debian:~/apache-nutch-1.20$ dpkg -l|awk /'openjdk|chromium|firefox/'
> ii  chromium                       131.0.6778.204-1~deb12u1  amd64  web browser
> ii  chromium-common                131.0.6778.204-1~deb12u1  amd64  web browser - common resources used by the chromium packages
> ii  chromium-sandbox               131.0.6778.204-1~deb12u1  amd64  web browser - setuid security sandbox for chromium
> ii  firefox-esr                    128.5.0esr-1~deb12u1      amd64  Mozilla Firefox web browser - Extended Support Release (ESR)
> ii  openjdk-17-jdk:amd64           17.0.13+11-2~deb12u1      amd64  OpenJDK Development Kit (JDK)
> ii  openjdk-17-jdk-headless:amd64  17.0.13+11-2~deb12u1      amd64  OpenJDK Development Kit (JDK) (headless)
> ii  openjdk-17-jre:amd64           17.0.13+11-2~deb12u1      amd64  OpenJDK Java runtime, using Hotspot JIT
> ii  openjdk-17-jre-headless:amd64  17.0.13+11-2~deb12u1      amd64  OpenJDK Java runtime, using Hotspot JIT (headless)
>
> Nutch 1.20
> crawler@debian:~/apache-nutch-1.20$ ls -la plugins/lib-selenium/|awk '/java|fire|chrom/'
> -rw-rw-r--  1 crawler crawler   15248 Apr  9  2024 selenium-chrome-driver-4.18.1.jar
> -rw-rw-r--  1 crawler crawler   36726 Apr  9  2024 selenium-chromium-driver-4.18.1.jar
> -rw-rw-r--  1 crawler crawler   83279 Apr  9  2024 selenium-firefox-driver-4.18.1.jar
> -rw-rw-r--  1 crawler crawler     545 Apr  9  2024 selenium-java-4.18.1.jar
>
> On Thu, Dec 19, 2024 at 10:53 PM Sebastian Nagel <[email protected]>
> wrote:
>
>> Hi Peter,
>>
>> the best description for the Selenium plugin is the README.md [1].
>>
>> Otherwise, could you share which Selenium driver is used?
>>
>> Thanks,
>> Sebastian
>>
>> [1] https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md
>>
>> On 12/17/24 21:07, Peter Viskup wrote:
>> > I am just not able to get it working...
>> > At first I got a Selenium timeout exception even
>> > with libselenium.page.load.delay set. The solution was to increase the
>> > value of page.load.delay, which has a default of 3.
>> >
>> > Then I got stuck on the Selenium output, which shows "You need to
>> > enable JavaScript".
>> >
>> > I am running Nutch with the command:
>> > ./bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
>> >   -Dselenium.enable.headless=true \
>> >   -Dlibselenium.page.load.delay=120 \
>> >   -Dpage.load.delay=120 \
>> >   -followRedirects -dumpText https://metais.slovensko.sk
>> >
>> > I went through the source code of the libselenium and selenium protocol
>> > plugins with no success.
>> >
>> > What else can I try to get such a page crawled?
>> >
>> > Peter
>> >
>>
>>
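
The CDP warning quoted above gives the artifact pattern org.seleniumhq.selenium:selenium-devtools-vNN:VERSION, where NN matches the browser's major version and VERSION is the same as Selenium's. A small sketch of deriving that coordinate from the installed versions, assuming a POSIX shell; cdp_artifact is a hypothetical helper, and whether such an artifact is actually published for a given Selenium release needs to be verified separately:

```shell
#!/bin/sh
# Sketch: build the devtools artifact coordinate named in the CDP warning.
# cdp_artifact is a hypothetical helper, not part of Nutch or Selenium.
cdp_artifact() {
  browser_version=$1    # e.g. 131.0.6778.204 (as reported by dpkg)
  selenium_version=$2   # e.g. 4.27.0 (must equal the Selenium jars in use)
  # Keep only the part before the first dot: the browser's major version.
  major=${browser_version%%.*}
  printf 'org.seleniumhq.selenium:selenium-devtools-v%s:%s\n' \
    "$major" "$selenium_version"
}

# prints org.seleniumhq.selenium:selenium-devtools-v131:4.27.0
cdp_artifact 131.0.6778.204 4.27.0
```

The helper only formats the coordinate; it does not check that the resulting artifact exists.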
