[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745191#comment-16745191
 ] 

Sebastian Nagel commented on NUTCH-2676:
----------------------------------------

Thanks, [~virt], the PR looks promising! Once it's done, please don't forget to 
apply the Eclipse code formatter and squash the commits.

??I have also added a possibility to ignore robots.txt file. Let me know if 
needed to push it.??

The robots.txt rules are checked outside the protocol implementation, before 
the page fetch is delegated to it. And there is already an option to whitelist 
hosts/domains (NUTCH-1927). So there should be no need for that.
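
For reference, a rough sketch of such a whitelist entry in conf/nutch-site.xml. 
I'm assuming here that the property is named `http.robot.rules.whitelist` 
(please verify against the nutch-default.xml of your version); the host name 
is only a placeholder:
{noformat}
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- placeholder host, replace with the hosts you are allowed to crawl this way -->
  <value>example.com</value>
</property>
{noformat}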

Regarding the redirects: if you want the fetcher to follow redirects 
immediately, you could simply adjust `http.redirect.max` (e.g., set it to 3).
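
To make that setting permanent, the property can be overridden in 
conf/nutch-site.xml. A minimal sketch, the value 3 being only an example:
{noformat}
<property>
  <name>http.redirect.max</name>
  <!-- example value: follow up to 3 redirects immediately -->
  <value>3</value>
</property>
{noformat}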

Btw., for quick testing you could just set the required parameters on the 
command line, e.g.:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
   -Dselenium.grid.binary=.../geckodriver \
   -Dselenium.enable.headless=true \
   -followRedirects \
   -dumpText https://nutch.apache.org
{noformat}
 

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2676
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2676
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Stas Batururimi
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
> * Selenium needs to be updated
>  * missing remote web driver for chrome 
>  * necessity to add headless mode for both remote WebDriverBase Firefox & 
> Chrome
>  * use case with Selenium grid using docker (1 hub docker container, several 
> nodes in different docker containers, Nutch in another docker container, 
> streaming to Apache Solr in docker container, that is at least 4 different 
> docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
