[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635365#comment-14635365 ]
ASF GitHub Bot commented on NUTCH-2062:
---------------------------------------

GitHub user MJJoyce opened a pull request:

    https://github.com/apache/nutch/pull/46

    NUTCH-2062 - Interactive Selenium Plugin

    - Extend lib-selenium to allow for external interaction with the
      WebDriver.
    - Add Interactive Selenium plugin so users can create a Selenium
      Handler that does custom interaction with the page being fetched.
      Handlers are required to implement a simple interface and then can
      be included in crawls by adjusting the configuration.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MJJoyce/nutch NUTCH-2062

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/46.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #46

----
commit e1be2cf55b06d7e17e83ef74a53587807024adf4
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-20T16:00:44Z

    NUTCH-2062 - lib-selenium interaction extension

    - Add ability for lib-selenium to pass off driver handling to caller.
      getDriverForPage loads a WebDriver for a given page and returns it
      to the caller. getHTMLContent takes a WebDriver and returns the
      body content to the caller. These changes will allow a plugin to
      control the interaction with the WebDriver to get at the data
      required for a particular page.

commit c12eb9ae88d91fd6f9e6dcebd6dc0dd04d12a9ae
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-20T17:17:49Z

    NUTCH-2062 - Add default lib-selenium timeout to config

commit 2df485b1c1a6c5b4df22882f709de4f4c1b6732a
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-20T17:18:46Z

    NUTCH-2062 - Add configurable wait to lib-selenium

    - You can now configure the delay that Selenium waits for a page to
      load by configuring the libselenium.page.load.delay parameter in
      nutch-default. The setting defaults to 3 seconds in lib-selenium
      if the parameter isn't available.

commit 8737084752ff8e92c4c4eef668e6ce0ca612f7fb
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-21T16:16:42Z

    Add interactive Selenium plugin

----

> Add Plugin for interacting with Selenium WebDriver
> --------------------------------------------------
>
>                 Key: NUTCH-2062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2062
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>             Fix For: 1.11
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically
> load content. However, I've run into use cases where I need to actively
> interact with a page in Selenium before it becomes useful. For instance, I
> may need to paginate through a table to get all results that I'm interested
> in. This plugin will handle that use case.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
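
The configurable wait described in the third commit (read libselenium.page.load.delay, fall back to 3 seconds when it is absent) could look roughly like the sketch below. This is not the code from the pull request: java.util.Properties stands in for the crawler's configuration object so the example runs on its own, and the class and method names (PageLoadDelay, pageLoadDelaySeconds) are hypothetical. Only the property name and the 3-second default come from the commit message.

```java
import java.util.Properties;

public class PageLoadDelay {
    // Property name taken from the commit message.
    static final String KEY = "libselenium.page.load.delay";
    // Fallback stated in the commit message: 3 seconds.
    static final int DEFAULT_SECONDS = 3;

    // Hypothetical helper: returns the configured delay in seconds,
    // or the default when the property is missing or blank.
    static int pageLoadDelaySeconds(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_SECONDS;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property not set: falls back to the default.
        System.out.println(pageLoadDelaySeconds(conf));

        // Property set explicitly: the configured value wins.
        conf.setProperty(KEY, "5");
        System.out.println(pageLoadDelaySeconds(conf));
    }
}
```

Keeping the default in code as well as in the configuration file means a crawl still behaves sensibly when an older or trimmed configuration omits the parameter, which matches the fallback the commit message describes.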