[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635365#comment-14635365
 ] 

ASF GitHub Bot commented on NUTCH-2062:
---------------------------------------

GitHub user MJJoyce opened a pull request:

    https://github.com/apache/nutch/pull/46

    NUTCH-2062 - Interactive Selenium Plugin

    - Extend lib-selenium to allow for external interaction with the WebDriver.
    - Add Interactive Selenium plugin so users can create a Selenium Handler
      that does custom interaction with the page being fetched. Handlers are
      required to implement a simple interface and then can be included in
      crawls by adjusting the configuration.
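
The handler idea described above might look roughly like the following. This is only a sketch under assumptions: `Driver`, `PageHandler`, `PaginationHandler`, `RecordingDriver` and `runHandlers` are hypothetical names invented for illustration, and `Driver` is a stand-in for Selenium's WebDriver so the example runs without Selenium. The interface in the pull request itself may look different.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerSketch {

    // Stand-in for Selenium's WebDriver so the sketch runs without Selenium
    // on the classpath. Hypothetical, not part of lib-selenium.
    interface Driver {
        String currentUrl();
        void click(String cssSelector);
    }

    // Hypothetical handler contract; the interface in the pull request may
    // use different names and signatures.
    interface PageHandler {
        boolean shouldProcess(String url);
        void process(Driver driver);
    }

    // Example handler: clicks a "next" link a fixed number of times, as one
    // might when paging through a results table.
    static class PaginationHandler implements PageHandler {
        private final int pages;

        PaginationHandler(int pages) {
            this.pages = pages;
        }

        @Override
        public boolean shouldProcess(String url) {
            return url.contains("/results");
        }

        @Override
        public void process(Driver driver) {
            for (int i = 0; i < pages; i++) {
                driver.click("a.next");
            }
        }
    }

    // Test double that records clicks instead of driving a browser.
    static class RecordingDriver implements Driver {
        private final String url;
        final List<String> clicks = new ArrayList<>();

        RecordingDriver(String url) {
            this.url = url;
        }

        @Override
        public String currentUrl() {
            return url;
        }

        @Override
        public void click(String cssSelector) {
            clicks.add(cssSelector);
        }
    }

    // Runs every configured handler that accepts the driver's current URL.
    static void runHandlers(List<PageHandler> handlers, Driver driver) {
        for (PageHandler handler : handlers) {
            if (handler.shouldProcess(driver.currentUrl())) {
                handler.process(driver);
            }
        }
    }

    public static void main(String[] args) {
        RecordingDriver driver = new RecordingDriver("http://example.com/results");
        List<PageHandler> handlers = new ArrayList<>();
        handlers.add(new PaginationHandler(3));
        runHandlers(handlers, driver);
        System.out.println(driver.clicks.size() + " clicks recorded");
    }
}
```

Keeping the URL check separate from the interaction lets several handlers sit in the same configuration while each one only touches the pages it was written for.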


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MJJoyce/nutch NUTCH-2062

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/46.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #46
    
----
commit e1be2cf55b06d7e17e83ef74a53587807024adf4
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-20T16:00:44Z

    NUTCH-2062 - lib-selenium interaction extension
    
    - Add ability for lib-selenium to pass off driver handling to caller.
      getDriverForPage loads a WebDriver for a given page and returns it to
      the caller. getHTMLContent takes a WebDriver and returns the body
      content to the caller. These changes will allow a plugin to control
      the interaction with the WebDriver to get at the data required for a
      particular page.
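
The hand-off described in this commit can be pictured as three steps: load a driver, let the caller interact with it, then read the body. The sketch below reuses the method names `getDriverForPage` and `getHTMLContent` from the commit message, but their signatures here are assumptions, and `FakeDriver`, `clickNext` and `fetch` are hypothetical stand-ins so the example runs without Selenium.

```java
import java.util.function.Consumer;

public class DriverHandoffSketch {

    // Hypothetical stand-in for a Selenium WebDriver holding a page body.
    static class FakeDriver {
        final String url;
        final StringBuilder body = new StringBuilder("<p>page 1</p>");

        FakeDriver(String url) {
            this.url = url;
        }

        // Simulates an interaction that loads more content into the page.
        void clickNext() {
            body.append("<p>page 2</p>");
        }
    }

    // Assumed shape: loads the page and returns the driver to the caller.
    static FakeDriver getDriverForPage(String url) {
        return new FakeDriver(url);
    }

    // Assumed shape: returns the body content from an existing driver.
    static String getHTMLContent(FakeDriver driver) {
        return driver.body.toString();
    }

    // The caller controls what happens between loading and extraction.
    static String fetch(String url, Consumer<FakeDriver> interaction) {
        FakeDriver driver = getDriverForPage(url);
        interaction.accept(driver);
        return getHTMLContent(driver);
    }

    public static void main(String[] args) {
        System.out.println(fetch("http://example.com/results", d -> { }));
        System.out.println(fetch("http://example.com/results", FakeDriver::clickNext));
    }
}
```

Splitting loading from extraction is what gives a plugin room to act on the page before its content is captured.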

commit c12eb9ae88d91fd6f9e6dcebd6dc0dd04d12a9ae
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-20T17:17:49Z

    NUTCH-2062 - Add default lib-selenium timeout to config

commit 2df485b1c1a6c5b4df22882f709de4f4c1b6732a
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-20T17:18:46Z

    NUTCH-2062 - Add configurable wait to lib-selenium
    
    - You can now configure how long Selenium waits for a page to load by
      setting the libselenium.page.load.delay parameter in nutch-default.
      lib-selenium defaults to 3 seconds if the parameter isn't available.
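
The fallback behaviour described in this commit amounts to a configuration lookup with a default. The sketch below is an assumption-laden illustration: it uses `java.util.Properties` as a stand-in for the Nutch configuration object, and `PageLoadDelay` and `delaySeconds` are hypothetical names. Only the parameter name and the 3-second default come from the commit message.

```java
import java.util.Properties;

public class PageLoadDelay {

    // Parameter name taken from the commit message.
    static final String KEY = "libselenium.page.load.delay";

    // Fallback of 3 seconds, as described in the commit message.
    static final int DEFAULT_SECONDS = 3;

    // Returns the configured delay in seconds, or the default when the
    // parameter is absent. Properties is a stand-in for the real config.
    static int delaySeconds(Properties conf) {
        String raw = conf.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_SECONDS;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(delaySeconds(conf)); // parameter absent

        conf.setProperty(KEY, "10");
        System.out.println(delaySeconds(conf)); // parameter set
    }
}
```

Keeping the default in code means a crawl still works if the parameter is missing from the configuration file.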

commit 8737084752ff8e92c4c4eef668e6ce0ca612f7fb
Author: Michael Joyce <mltjo...@gmail.com>
Date:   2015-07-21T16:16:42Z

    Add interactive Selenium plugin

----


> Add Plugin for interacting with Selenium WebDriver
> --------------------------------------------------
>
>                 Key: NUTCH-2062
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2062
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>             Fix For: 1.11
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
