[jira] [Work started] (NUTCH-2246) Refactor /seed endpoint for backward compatibility
[ https://issues.apache.org/jira/browse/NUTCH-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-2246 started by Sujen Shah.
---
> Refactor /seed endpoint for backward compatibility
> ---
>
> Key: NUTCH-2246
> URL: https://issues.apache.org/jira/browse/NUTCH-2246
> Project: Nutch
> Issue Type: Sub-task
> Components: REST_api
> Affects Versions: 1.12
> Reporter: Sujen Shah
> Assignee: Sujen Shah
> Priority: Minor
> Labels: memex
> Fix For: 1.13
>
>
> Currently the seed endpoint allows you to create a seed list by providing a
> list of URLs passed as an argument.
> After the first refactor (https://issues.apache.org/jira/browse/NUTCH-2090),
> users could no longer provide a physical path to the seed list.
> Nutch should give both options to the user.
> Additionally, once a seed list is created by providing a list of URLs (not a
> physical file), Nutch should store it like it does for the configurations.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
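The two input modes requested in NUTCH-2246 (an inline list of URLs, or a path to an existing seed list) could be modeled roughly as below. This is only a sketch under stated assumptions: `SeedRequestSketch`, its fields, and the `seed.txt` storage layout are hypothetical names chosen for illustration and are not taken from the Nutch REST code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical request body for a /seed call; not the actual Nutch class. */
final class SeedRequestSketch {
    private final List<String> urls;   // inline seed URLs, may be empty
    private final String seedFilePath; // path to an existing seed list, may be null

    SeedRequestSketch(List<String> urls, String seedFilePath) {
        this.urls = urls == null
            ? Collections.<String>emptyList()
            : new ArrayList<>(urls);
        this.seedFilePath = seedFilePath;
    }

    /** True when the caller pointed at an existing seed list on disk. */
    boolean usesExistingFile() {
        return seedFilePath != null && !seedFilePath.trim().isEmpty();
    }

    /**
     * Resolves where the seed list lives. An existing path is returned as is;
     * otherwise a location under storageDir is derived from the seed name
     * (assumed layout), which is where the URLs would be written.
     */
    Path resolve(Path storageDir, String seedName) {
        if (usesExistingFile()) {
            return Paths.get(seedFilePath);
        }
        if (urls.isEmpty()) {
            throw new IllegalArgumentException(
                "Provide either seed URLs or a seed file path");
        }
        return storageDir.resolve(seedName).resolve("seed.txt");
    }

    /** Seed file content for the inline-URL case: one URL per line. */
    String toSeedFileContent() {
        return String.join("\n", urls) + "\n";
    }
}
```

Preferring the explicit path when both are supplied is an arbitrary choice in this sketch; the issue itself only asks that both options be available.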
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404613#comment-15404613 ]

Sujen Shah commented on NUTCH-2132:
---
Updated PR and cleaned the commit log.
One issue I have not been able to tackle yet is adding dependencies in the plugin. When I add the amqp-client library as a dependency in the main ivy.xml it works fine, but when I remove it from there and put it in the plugin-specific ivy.xml, it does not work. The jar is not found on the classpath at runtime and a NoClassDefFoundError is thrown, even though an amqp-client jar exists in the runtime/local/plugin/publish-rabbitmq directory.
Any suggestions on how to resolve this?

> Publisher/Subscriber model for Nutch to emit events
> ---
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.13
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events
> (e.g., fetcher events like fetch-start, fetch-end, and a fetch report which
> may contain data such as the outlinks of the currently fetched URL, score, etc.).
> A consumer of this functionality could use this data to generate real-time
> visualizations and statistics of the crawl without having to wait for
> the fetch round to finish.
> The REST API could contain an endpoint which would respond with a URL to
> which a client could subscribe to get the fetcher events.
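To make the event types named in the issue description more concrete, a fetcher event could be represented along these lines. This is a minimal sketch: the class `FetcherEventSketch`, the enum constants, and the `with` helper are hypothetical and do not come from the patch or the PR.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical event object; names are illustrative, not from the PR. */
final class FetcherEventSketch {
    /** Event kinds mentioned in the issue description. */
    enum Type { FETCH_START, FETCH_END, FETCH_REPORT }

    private final Type type;
    private final String url;
    private final long timestampMillis;
    // Optional payload such as outlinks or score; insertion order is kept.
    private final Map<String, Object> data = new LinkedHashMap<>();

    FetcherEventSketch(Type type, String url, long timestampMillis) {
        this.type = type;
        this.url = url;
        this.timestampMillis = timestampMillis;
    }

    /** Attaches one payload entry and returns this event for chaining. */
    FetcherEventSketch with(String key, Object value) {
        data.put(key, value);
        return this;
    }

    Type getType() { return type; }
    String getUrl() { return url; }
    long getTimestampMillis() { return timestampMillis; }
    Map<String, Object> getData() { return Collections.unmodifiableMap(data); }

    @Override
    public String toString() {
        return type + " " + url + " " + data;
    }
}
```

A fetch report for one URL might then be built as `new FetcherEventSketch(FetcherEventSketch.Type.FETCH_REPORT, url, System.currentTimeMillis()).with("score", 1.5)` before being handed to whichever queue backend is configured.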
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404523#comment-15404523 ]

Sujen Shah commented on NUTCH-2132:
---
Okay, got it working. Turns out I had forgotten to add it in plugin.includes and hence no logs or exceptions. Will update the PR.
Thanks!
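The silent failure described above fits how plugin.includes behaves as far as I understand it: the property holds a regular expression, and a plugin whose id does not match is simply never activated, so nothing is logged about it. The sketch below imitates that filtering with plain `java.util.regex`; it is not the Nutch PluginManager code, the `activated` helper is hypothetical, and the include patterns are shortened examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative stand-in for include filtering; not Nutch's own code. */
final class PluginIncludesSketch {

    /** Returns the plugin ids that fully match the includes regex. */
    static List<String> activated(String includesRegex, List<String> discoveredIds) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> active = new ArrayList<>();
        for (String id : discoveredIds) {
            // Whole-id match: a partial match is not enough.
            if (includes.matcher(id).matches()) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        List<String> discovered =
            Arrays.asList("protocol-http", "parse-html", "publish-rabbitmq");

        // Pattern without the publisher: the plugin is skipped silently.
        System.out.println(
            activated("protocol-http|parse-(html|tika)", discovered));
        // prints [protocol-http, parse-html]

        // Pattern extended with the publisher plugin id.
        System.out.println(
            activated("protocol-http|parse-(html|tika)|publish-rabbitmq", discovered));
        // prints [protocol-http, parse-html, publish-rabbitmq]
    }
}
```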
[jira] [Comment Edited] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404414#comment-15404414 ]

Sujen Shah edited comment on NUTCH-2132 at 8/2/16 5:21 PM:
---
It does not throw any exceptions, but when I check the number of plugins that are loaded, it is 0. So I don't know what is going wrong.
This is what I am using to load the plugins - https://github.com/sujen1412/nutch/blob/2c484ec4789c84f7bf9e592e15c96cf788ef5967/src/java/org/apache/nutch/publisher/NutchPublishers.java#L33-L34

was (Author: sujenshah):
It does not throw any exceptions, but when I check the number of plugins that are loaded, it is 0.
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404414#comment-15404414 ]

Sujen Shah commented on NUTCH-2132:
---
It does not throw any exceptions, but when I check the number of plugins that are loaded, it is 0.
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404363#comment-15404363 ]

Chris A. Mattmann commented on NUTCH-2132:
---
Sujen, what comes back - is success or an exception printed?
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404198#comment-15404198 ]

Sujen Shah commented on NUTCH-2132:
---
Hi Everyone,
I have created an initial PR for this: https://github.com/apache/nutch/pull/138.

Things I have changed:
1. Removed the hard dependency on RabbitMQ
2. Created an interface for a new plugin extension
3. Moved the RabbitMQ code into a new plugin

I need help getting the plugin working. I registered the new plugin extension and tried to develop the publisher plugin in the same way the scoring filters work. This should allow more than one publisher queue implementation to work together. Then I registered the plugin developed for RabbitMQ. The code builds and the fetcher runs fine, but the publisher does not work correctly.
The issue I face is in getting the RabbitMQ plugin loaded in the NutchPublishers class. The setConfig() method is not able to load any plugins.
Link to code - https://github.com/sujen1412/nutch/blob/2c484ec4789c84f7bf9e592e15c96cf788ef5967/src/java/org/apache/nutch/publisher/NutchPublishers.java#L38-L54
What am I missing here?

Thanks for the help :)
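The design described in this comment (one extension interface, several queue backends behind it, and an aggregator that forwards events to all of them) can be sketched without any Nutch classes as follows. Everything here is an assumption made for illustration: `PublisherSketch`, `PublishersSketch`, `InMemoryPublisherSketch`, and the `publisher.enabled` key are hypothetical and are not the interface or the NutchPublishers class from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical extension interface; each queue backend would implement it. */
interface PublisherSketch {
    /** Prepares the backend; returns false if it cannot be used. */
    boolean setConfig(Map<String, String> conf);

    /** Sends one event to the backend. */
    void publish(String event);
}

/** Fans each event out to every publisher that configured successfully. */
final class PublishersSketch {
    private final List<PublisherSketch> active = new ArrayList<>();

    PublishersSketch(List<PublisherSketch> loaded, Map<String, String> conf) {
        for (PublisherSketch p : loaded) {
            if (p.setConfig(conf)) {
                active.add(p);
            }
        }
    }

    /** Zero here means no backend was loaded, so publish() is a no-op. */
    int activeCount() {
        return active.size();
    }

    void publish(String event) {
        for (PublisherSketch p : active) {
            p.publish(event);
        }
    }
}

/** Stand-in backend that keeps events in memory instead of a real queue. */
final class InMemoryPublisherSketch implements PublisherSketch {
    final List<String> received = new ArrayList<>();

    @Override
    public boolean setConfig(Map<String, String> conf) {
        // "publisher.enabled" is a made-up key for this sketch.
        return "true".equals(conf.get("publisher.enabled"));
    }

    @Override
    public void publish(String event) {
        received.add(event);
    }
}
```

Exposing something like `activeCount()` makes the "0 plugins loaded" situation easy to detect and log at startup, rather than discovering it only because no events arrive.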
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404179#comment-15404179 ]

ASF GitHub Bot commented on NUTCH-2132:
---
GitHub user sujen1412 opened a pull request:

    https://github.com/apache/nutch/pull/138

    Fix for NUTCH-2132: Publisher/Subscriber model for Nutch to emit events

    This PR is still in progress and needs a review to get the plugin system working. It is not ready to commit as of yet.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujen1412/nutch NUTCH-2132

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #138

commit 5a13301a7808d852b88d8724c3c3b7783fa9d2be
Author: Sujen Shah
Date: 2015-10-02T08:03:35Z

    Added dependency for RabbitMQ

commit 3029ef4055834b72d63e7fc516eca448c2efe32a
Author: Sujen Shah
Date: 2015-10-02T08:04:39Z

    Code for FetcherThreadPublisher

commit ebfd7728e650cc7648a3939d3985826eefde93f3
Author: Sujen Shah
Date: 2015-10-02T08:05:11Z

    Added property descriptions in nutch-default.xml

commit 445fcc2d766ddef7cc36783ebcaecb552e4f2819
Author: Sujen Shah
Date: 2015-10-02T08:28:52Z

    Added support for queue routing key

commit 44498308634dda99f543d5e18724ad8cfeb16343
Author: Sujen Shah
Date: 2015-10-19T08:44:29Z

    Added properties to make publisher optional in nutch-default.xml

commit ad88c94fc274576aacaa2c17b1f55a087f7a04f9
Author: Sujen Shah
Date: 2015-10-27T02:40:45Z

    Added routingkey support

commit e380de803c8c129f0dfb7d8c31a8596b4ceae8bf
Author: Sujen Shah
Date: 2015-10-29T20:12:55Z

    Better exception handling when RMQ server is down

commit e4f5e13cc5675f9e7d37ebda39bf230c08baf4b8
Author: Sujen Shah
Date: 2016-08-02T15:01:48Z

    Created plugin system for pub/sub implementation in Nutch

commit 2c484ec4789c84f7bf9e592e15c96cf788ef5967
Author: Sujen Shah
Date: 2016-08-02T15:12:28Z

    Removed Rabbitmq dependency from ivy.xml and remove author tags
[GitHub] nutch pull request #138: Fix for NUTCH-2132: Publisher/Subscriber model for ...
GitHub user sujen1412 opened a pull request:

    https://github.com/apache/nutch/pull/138

    Fix for NUTCH-2132: Publisher/Subscriber model for Nutch to emit events

    This PR is still in progress and needs a review to get the plugin system working. It is not ready to commit as of yet.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujen1412/nutch NUTCH-2132

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #138

commit 5a13301a7808d852b88d8724c3c3b7783fa9d2be
Author: Sujen Shah
Date: 2015-10-02T08:03:35Z

    Added dependency for RabbitMQ

commit 3029ef4055834b72d63e7fc516eca448c2efe32a
Author: Sujen Shah
Date: 2015-10-02T08:04:39Z

    Code for FetcherThreadPublisher

commit ebfd7728e650cc7648a3939d3985826eefde93f3
Author: Sujen Shah
Date: 2015-10-02T08:05:11Z

    Added property descriptions in nutch-default.xml

commit 445fcc2d766ddef7cc36783ebcaecb552e4f2819
Author: Sujen Shah
Date: 2015-10-02T08:28:52Z

    Added support for queue routing key

commit 44498308634dda99f543d5e18724ad8cfeb16343
Author: Sujen Shah
Date: 2015-10-19T08:44:29Z

    Added properties to make publisher optional in nutch-default.xml

commit ad88c94fc274576aacaa2c17b1f55a087f7a04f9
Author: Sujen Shah
Date: 2015-10-27T02:40:45Z

    Added routingkey support

commit e380de803c8c129f0dfb7d8c31a8596b4ceae8bf
Author: Sujen Shah
Date: 2015-10-29T20:12:55Z

    Better exception handling when RMQ server is down

commit e4f5e13cc5675f9e7d37ebda39bf230c08baf4b8
Author: Sujen Shah
Date: 2016-08-02T15:01:48Z

    Created plugin system for pub/sub implementation in Nutch

commit 2c484ec4789c84f7bf9e592e15c96cf788ef5967
Author: Sujen Shah
Date: 2016-08-02T15:12:28Z

    Removed Rabbitmq dependency from ivy.xml and remove author tags