[jira] [Work started] (NUTCH-2246) Refactor /seed endpoint for backward compatibility

2016-08-02 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2246 started by Sujen Shah.
-
> Refactor /seed endpoint for backward compatibility
> --
>
> Key: NUTCH-2246
> URL: https://issues.apache.org/jira/browse/NUTCH-2246
> Project: Nutch
> Issue Type: Sub-task
> Components: REST_api
> Affects Versions: 1.12
> Reporter: Sujen Shah
> Assignee: Sujen Shah
> Priority: Minor
> Labels: memex
> Fix For: 1.13
>
>
> Currently the seed endpoint lets you create a seed list by passing a list of 
> URLs as an argument. 
> After the first refactor (https://issues.apache.org/jira/browse/NUTCH-2090), 
> the user could no longer provide a physical path to the seed list. 
> Nutch should offer both options to the user.
> Additionally, once a seed list is created from a list of URLs (rather than 
> a physical file), Nutch should store it as it does the configurations. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404613#comment-15404613
 ] 

Sujen Shah commented on NUTCH-2132:
---

Updated PR and cleaned the commit log. 

One issue I have not been able to tackle yet is adding dependencies in the 
plugin. When I add the amqp-client library as a dependency in the main ivy.xml, 
it works fine, but when I remove it from there and put it in the plugin-specific 
ivy.xml, it does not work. The jar cannot be found on the classpath at 
runtime, and a NoClassDefFoundError is thrown, even though an amqp-client 
jar exists in the runtime/local/plugin/publish-rabbitmq directory. 

Any suggestions on how to resolve this? 
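
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: as far as I know, a Nutch plugin only gets the jars that are listed in the runtime section of its own plugin.xml on its class path. A jar that Ivy copies into the plugin directory but that is not declared there could therefore still end in a NoClassDefFoundError. A sketch of such a declaration follows; the jar names and the x.y.z version are placeholders and would need to match the files actually present in the plugin directory.

```xml
<runtime>
  <!-- The plugin's own jar (name assumed to follow the plugin id) -->
  <library name="publish-rabbitmq.jar">
    <export name="*"/>
  </library>
  <!-- Third-party dependency resolved through the plugin-specific ivy.xml;
       x.y.z is a placeholder for the version Ivy actually resolves -->
  <library name="amqp-client-x.y.z.jar"/>
</runtime>
```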

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.13
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. 
> Fetcher events like fetch-start, fetch-end, or a fetch report which may contain 
> data like the outlinks of the currently fetched URL, score, etc.). 
> A consumer of this functionality could use this data to generate real-time 
> visualizations and crawl statistics without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events. 
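
The idea in the description can be sketched with a small in-memory event bus. This is a simplified stand-in and not the code from the patch or the PR: FetcherEventBusSketch, FetcherEvent and their fields are hypothetical names, and a real setup would sit on top of a message queue such as RabbitMQ rather than an in-process list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a fetcher event publisher.
class FetcherEventBusSketch {

  // A minimal event: the event type (e.g. fetch-start, fetch-end) and the URL.
  static final class FetcherEvent {
    final String type;
    final String url;

    FetcherEvent(String type, String url) {
      this.type = type;
      this.url = url;
    }
  }

  // Copy-on-write so that subscribers can be added while events are published.
  private final List<Consumer<FetcherEvent>> subscribers = new CopyOnWriteArrayList<>();

  void subscribe(Consumer<FetcherEvent> subscriber) {
    subscribers.add(subscriber);
  }

  // Delivers the event to every subscriber as soon as it is emitted,
  // so a consumer does not wait for the fetch round to finish.
  void publish(FetcherEvent event) {
    for (Consumer<FetcherEvent> subscriber : subscribers) {
      subscriber.accept(event);
    }
  }

  public static void main(String[] args) {
    FetcherEventBusSketch bus = new FetcherEventBusSketch();
    List<String> seen = new ArrayList<>();
    bus.subscribe(e -> seen.add(e.type + " " + e.url));
    bus.publish(new FetcherEvent("fetch-start", "http://nutch.apache.org/"));
    bus.publish(new FetcherEvent("fetch-end", "http://nutch.apache.org/"));
    System.out.println(seen);
  }
}
```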





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404523#comment-15404523
 ] 

Sujen Shah commented on NUTCH-2132:
---

Okay, got it working. It turns out I had forgotten to add the plugin to 
plugin.includes, which is why there were no logs or exceptions. 
I will update the PR. 

Thanks!
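
For readers hitting the same silent failure, the effect can be illustrated as follows. The sketch assumes that plugin.includes is a regular expression that must match the whole plugin id, which is my understanding of how Nutch filters plugins; the regex value below is illustrative only, PluginIncludesCheck is a hypothetical helper, and publish-rabbitmq is the plugin id implied by the directory name mentioned in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks whether a plugin id would pass a
// plugin.includes-style regular expression (full match required).
class PluginIncludesCheck {

  static boolean isIncluded(String pluginIncludes, String pluginId) {
    return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
  }

  public static void main(String[] args) {
    // Illustrative value only, not the actual default from nutch-default.xml.
    String before = "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)";
    String after = before + "|publish-rabbitmq";

    System.out.println(isIncluded(before, "publish-rabbitmq")); // false: plugin is skipped
    System.out.println(isIncluded(after, "publish-rabbitmq"));  // true: plugin is loaded
  }
}
```

A plugin filtered out this way is never instantiated, which would be consistent with seeing zero loaded plugins and no exception.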



[jira] [Comment Edited] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404414#comment-15404414
 ] 

Sujen Shah edited comment on NUTCH-2132 at 8/2/16 5:21 PM:
---

It does not throw any exceptions, but when I check the number of plugins that 
are loaded, it is 0. So I don't know what is going wrong. 

This is what I am using to load the plugins: 
https://github.com/sujen1412/nutch/blob/2c484ec4789c84f7bf9e592e15c96cf788ef5967/src/java/org/apache/nutch/publisher/NutchPublishers.java#L33-L34


was (Author: sujenshah):
It does not throw any exceptions, but when I check the number of plugins that 
are loaded, it is 0. 



[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404414#comment-15404414
 ] 

Sujen Shah commented on NUTCH-2132:
---

It does not throw any exceptions, but when I check the number of plugins that 
are loaded, it is 0. 



[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404363#comment-15404363
 ] 

Chris A. Mattmann commented on NUTCH-2132:
--

Sujen, what comes back? Is success or an exception printed?



[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404198#comment-15404198
 ] 

Sujen Shah commented on NUTCH-2132:
---

Hi everyone, 
I have created an initial PR for this: https://github.com/apache/nutch/pull/138 
Things I have changed: 
1. Removed the hard dependency on RabbitMQ 
2. Created an interface for a new plugin extension 
3. Moved the RabbitMQ code into a new plugin 

I need help getting the plugin working. I registered the new plugin 
extension and tried to develop the publisher plugin in the same way the scoring 
filters work. This should allow more than one publisher queue implementation to 
work together. Then I registered the plugin developed for RabbitMQ. The code 
builds and the fetcher runs fine, but the publisher does not work correctly. 

The issue I face is getting the RabbitMQ plugin loaded in the 
NutchPublishers class. The setConfig() method is not able to load any plugins. 
Link to code: 
https://github.com/sujen1412/nutch/blob/2c484ec4789c84f7bf9e592e15c96cf788ef5967/src/java/org/apache/nutch/publisher/NutchPublishers.java#L38-L54

What am I missing here? 

Thanks for the help :) 
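
The pattern described above, one aggregate class that delegates to every loaded publisher implementation, can be sketched as below. It is a self-contained stand-in and not the NutchPublishers code from the PR: the Publisher interface, the constructor that takes an already-loaded list, and the warning text are all assumptions made for illustration, and the real class would obtain its implementations from the Nutch plugin system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical aggregate that forwards each event to all loaded publishers.
class NutchPublishersSketch {

  // Stand-in for the extension interface a publisher plugin would implement.
  interface Publisher {
    void publish(String event);
  }

  private final List<Publisher> publishers;

  NutchPublishersSketch(List<Publisher> loaded) {
    this.publishers = new ArrayList<>(loaded);
  }

  // Returns how many publishers received the event. Reporting the empty case
  // makes a "zero plugins loaded" situation visible instead of silent.
  int publish(String event) {
    if (publishers.isEmpty()) {
      System.err.println("No publisher plugins loaded; event dropped: " + event);
      return 0;
    }
    for (Publisher publisher : publishers) {
      publisher.publish(event);
    }
    return publishers.size();
  }

  public static void main(String[] args) {
    List<String> queueA = new ArrayList<>();
    List<String> queueB = new ArrayList<>();
    NutchPublishersSketch two =
        new NutchPublishersSketch(Arrays.<Publisher>asList(queueA::add, queueB::add));
    System.out.println(two.publish("fetch-start"));

    NutchPublishersSketch none =
        new NutchPublishersSketch(Collections.<Publisher>emptyList());
    System.out.println(none.publish("fetch-start"));
  }
}
```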



[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404179#comment-15404179
 ] 

ASF GitHub Bot commented on NUTCH-2132:
---

GitHub user sujen1412 opened a pull request:

https://github.com/apache/nutch/pull/138

Fix for NUTCH-2132: Publisher/Subscriber model for Nutch to emit events

This PR is still in progress and needs a review to get the plugin system 
working. It is not yet ready to commit.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujen1412/nutch NUTCH-2132

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #138


commit 5a13301a7808d852b88d8724c3c3b7783fa9d2be
Author: Sujen Shah 
Date:   2015-10-02T08:03:35Z

Added dependency for RabbitMQ

commit 3029ef4055834b72d63e7fc516eca448c2efe32a
Author: Sujen Shah 
Date:   2015-10-02T08:04:39Z

Code for FetcherThreadPublisher

commit ebfd7728e650cc7648a3939d3985826eefde93f3
Author: Sujen Shah 
Date:   2015-10-02T08:05:11Z

Added property descriptions in nutch-default.xml

commit 445fcc2d766ddef7cc36783ebcaecb552e4f2819
Author: Sujen Shah 
Date:   2015-10-02T08:28:52Z

Added support for queue routing key

commit 44498308634dda99f543d5e18724ad8cfeb16343
Author: Sujen Shah 
Date:   2015-10-19T08:44:29Z

Added properties to make publisher optional in nutch-default.xml

commit ad88c94fc274576aacaa2c17b1f55a087f7a04f9
Author: Sujen Shah 
Date:   2015-10-27T02:40:45Z

Added routingkey support

commit e380de803c8c129f0dfb7d8c31a8596b4ceae8bf
Author: Sujen Shah 
Date:   2015-10-29T20:12:55Z

Better exception handling when RMQ server is down

commit e4f5e13cc5675f9e7d37ebda39bf230c08baf4b8
Author: Sujen Shah 
Date:   2016-08-02T15:01:48Z

Created plugin system for pub/sub implementation in Nutch

commit 2c484ec4789c84f7bf9e592e15c96cf788ef5967
Author: Sujen Shah 
Date:   2016-08-02T15:12:28Z

Removed Rabbitmq dependency from ivy.xml and remove author tags





