[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-08 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186086#comment-15186086
 ] 

Chris A. Mattmann commented on NUTCH-2132:
------------------------------------------

Agreed. I will try to generalize it and then update the patch for review.

> Publisher/Subscriber model for Nutch to emit events
> ---------------------------------------------------
>
>                 Key: NUTCH-2132
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2132
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.12
>
>         Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. 
> fetcher events like fetch-start, fetch-end, or a fetch report which may contain 
> data like the outlinks of the currently fetched URL, its score, etc.). 
> A consumer of this functionality could use this data to generate real-time 
> visualizations and crawl statistics without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events. 
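
As a rough illustration of the idea in the description above, the following in-process sketch shows fetcher events being published to subscribers that keep running statistics while the fetch is still in progress. Everything here is hypothetical: `FetcherEvent`, `EventBus`, and the field names are not taken from the attached patches, and a real deployment would presumably publish to an external message broker rather than call subscribers directly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Illustrative only: none of these types come from the NUTCH-2132 patches. */
class FetcherEventSketch {

  /** Hypothetical event payload carrying the data listed in the issue description. */
  static final class FetcherEvent {
    final String type;            // "fetch-start", "fetch-end", or "fetch-report"
    final String url;             // URL being fetched
    final List<String> outlinks;  // outlinks found on the page (empty if none)
    final float score;            // score of the fetched URL

    FetcherEvent(String type, String url, List<String> outlinks, float score) {
      this.type = type;
      this.url = url;
      this.outlinks = outlinks;
      this.score = score;
    }
  }

  /** Minimal synchronous publisher: each event goes to every subscriber in order. */
  static final class EventBus {
    private final List<Consumer<FetcherEvent>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<FetcherEvent> subscriber) {
      subscribers.add(subscriber);
    }

    void publish(FetcherEvent event) {
      for (Consumer<FetcherEvent> subscriber : subscribers) {
        subscriber.accept(event);
      }
    }
  }

  public static void main(String[] args) {
    EventBus bus = new EventBus();

    // Subscriber that counts events per type as they arrive,
    // without waiting for the fetch round to finish.
    Map<String, Integer> countsByType = new TreeMap<>();
    bus.subscribe(event -> countsByType.merge(event.type, 1, Integer::sum));

    String url = "http://example.org/";
    bus.publish(new FetcherEvent("fetch-start", url, Collections.emptyList(), 0.0f));
    bus.publish(new FetcherEvent("fetch-report", url,
        Arrays.asList("http://example.org/a", "http://example.org/b"), 1.0f));
    bus.publish(new FetcherEvent("fetch-end", url, Collections.emptyList(), 1.0f));

    System.out.println(countsByType);
  }
}
```

The subscriber sees each event as soon as it is published, which is the property the description asks for; how events would actually leave the fetcher process is left open here.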



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2005) Implement HTrace'ing in Nutch

2016-03-08 Thread Farasath Ahamed (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185580#comment-15185580
 ] 

Farasath Ahamed commented on NUTCH-2005:
----------------------------------------

Hi Lewis,

I am Farasath Ahamed, a final-year undergraduate at the University of Moratuwa, 
Sri Lanka. I would appreciate a few pointers to get started on this idea as a 
potential project for GSoC 2016.




> Implement HTrace'ing in Nutch
> -----------------------------
>
>                 Key: NUTCH-2005
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2005
>             Project: Nutch
>          Issue Type: New Feature
>          Components: build
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>              Labels: gsoc2016
>
> I've recently been mentoring the [Apache 
> HTrace|http://htrace.incubator.apache.org/] effort, a tracing framework for 
> use with distributed systems written in Java.
> I think that having fine-grained tracing available within Nutch 
> would be a great strength and another string to our bow.





[jira] [Assigned] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-03-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2202:
-------------------------------------------

Assignee: Lewis John McGibbney

> Integration of Anthelion (Focused Crawling Module) into Nutch
> -------------------------------------------------------------
>
>                 Key: NUTCH-2202
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2202
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, scoring
>            Reporter: Robert Meusel
>            Assignee: Lewis John McGibbney
>              Labels: any23, online_learning
>
> We have recently released anthelion, which is a focused crawler plugin for 
> structured data which can be extracted with any23. 
> (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John 
> McGibbney) we think the integration of the parser (any23) and the scoring 
> function based on the online learner could be a good improvement for nutch. 





[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-03-08 Thread Robert Meusel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185025#comment-15185025
 ] 

Robert Meusel commented on NUTCH-2202:
--------------------------------------

Hi there,

Sorry for the delay. I did not see this. I will have a look at it next week, 
as I am currently tied up with some other deadlines. 

Cheers,
Robert

> Integration of Anthelion (Focused Crawling Module) into Nutch
> -------------------------------------------------------------
>
>                 Key: NUTCH-2202
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2202
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, scoring
>            Reporter: Robert Meusel
>              Labels: any23, online_learning
>
> We have recently released anthelion, which is a focused crawler plugin for 
> structured data which can be extracted with any23. 
> (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John 
> McGibbney) we think the integration of the parser (any23) and the scoring 
> function based on the online learner could be a good improvement for nutch. 





[jira] [Updated] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch

2016-03-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2202:
----------------------------------------
Description: We have recently released anthelion, which is a focused 
crawler plugin for structured data which can be extracted with any23. 
(https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John 
McGibbney) we think the integration of the parser (any23) and the scoring 
function based on the online learner could be a good improvement for nutch.   
(was: We have recently released anthelion, which is a focused crawler plugin 
for structured data which can be extracted with any23. 
(https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John 
McGibbney) we think the integration of the parser (any23) and the scoring 
funciton based on the online learner could be a good improvement for nutch. )

> Integration of Anthelion (Focused Crawling Module) into Nutch
> -------------------------------------------------------------
>
>                 Key: NUTCH-2202
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2202
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, scoring
>            Reporter: Robert Meusel
>              Labels: any23, online_learning
>
> We have recently released anthelion, which is a focused crawler plugin for 
> structured data which can be extracted with any23. 
> (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John 
> McGibbney) we think the integration of the parser (any23) and the scoring 
> function based on the online learner could be a good improvement for nutch. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184693#comment-15184693
 ] 

Markus Jelsma commented on NUTCH-2132:
--------------------------------------

Hello - this is interesting indeed. I read the patch and at first sight there's 
only one issue I'd like to see resolved:
* FetcherThreadPublisher imports RabbitMQPublisher and the publisher 
selection code is static. Without modifying Nutch itself, one cannot implement 
a new publisherImpl. Shouldn't it load any FQCN that implements NutchPublisher?
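
To make the suggestion concrete, here is a minimal sketch of loading a publisher by its fully qualified class name with plain Java reflection. It is not the code from the patch: the `NutchPublisher` interface below is a one-method stand-in whose real signature is not shown in this thread, `InMemoryPublisher` and `load` are hypothetical, and the class name is hard-coded where a real setup would read it from configuration. Nutch also has its own plugin mechanism, which may well be the better fit; this only shows the reflective core of the idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a stand-in for selecting a publisher by class name. */
class PublisherLoaderSketch {

  /** Hypothetical stand-in for the NutchPublisher interface named in the comment. */
  public interface NutchPublisher {
    void publish(String event);
  }

  /** Hypothetical implementation that just records events in memory. */
  public static class InMemoryPublisher implements NutchPublisher {
    public final List<String> events = new ArrayList<>();

    @Override
    public void publish(String event) {
      events.add(event);
    }
  }

  /**
   * Loads the named class, checks that it implements NutchPublisher, and
   * instantiates it through its no-argument constructor.
   */
  static NutchPublisher load(String className) {
    try {
      Class<?> candidate = Class.forName(className);
      if (!NutchPublisher.class.isAssignableFrom(candidate)) {
        throw new IllegalArgumentException(
            className + " does not implement NutchPublisher");
      }
      return candidate.asSubclass(NutchPublisher.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      // Class missing, no usable no-arg constructor, or constructor failed.
      throw new IllegalStateException("Cannot load publisher " + className, e);
    }
  }

  public static void main(String[] args) {
    // Assumption: in practice this name would come from configuration.
    String configuredClass = InMemoryPublisher.class.getName();

    NutchPublisher publisher = load(configuredClass);
    publisher.publish("fetch-start");
    System.out.println("Loaded " + publisher.getClass().getName());
  }
}
```

With this shape, the calling code depends only on the interface, so a new implementation can be added to the classpath and selected by name without touching the caller.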

> Publisher/Subscriber model for Nutch to emit events
> ---------------------------------------------------
>
>                 Key: NUTCH-2132
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2132
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.12
>
>         Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. 
> fetcher events like fetch-start, fetch-end, or a fetch report which may contain 
> data like the outlinks of the currently fetched URL, its score, etc.). 
> A consumer of this functionality could use this data to generate real-time 
> visualizations and crawl statistics without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events. 


