[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186086#comment-15186086 ]

Chris A. Mattmann commented on NUTCH-2132:
------------------------------------------

Agreed - I will try to generalize it and then update for review.

> Publisher/Subscriber model for Nutch to emit events
> ---------------------------------------------------
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch,
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g.
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain
> data like outlinks of the current fetched url, score, etc).
> A consumer of this functionality could use this data to generate real-time
> visualizations and statistics of the crawl without having to wait for
> the fetch round to finish.
> The REST API could contain an endpoint which would respond with a URL to
> which a client could subscribe to get the fetcher events.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2005) Implement HTrace'ing in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185580#comment-15185580 ]

Farasath Ahamed commented on NUTCH-2005:
----------------------------------------

Hi Lewis,

I am Farasath Ahamed, a final-year undergraduate at the University of Moratuwa, Sri Lanka.
I would appreciate a few pointers to get started on this idea as a potential project for GSoC 2016.

> Implement HTrace'ing in Nutch
> -----------------------------
>
> Key: NUTCH-2005
> URL: https://issues.apache.org/jira/browse/NUTCH-2005
> Project: Nutch
> Issue Type: New Feature
> Components: build
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Labels: gsoc2016
>
> I've recently been mentoring the [Apache
> HTrace|http://htrace.incubator.apache.org/] effort, a tracing framework for
> use with distributed systems written in Java.
> I think that having fine-grained tracing available within Nutch
> would be a significant strength and another string to our bow.
[jira] [Assigned] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2202:
-------------------------------------------

Assignee: Lewis John McGibbney

> Integration of Anthelion (Focused Crawling Module) into Nutch
> -------------------------------------------------------------
>
> Key: NUTCH-2202
> URL: https://issues.apache.org/jira/browse/NUTCH-2202
> Project: Nutch
> Issue Type: Improvement
> Components: parser, scoring
> Reporter: Robert Meusel
> Assignee: Lewis John McGibbney
> Labels: any23, online_learning
>
> We have recently released anthelion, which is a focused crawler plugin for
> structured data which can be extracted with any23.
> (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John
> McGibbney) we think the integration of the parser (any23) and the scoring
> function based on the online learner could be a good improvement for nutch.
[jira] [Commented] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185025#comment-15185025 ]

Robert Meusel commented on NUTCH-2202:
--------------------------------------

Hi there,

Sorry for the delay; I did not see this. I will have a look at it next week, as I am currently tied up with other deadlines.

Cheers,
Robert
[jira] [Updated] (NUTCH-2202) Integration of Anthelion (Focused Crawling Module) into Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2202:
----------------------------------------

Description:
We have recently released anthelion, which is a focused crawler plugin for structured data which can be extracted with any23. (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John McGibbney) we think the integration of the parser (any23) and the scoring function based on the online learner could be a good improvement for nutch.

(was:
We have recently released anthelion, which is a focused crawler plugin for structured data which can be extracted with any23. (https://github.com/yahoo/anthelion) As proposed by Lewis (Lewis John McGibbney) we think the integration of the parser (any23) and the scoring funciton based on the online learner could be a good improvement for nutch.
)
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184693#comment-15184693 ]

Markus Jelsma commented on NUTCH-2132:
--------------------------------------

Hello - this is interesting indeed. I read the patch, and at first sight there is only one issue I'd like to see resolved:

* FetcherThreadPublisher imports RabbitMQPublisher, and the publisher selection code is static. Without modifying Nutch itself, one cannot implement a new publisherImpl. Shouldn't it load any FQCN that implements NutchPublisher?
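For readers following the thread, the suggestion above (load the publisher by fully qualified class name rather than through a static import) could look roughly like the sketch below. This is an illustration only, not the code in the attached patches: NutchPublisher is the interface name mentioned in the comment, but its publish(String) method here is an assumption, and InMemoryPublisher and PublisherLoader are hypothetical names invented for the example. In Nutch the class name would presumably come from configuration, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the NutchPublisher interface; the single method is an assumption. */
interface NutchPublisher {
    void publish(String event);
}

/** Hypothetical implementation that keeps events in memory, for illustration only. */
class InMemoryPublisher implements NutchPublisher {
    final List<String> events = new ArrayList<>();

    @Override
    public void publish(String event) {
        events.add(event);
    }
}

/** Hypothetical helper: instantiates a publisher from its class name via reflection. */
final class PublisherLoader {
    private PublisherLoader() {}

    static NutchPublisher load(String className) {
        try {
            // asSubclass rejects classes that do not implement NutchPublisher.
            Class<? extends NutchPublisher> cls =
                Class.forName(className).asSubclass(NutchPublisher.class);
            // Requires an accessible no-argument constructor.
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load publisher: " + className, e);
        }
    }
}
```

With something along these lines, the fetcher code would depend only on the interface, and a new publisher implementation could be dropped onto the classpath and selected by name without changing Nutch itself.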