[jira] [Updated] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true
[ https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Cherniachenko updated NUTCH-1525:
    Attachment: nutch-logExternal.patch

Attached the patch for Nutch 1.7.
With it applied, you can add the following to log4j.properties:
{code}
log4j.logger.org.apache.nutch.parse.ParseOutputFormat.externalLinks=INFO,extlinks
log4j.appender.extlinks=org.apache.log4j.DailyRollingFileAppender
log4j.appender.extlinks.File=${hadoop.log.dir}/external-links.log
log4j.appender.extlinks.DatePattern=.-MM-dd
log4j.appender.extlinks.layout=org.apache.log4j.PatternLayout
log4j.appender.extlinks.layout.ConversionPattern=%m%n
{code}
All the ignored external links will then be logged cleanly to external-links.log.

Generator to record external links even when db.ignore.external.links set to true
--
    Key: NUTCH-1525
    URL: https://issues.apache.org/jira/browse/NUTCH-1525
    Project: Nutch
    Issue Type: Improvement
    Components: generator
    Reporter: Lewis John McGibbney
    Priority: Minor
    Fix For: 2.4
    Attachments: nutch-logExternal.patch

When fetching pages from specific domains we have various options, e.g. use urlfilters, or set the above property to true before injecting URLs into the webdb. However, with the former, it is recognised that complex regexes can slow down processing, and with the latter we disregard a number of URLs which could potentially become useful in the future.
Unfortunately there is no way to record external links encountered for future processing, although the wiki suggests that a very small patch to the generator code can allow you to log these links to hadoop.log. Although this is better, a more robust storage mechanism would be preferred.
This may tie in with custom counters we've already specified or may require new counters to be implemented.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
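For illustration only: the patch itself is not reproduced in this message, so the following Java sketch is an assumption about the kind of check involved, not its actual code. It treats a link as external when the host of the target URL differs from the host of the source page, and sends such links to a logger carrying the name used in the log4j.properties snippet. The class and method names are hypothetical, and java.util.logging stands in for log4j so that the example runs on a plain JDK.

```java
import java.net.URI;
import java.util.Locale;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not taken from nutch-logExternal.patch.
final class ExternalLinkLogSketch {
    // java.util.logging stands in for the log4j logger configured in log4j.properties.
    private static final Logger EXT_LINKS =
        Logger.getLogger("org.apache.nutch.parse.ParseOutputFormat.externalLinks");

    /** Returns the lower-cased host of a URL, or an empty string if it has none. */
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /** A link is treated as external when source and target hosts differ. */
    static boolean isExternal(String fromUrl, String toUrl) {
        return !hostOf(fromUrl).equals(hostOf(toUrl));
    }

    /** Logs an ignored external link; returns true only for links that are kept. */
    static boolean keepLink(String fromUrl, String toUrl) {
        if (isExternal(fromUrl, toUrl)) {
            EXT_LINKS.info(toUrl); // one URL per log message
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(keepLink("http://nutch.apache.org/", "http://nutch.apache.org/about.html"));
        System.out.println(keepLink("http://nutch.apache.org/", "http://example.com/"));
    }
}
```

Giving the logger its own name is what lets the appender configuration route only these messages to a separate file, leaving hadoop.log untouched.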
[jira] [Commented] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true
[ https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901313#comment-13901313 ]

Lewis John McGibbney commented on NUTCH-1525:
-

[~sabio], thank you for the patch. I totally forgot about this issue.
Can we verify whether we are able to derive Hadoop counters as well as, or instead of, simple logging? If we can obtain counters, then it is much easier to analyze the number of external links we filter.
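As a rough illustration of the counter idea, the sketch below tallies kept and ignored links in a small in-memory map. This is only a stand-in: in a real job the increments would go through Hadoop's own counter API, and the class name and counter names here are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for job counters; not Nutch or Hadoop code.
final class IgnoredLinkCounterSketch {
    private final Map<String, Long> counts = new TreeMap<>();

    /** Adds one to the named counter, creating it on first use. */
    void increment(String name) {
        counts.merge(name, 1L, Long::sum);
    }

    /** Returns the current value of the named counter, or 0 if it was never incremented. */
    long get(String name) {
        return counts.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        IgnoredLinkCounterSketch counters = new IgnoredLinkCounterSketch();
        // Pretend outlinks of one page: true means the link points to another host.
        boolean[] outlinkIsExternal = {true, false, true};
        for (boolean external : outlinkIsExternal) {
            counters.increment(external ? "external_links_ignored" : "internal_links_kept");
        }
        System.out.println(counters.counts);
    }
}
```

A pair of totals like this answers the question of how many external links are filtered without having to post-process a log file.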
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901363#comment-13901363 ]

Markus Jelsma commented on NUTCH-1726:
--

Hi lufeng! I don't understand: I have a clean Apache Nutch headings plugin, and the same test fails for both my patch and your patch.
{code}
Testcase: testIt took 1.489 sec
Testcase: testMultiValueMetatags took 0.185 sec
	FAILED
One value of metatag with multiple values is missing: Test header h2 with span
junit.framework.AssertionFailedError: One value of metatag with multiple values is missing: Test header h2 with span
	at org.apache.nutch.parse.headings.TestHeadingsParseFilter.testMultiValueMetatags(TestHeadingsParseFilter.java:97)
{code}
I added truncate because perhaps some users may want to ignore long headers instead of truncating them. If I get a header containing 2kb of text, I think I would like to skip it, not truncate it.

Markus

HeadingsFilter does not find nested nodes
-
    Key: NUTCH-1726
    URL: https://issues.apache.org/jira/browse/NUTCH-1726
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.7
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Minor
    Fix For: 1.8
    Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.
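To make the traversal point concrete, here is a minimal Java sketch using only the JDK's DOM classes. It is not the code from the attached patches: the class and method names are hypothetical, and a well-formed XML fragment stands in for parsed HTML. Instead of reading getNodeValue() from an element's direct children, it walks the subtree depth first and concatenates every text node, so a heading whose text sits inside a nested span is still found.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration; not the HeadingsParseFilter implementation.
final class NestedHeadingTextSketch {
    /** Concatenates the text of every text node below the given node, depth first. */
    static String collectText(Node node) {
        StringBuilder sb = new StringBuilder();
        appendText(node, sb);
        return sb.toString().trim();
    }

    private static void appendText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            appendText(children.item(i), sb); // recurse into nested elements such as span
        }
    }

    /** Parses a well-formed fragment and returns the text of the first element with the given tag. */
    static String firstHeadingText(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return collectText(doc.getElementsByTagName(tag).item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // prints: apache nutch
        System.out.println(firstHeadingText("<h1><span>apache nutch</span></h1>", "h1"));
    }
}
```

The same walk also handles mixed content, where part of the heading text is a direct child and part is nested.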
Re: [DISCUSS] Release Trunk
Hi,

-1 also from me for now. Besides SegmentMerger (NUTCH-1113), there is a problem in the indexer (NUTCH-1706/NUTCH-1646) which should be fixed. I hope to tackle both issues soon.

Sebastian

On 02/13/2014 10:19 AM, Markus Jelsma wrote:

Seems some of my mails to the list are not coming through. I am -1 on release from trunk as is. The segment merger is still broken and in my opinion we cannot push yet another release with a broken segment merger.

Markus

-Original message-
From: Tejas Patil tejas.patil...@gmail.com
Sent: Thursday 13th February 2014 1:33
To: dev@nutch.apache.org
Subject: Re: [DISCUSS] Release Trunk

Just saw the commits since the 1.7 release. Apart from trivial bug fixes, we have some significant patches since 1.7. +1 for a new release. I would be happy to volunteer / help.

Thanks,
Tejas

On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been committed. What about releasing 1.8 from trunk? If so, any volunteers?

Julien

On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:

Hi,

+1 to release soon (this year, or early next year)

> and probably a few others but they could also be done later.

At least, these should be done before releasing:
NUTCH-1646 IndexerMapReduce to consider DB status
NUTCH-1413 Record response time

Sebastian

On 11/28/2013 05:49 PM, Julien Nioche wrote:

Hi Lewis

We've done quite a few things in 1.x since the previous release (e.g. generic deduplication, removing the indexer.solr package, etc.) and the next 2.x release will be after the changes to GORA have been made, tested and used on the Nutch side, so that could be quite a while. I am neutral as to whether we should do a 1.x release now.

There are some minor issues that we could do in 1.x before the next release, like:
* https://issues.apache.org/jira/browse/NUTCH-1360
* https://issues.apache.org/jira/browse/NUTCH-1676
and probably a few others but they could also be done later.

Let's hear what others think.

Thanks
Julien

On 28 November 2013 16:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Folks,

Thread says it all. There are some hot tickets over in Gora right now, so I think holding off the next while for a 2.x release would be wise.
I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs up.

Ta
Lewis

--
/Lewis/

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble