Re: [DISCUSS] Release Trunk
Hi Folks,

@Tejasp

On Thu, Feb 13, 2014 at 6:30 AM, dev-digest-h...@nutch.apache.org wrote:
> Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have some significant patches since 1.7. +1 for new release. I would be happy to volunteer / help.

If you're game for learning the release manager role then I'm +1 to support you in that. We can do a G+ hangout whilst you do it so that it all goes smoothly. If you change your mind just let me know and I'll push an RC today.

Great work on trunk folks... lots of fixes ;)

Lewis
[jira] [Commented] (NUTCH-1727) Length of the Tlds
[ https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900124#comment-13900124 ]

Lewis John McGibbney commented on NUTCH-1727:
---------------------------------------------

The issue looks fine to me; however, some trivial unit tests would be nice. This issue could also be applied to trunk. Any comments?

> Length of the Tlds
> ------------------
>
> Key: NUTCH-1727
> URL: https://issues.apache.org/jira/browse/NUTCH-1727
> Project: Nutch
> Issue Type: Bug
> Reporter: Sertac TURKEL
> Priority: Minor
> Fix For: 2.1
> Attachments: NUTCH-1727.patch
>
> The length of the TLD should be configurable: there are valid TLDs such as .travel, and the url-validator plugin currently filters out this type of URL.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
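For illustration only, here is a minimal Java sketch of what a configurable TLD length check could look like. The class `TldLengthCheck`, the method `isTldLengthValid` and the bounds used below are hypothetical; they are not taken from the attached patch or from the url-validator plugin's actual code.

```java
import java.util.Locale;

/** Hypothetical helper; NOT the url-validator plugin's actual code. */
public class TldLengthCheck {

    /**
     * Returns true when the last dot-separated label of the host is purely
     * alphabetic (a-z) and its length lies within [minLen, maxLen].
     */
    public static boolean isTldLengthValid(String host, int minLen, int maxLen) {
        if (host == null) {
            return false;
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return false; // no TLD label at all
        }
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (tld.length() < minLen || tld.length() > maxLen) {
            return false;
        }
        for (int i = 0; i < tld.length(); i++) {
            char c = tld.charAt(i);
            if (c < 'a' || c > 'z') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // With an assumed fixed upper bound of 4, the six-letter TLD is rejected.
        System.out.println(isTldLengthValid("example.travel", 2, 4));
        // Raising the (assumed configurable) bound to 6 accepts it.
        System.out.println(isTldLengthValid("example.travel", 2, 6));
    }
}
```

In a plugin, the upper bound would presumably be read from the Nutch configuration rather than hard-coded; the property name for that is left open here.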
Re: [DISCUSS] Release Trunk
Thanks Lewis. G+ hangout sounds cool. Is this wiki page complete and up to date to start off with?
http://wiki.apache.org/nutch/Release_HOWTO

Thanks,
Tejas

On Thu, Feb 13, 2014 at 12:23 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
> Hi Folks,
>
> @Tejasp
>
> On Thu, Feb 13, 2014 at 6:30 AM, dev-digest-h...@nutch.apache.org wrote:
>> Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have some significant patches since 1.7. +1 for new release. I would be happy to volunteer / help.
>
> If you're game for learning the release manager role then I'm +1 to support you in that. We can do G+ hangout whilst you do it so that it all goes smoothly. If you change your mind just let me know and I'll push an RC today.
>
> Great work on trunk folks... lots of fixes ;)
>
> Lewis
RE: [DISCUSS] Release Trunk
Seems some of my mails to the list are not coming through. I am -1 on a release from trunk as is. The segment merger is still broken and, in my opinion, we cannot push yet another release with a broken segment merger.

Markus

-----Original message-----
From: Tejas Patil tejas.patil...@gmail.com
Sent: Thursday 13th February 2014 1:33
To: dev@nutch.apache.org
Subject: Re: [DISCUSS] Release Trunk

Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have some significant patches since 1.7. +1 for new release. I would be happy to volunteer / help.

Thanks,
Tejas

On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
> Hi guys,
>
> At least 2 of the issues that Seb and I had mentioned have now been committed. What about releasing 1.8 from trunk? If so, any volunteers?
>
> Julien
>
> On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:
>> Hi,
>>
>> +1 to release soon (this year, or early next year)
>>
>>> and probably a few others but they could also be done later.
>>
>> At least, these should be done before releasing:
>> NUTCH-1646 IndexerMapReduce to consider DB status
>> NUTCH-1413 Record response time
>>
>> Sebastian
>>
>> On 11/28/2013 05:49 PM, Julien Nioche wrote:
>>> Hi Lewis
>>>
>>> We've done quite a few things in 1.x since the previous release (e.g. generic deduplication, removing indexer.solr package, etc...) and the next 2.x release will be after the changes to GORA have been made, tested and used on the Nutch side so that could be quite a while.
>>>
>>> I am neutral as to whether we should do a 1.x release now. There are some minor issues that we could do in 1.x before the next release like:
>>> * https://issues.apache.org/jira/browse/NUTCH-1360
>>> * https://issues.apache.org/jira/browse/NUTCH-1676
>>> and probably a few others but they could also be done later.
>>>
>>> Let's hear what others think.
>>>
>>> Thanks
>>>
>>> Julien
>>>
>>> On 28 November 2013 16:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
>>>> Hi Folks,
>>>>
>>>> Thread says it all. There are some hot tickets over in Gora right now so I think holding off the next while for a 2.x release would be wise. I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs up.
>>>>
>>>> Ta
>>>> Lewis
>>>>
>>>> --
>>>> /Lewis/
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1726:
--------------------------------
Attachment: NUTCH-1726-trunk.patch

Thanks Lufeng! Here's another patch with additional options. Because it found nested nodes, it suddenly returned a lot of headings for a given URL.

* headings.limit: similar to multivalued, but allows limiting the number of headings per element
* headings.maxlength: maximum length of a heading
* headings.truncate: what to do with headings that are too long, truncate or skip them?
* headings.minlength: obvious
* headings.ignore.hyperlinks: will ignore headings inside anchors

The headings.ignore.hyperlinks option does not work despite the nodewalker.skipChildren() call. Haven't figured this out yet.

> HeadingsFilter does not find nested nodes
> -----------------------------------------
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch
>
> Filter won't find:
> {code}
> <h1><span>apache nutch</span></h1>
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes instead.
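To make the getNodeValue() point concrete, here is a small self-contained Java sketch using only the JDK's org.w3c.dom API. It is not the HeadingsParseFilter code and does not use Nutch's NodeWalker; the class `NestedHeadingText` and its methods are hypothetical names chosen for the example. It shows that an element node has no value of its own, and that a recursive walk over the children recovers the text of a heading whose content is wrapped in a span.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical illustration; NOT the Nutch headings plugin. */
public class NestedHeadingText {

    /** Recursively concatenates the values of all descendant text nodes. */
    public static String textOf(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue();
        }
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sb.append(textOf(children.item(i)));
        }
        return sb.toString();
    }

    /** Parses a small well-formed fragment and returns its root element. */
    public static Node parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        Node h1 = parseRoot("<h1><span>apache nutch</span></h1>");
        // An element node carries no value of its own, so this is null.
        System.out.println(h1.getNodeValue());
        // The first child is the span element, whose value is null as well.
        System.out.println(h1.getFirstChild().getNodeValue());
        // Walking the subtree recovers the heading text.
        System.out.println(textOf(h1));
    }
}
```

The sketch only collects plain text nodes; how the real filter should treat comments, CDATA or anchors (see headings.ignore.hyperlinks) is a separate question.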
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900180#comment-13900180 ]

Markus Jelsma commented on NUTCH-961:
-------------------------------------

I am sorry, I did not mean to speak for the Nutch PMC at all; "we are not using BP" means I am not using BP. As I said before, I am happy to commit this issue if the linked issues are resolved first.

> Expose Tika's boilerpipe support
> --------------------------------
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 2.3, 1.8
> Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerplate in the Nutch configuration.
[jira] [Reopened] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.
[ https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

İlhami KALKAN reopened NUTCH-1725:
---------------------------------

> CleaningJob's reducer does not commit deleted docs.
> ---------------------------------------------------
>
> Key: NUTCH-1725
> URL: https://issues.apache.org/jira/browse/NUTCH-1725
> Project: Nutch
> Issue Type: Bug
> Reporter: İlhami KALKAN
>
> In the cleanup(Context context) method, the if condition has a logic error.
[jira] [Updated] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.
[ https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

İlhami KALKAN updated NUTCH-1725:
--------------------------------
Attachment: NUTCH-1725.patch

I fixed the bug.

> CleaningJob's reducer does not commit deleted docs.
> ---------------------------------------------------
>
> Key: NUTCH-1725
> URL: https://issues.apache.org/jira/browse/NUTCH-1725
> Project: Nutch
> Issue Type: Bug
> Reporter: İlhami KALKAN
> Attachments: NUTCH-1725.patch
>
> In the cleanup(Context context) method, the if condition has a logic error.
[jira] [Created] (NUTCH-1728) indexer-solr plugin is not delete docs from solr
İlhami KALKAN created NUTCH-1728:
---------------------------------

Summary: indexer-solr plugin is not delete docs from solr
Key: NUTCH-1728
URL: https://issues.apache.org/jira/browse/NUTCH-1728
Project: Nutch
Issue Type: Bug
Reporter: İlhami KALKAN

The setting of the delete variable used in the delete(String key) method is missing.
[jira] [Updated] (NUTCH-1728) indexer-solr plugin is not delete docs from solr
[ https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

İlhami KALKAN updated NUTCH-1728:
--------------------------------
Attachment: NUTCH-1728.patch

Fix bug.

> indexer-solr plugin is not delete docs from solr
> ------------------------------------------------
>
> Key: NUTCH-1728
> URL: https://issues.apache.org/jira/browse/NUTCH-1728
> Project: Nutch
> Issue Type: Bug
> Reporter: İlhami KALKAN
> Attachments: NUTCH-1728.patch
>
> The setting of the delete variable used in the delete(String key) method is missing.
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900352#comment-13900352 ]

Markus Jelsma commented on NUTCH-1726:
--------------------------------------

lufeng, it seems one of your unit tests fails. Is something wrong with the test, or is my fix just not correct? :)

> HeadingsFilter does not find nested nodes
> -----------------------------------------
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch
>
> Filter won't find:
> {code}
> <h1><span>apache nutch</span></h1>
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes instead.
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432 ]

lufeng commented on NUTCH-1726:
-------------------------------

Hi Markus. But I didn't find any error using your newest patch.

{code:xml}
test:
     [echo] Testing plugin: headings
    [junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec

BUILD SUCCESSFUL
Total time: 3 seconds
{code}

* Maybe you can truncate long headings if their size is larger than the value of the maxlength option, so the headings.truncate option can be removed.

> HeadingsFilter does not find nested nodes
> -----------------------------------------
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch
>
> Filter won't find:
> {code}
> <h1><span>apache nutch</span></h1>
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes instead.
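As a rough sketch of how the options discussed in this issue might interact, the Java below applies a limit, a minimum length, a maximum length and a truncate-or-skip choice to a list of heading strings. The option names come from the thread, but the semantics shown (trim first, skip headings that are too short, then either cut or skip headings that are too long) are an assumption, and the class `HeadingLimits` and its `apply` method are hypothetical; this is not the patch's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the discussed options; NOT the patch itself. */
public class HeadingLimits {

    /**
     * Keeps at most {@code limit} headings. Headings shorter than
     * {@code minLength} are skipped. Headings longer than {@code maxLength}
     * are cut to {@code maxLength} characters when {@code truncate} is true
     * and skipped otherwise (assumed semantics).
     */
    public static List<String> apply(List<String> headings, int limit,
                                     int minLength, int maxLength, boolean truncate) {
        List<String> kept = new ArrayList<>();
        for (String raw : headings) {
            if (kept.size() >= limit) {
                break; // headings.limit reached
            }
            String heading = raw.trim();
            if (heading.length() < minLength) {
                continue; // headings.minlength
            }
            if (heading.length() > maxLength) {
                if (!truncate) {
                    continue; // skip headings that are too long
                }
                heading = heading.substring(0, maxLength); // headings.truncate
            }
            kept.add(heading);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList(
            "hi", "apache nutch", "abcdefghijklmnopqrstuvwxyz", "third");
        System.out.println(apply(found, 3, 3, 12, true));
        System.out.println(apply(found, 3, 3, 12, false));
    }
}
```

Under lufeng's suggestion the `truncate` flag would disappear and the long-heading branch would always cut to maxlength; the flag is kept here only to show both behaviours side by side.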