[jira] [Comment Edited] (NUTCH-1662) Indexer Plugin for Solr Cloud
[ https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872195#comment-13872195 ]

Yasin Kılınç edited comment on NUTCH-1662 at 2/12/14 9:23 AM:

I created an indexer plugin for SolrCloud. This patch can be applied after https://issues.apache.org/jira/browse/NUTCH-1568.

was (Author: icebergx5):
I created an indexer plugin for SolrCloud. This patch can be applied after NUTCH-1655.

Indexer Plugin for Solr Cloud
Key: NUTCH-1662
URL: https://issues.apache.org/jira/browse/NUTCH-1662
Project: Nutch
Issue Type: Sub-task
Components: indexer
Affects Versions: 2.3
Reporter: Talat UYARER
Fix For: 2.3
Attachments: NUTCH-1662.patch

The main issue's patch uses a Solr HTTP connection, which does not support SolrCloud. This plugin supports SolrCloud.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (NUTCH-1662) Indexer Plugin for Solr Cloud
[ https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872195#comment-13872195 ]

Yasin Kılınç edited comment on NUTCH-1662 at 2/12/14 9:27 AM:

I created an indexer plugin for SolrCloud.

was (Author: icebergx5):
I created an indexer plugin for SolrCloud. This patch can be applied after https://issues.apache.org/jira/browse/NUTCH-1568.

Indexer Plugin for Solr Cloud
Key: NUTCH-1662
URL: https://issues.apache.org/jira/browse/NUTCH-1662
Project: Nutch
Issue Type: Sub-task
Components: indexer
Affects Versions: 2.3
Reporter: Talat UYARER
Fix For: 2.3
Attachments: NUTCH-1662.patch

The main issue's patch uses a Solr HTTP connection, which does not support SolrCloud. This plugin supports SolrCloud.
[jira] [Commented] (NUTCH-1718) update description of property http.robots.agent
[ https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899029#comment-13899029 ]

Sebastian Nagel commented on NUTCH-1718:

Hi [~tejasp], +1 to redefine {{http.robots.agents}} as additional agent names: it makes things simpler for polite users, who should definitely use the same user agent name in the HTTP header and in robots.txt.

update description of property http.robots.agent
Key: NUTCH-1718
URL: https://issues.apache.org/jira/browse/NUTCH-1718
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.7, 2.2, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
Fix For: 2.3, 1.8
Attachments: NUTCH-1718-trunk.v1.patch

The description of property http.robots.agents in nutch-default.xml recommends adding a '*' to the list of agent names. This will cause the same problem as described in NUTCH-1715. The description should be updated, also with regard to the order of precedence, which since NUTCH-1031 is dictated only by the ordering of user agents in robots.txt.
{code:xml}
<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
{code}
[jira] [Created] (NUTCH-1726) HeadingsFilter does not find nested nodes
Markus Jelsma created NUTCH-1726:

Summary: HeadingsFilter does not find nested nodes
Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.
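To illustrate the traversal described in the report, here is a minimal sketch that collects heading text by walking all descendant nodes instead of reading only direct children. It is not the code from the attached patch: the class name NestedHeadingText and its methods are hypothetical, and it uses the JDK's XML parser on a well-formed snippet rather than Nutch's HTML parsing chain.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch or of the patch.
class NestedHeadingText {

  /** Recursively appends the value of every text node at or below the given node. */
  static void collectText(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
      return;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), out);
    }
  }

  /** Parses a small well-formed snippet and returns the trimmed text of its root element. */
  static String headingText(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xml)));
      StringBuilder out = new StringBuilder();
      collectText(doc.getDocumentElement(), out);
      return out.toString().trim();
    } catch (Exception e) {
      throw new RuntimeException("Could not parse snippet", e);
    }
  }

  public static void main(String[] args) {
    // The text sits one level below the heading, inside the span element.
    System.out.println(headingText("<h1><span>apache nutch</span></h1>"));
  }
}
```

Because the recursion descends through element nodes until it reaches text nodes, the heading text is found regardless of how deeply it is nested.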
[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1726:

Attachment: NUTCH-1726-trunk.patch

Patch for trunk, fixing the problem.

HeadingsFilter does not find nested nodes
Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899044#comment-13899044 ]

Matzz commented on NUTCH-961:

{quote}We don't use it BP anymore {quote}
Will BP integration be totally abandoned? Are there any plans to use another content extractor instead of Boilerpipe?

Expose Tika's boilerpipe support
Key: NUTCH-961
URL: https://issues.apache.org/jira/browse/NUTCH-961
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 2.3, 1.8
Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch

Tika 0.8 comes with the Boilerpipe content handler, which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
Is it possible to run Nutch 2.x with httpclient 3 and 4 simultaneously?
I'm looking into upgrading the httpclient version used by protocol-httpclient because there are some fixes in the 4.x branch that I need. I realized this will be impossible to do without breaking gora 3 support of hbase 90.x, because the latter still uses httpclient 3.

So I was wondering how bad it would be if I upgraded the dependencies to httpclient 4 and changed protocol-httpclient to use the version 4 API without touching any gora/hbase code. Since the package name has changed, the new library should not affect code that doesn't import the new package, so both versions could be loaded and live side by side.

From the looks of it, having two versions of the same library sounds like a bad idea, but I'll be happy to hear opinions on the subject.
Re: [DISCUSS] Release Trunk
Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been committed. What about releasing 1.8 from trunk? If so, any volunteers?

Julien

On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:

Hi,

+1 to release soon (this year, or early next year)

and probably a few others but they could also be done later.

At least, these should be done before releasing:
NUTCH-1646 IndexerMapReduce to consider DB status
NUTCH-1413 Record response time

Sebastian

On 11/28/2013 05:49 PM, Julien Nioche wrote:

Hi Lewis

We've done quite a few things in 1.x since the previous release (e.g. generic deduplication, removing indexer.solr package, etc...) and the next 2.x release will be after the changes to GORA have been made, tested and used on the Nutch side so that could be quite a while. I am neutral as to whether we should do a 1.x release now. There are some minor issues that we could do in 1.x before the next release like:
* https://issues.apache.org/jira/browse/NUTCH-1360
* https://issues.apache.org/jira/browse/NUTCH-1676
and probably a few others but they could also be done later. Let's hear what others think.

Thanks
Julien

On 28 November 2013 16:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Folks,

Thread says it all. There are some hot tickets over in Gora right now so I think holding off the next while for a 2.x release would be wise. I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs up.

Ta
Lewis

--
/Lewis/

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Created] (NUTCH-1727) Length of the Tlds
Sertac TURKEL created NUTCH-1727:

Summary: Length of the Tlds
Key: NUTCH-1727
URL: https://issues.apache.org/jira/browse/NUTCH-1727
Project: Nutch
Issue Type: Bug
Reporter: Sertac TURKEL
Priority: Minor
Fix For: 2.1

The maximum length of the TLD should be configurable: there are valid TLDs such as .travel, and the url-validator plugin filters out URLs of this type.
[jira] [Updated] (NUTCH-1727) Length of the Tlds
[ https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertac TURKEL updated NUTCH-1727:

Attachment: NUTCH-1727.patch

I had a look at domain-suffix.xml and saw that the longest domain suffix has 8 characters (.internal). For this reason I picked 8 as the default value and prepared a patch. Could you review my patch?

Length of the Tlds
Key: NUTCH-1727
URL: https://issues.apache.org/jira/browse/NUTCH-1727
Project: Nutch
Issue Type: Bug
Reporter: Sertac TURKEL
Priority: Minor
Fix For: 2.1
Attachments: NUTCH-1727.patch

The maximum length of the TLD should be configurable: there are valid TLDs such as .travel, and the url-validator plugin filters out URLs of this type.
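As a rough illustration of what a configurable TLD length could look like, the sketch below builds the TLD check from a maximum length passed in by the caller. It is not taken from NUTCH-1727.patch: the class TldLengthCheck, its constructor parameter and the assumption that the check is a simple letters-only pattern are all hypothetical, and the real url-validator plugin may validate TLDs differently.

```java
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not the url-validator plugin code.
class TldLengthCheck {

  private final Pattern tldPattern;

  /** maxTldLength is an assumed setting, e.g. 8 to cover ".internal". */
  TldLengthCheck(int maxTldLength) {
    if (maxTldLength < 2) {
      throw new IllegalArgumentException("maxTldLength must be at least 2");
    }
    // Letters only, between 2 and maxTldLength characters.
    this.tldPattern = Pattern.compile("^[A-Za-z]{2," + maxTldLength + "}$");
  }

  /** Returns true if the last label of the host fits the configured length. */
  boolean hasAcceptableTld(String host) {
    int dot = host.lastIndexOf('.');
    if (dot < 0 || dot == host.length() - 1) {
      return false; // no TLD present
    }
    return tldPattern.matcher(host.substring(dot + 1)).matches();
  }

  public static void main(String[] args) {
    // With a limit of 4 the 6-letter TLD is rejected; with 8 it is accepted.
    System.out.println(new TldLengthCheck(4).hasAcceptableTld("example.travel"));
    System.out.println(new TldLengthCheck(8).hasAcceptableTld("example.travel"));
  }
}
```

Reading the limit from configuration rather than hard-coding it means longer suffixes can be admitted later without another code change.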
Re: [DISCUSS] Release Trunk
Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have some significant patches since 1.7. +1 for new release. I would be happy to volunteer / help.

Thanks,
Tejas

On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been committed. What about releasing 1.8 from trunk? If so, any volunteers?

Julien

On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:

Hi,

+1 to release soon (this year, or early next year)

and probably a few others but they could also be done later.

At least, these should be done before releasing:
NUTCH-1646 IndexerMapReduce to consider DB status
NUTCH-1413 Record response time

Sebastian

On 11/28/2013 05:49 PM, Julien Nioche wrote:

Hi Lewis

We've done quite a few things in 1.x since the previous release (e.g. generic deduplication, removing indexer.solr package, etc...) and the next 2.x release will be after the changes to GORA have been made, tested and used on the Nutch side so that could be quite a while. I am neutral as to whether we should do a 1.x release now. There are some minor issues that we could do in 1.x before the next release like:
* https://issues.apache.org/jira/browse/NUTCH-1360
* https://issues.apache.org/jira/browse/NUTCH-1676
and probably a few others but they could also be done later. Let's hear what others think.

Thanks
Julien

On 28 November 2013 16:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Folks,

Thread says it all. There are some hot tickets over in Gora right now so I think holding off the next while for a 2.x release would be wise. I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs up.

Ta
Lewis

--
/Lewis/

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1726:

Attachment: NUTCH-1726-trunk-v2.patch

Added a test case to check the HeadingsFilter patch. :)

HeadingsFilter does not find nested nodes
Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.