[jira] [Comment Edited] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-02-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872195#comment-13872195
 ] 

Yasin Kılınç edited comment on NUTCH-1662 at 2/12/14 9:23 AM:
--

I created an indexer plugin for SolrCloud. This patch can be applied after 
https://issues.apache.org/jira/browse/NUTCH-1568.


was (Author: icebergx5):
I created an indexer plugin for SolrCloud. This patch can be applied after NUTCH-1655.

 Indexer Plugin for Solr Cloud
 -

 Key: NUTCH-1662
 URL: https://issues.apache.org/jira/browse/NUTCH-1662
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 2.3
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1662.patch


 The main issue's patch uses a Solr HTTP connection, which doesn't support Solr Cloud. 
 This plugin supports Solr Cloud. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-02-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872195#comment-13872195
 ] 

Yasin Kılınç edited comment on NUTCH-1662 at 2/12/14 9:27 AM:
--

I created an indexer plugin for SolrCloud.


was (Author: icebergx5):
I created an indexer plugin for SolrCloud. This patch can be applied after 
https://issues.apache.org/jira/browse/NUTCH-1568.

 Indexer Plugin for Solr Cloud
 -

 Key: NUTCH-1662
 URL: https://issues.apache.org/jira/browse/NUTCH-1662
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 2.3
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1662.patch


 The main issue's patch uses a Solr HTTP connection, which doesn't support Solr Cloud. 
 This plugin supports Solr Cloud. 





[jira] [Commented] (NUTCH-1718) update description of property http.robots.agent

2014-02-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899029#comment-13899029
 ] 

Sebastian Nagel commented on NUTCH-1718:


Hi [~tejasp], +1 to redefine {{http.robots.agents}} as additional agent 
names: this makes it simpler for polite users, who definitely should use the 
same user agent name in the HTTP header and in robots.txt.
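
As an illustration of the proposal, here is a minimal sketch of how the list of robots.txt agent names could be assembled if {{http.robots.agents}} only held additional names. This is not the Nutch implementation: the class name, the method name, and the decision to drop a '*' entry are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RobotsAgentNames {

  /**
   * Hypothetical helper: combine the HTTP agent name with additional
   * robots.txt agent names (comma-separated). The HTTP agent name always
   * comes first; empty entries, duplicates and "*" are skipped.
   */
  public static List<String> build(String agentName, String additionalAgents) {
    Set<String> names = new LinkedHashSet<>();
    names.add(agentName.trim());
    if (additionalAgents != null) {
      for (String name : additionalAgents.split(",")) {
        String trimmed = name.trim();
        if (!trimmed.isEmpty() && !trimmed.equals("*")) {
          names.add(trimmed);
        }
      }
    }
    return new ArrayList<>(names);
  }

  public static void main(String[] args) {
    // Agent names taken from the example in the property description.
    System.out.println(build("BlurflDev", "Blurfl,*"));
  }
}
```

With this shape, a user who only sets http.agent.name gets the same name in the HTTP header and in robots.txt matching without further configuration.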

 update description of property http.robots.agent
 

 Key: NUTCH-1718
 URL: https://issues.apache.org/jira/browse/NUTCH-1718
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.7, 2.2, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1718-trunk.v1.patch


 The description of the property http.robots.agent in nutch-default.xml recommends 
 adding a '*' to the list of agent names. This causes the same problem as 
 described in NUTCH-1715, so the description should be updated. It should also 
 be corrected regarding the order of precedence, which since NUTCH-1031 is 
 dictated only by the ordering of user agents in robots.txt.
 {code:xml}
 <property>
   <name>http.robots.agents</name>
   <value>*</value>
   <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. You should
   put the value of http.agent.name as the first agent name, and keep the
   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
   </description>
 </property>
 {code}





[jira] [Created] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1726:


 Summary: HeadingsFilter does not find nested nodes
 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8


Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}

The getNodeValue() tries to read data from children but should traverse nodes 
instead.
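
The traversal idea can be sketched with the JDK's own DOM classes. This is not the code from the patch: the class and method names are hypothetical, and the snippet parses a small well-formed fragment only to show how recursing into child nodes recovers text that sits below a nested element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NestedHeadingText {

  /** Recursively append the value of every descendant text node. */
  static void collectText(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
      return;
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectText(children.item(i), out);
    }
  }

  /** Parse a well-formed fragment and return the text below its root element. */
  public static String headingText(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      StringBuilder out = new StringBuilder();
      collectText(doc.getDocumentElement(), out);
      return out.toString().trim();
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse fragment", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(headingText("<h1><span>apache nutch</span></h1>"));
  }
}
```

Calling getNodeValue() on the h1 element itself, or on its span child, returns null because element nodes carry no value; only the text node below the span does.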





[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1726:
-

Attachment: NUTCH-1726-trunk.patch

Patch for trunk, fixing the problem.

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2014-02-12 Thread Matzz (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899044#comment-13899044
 ] 

Matzz commented on NUTCH-961:
-

{quote}We don't use it BP anymore {quote}

Will BP integration be totally abandoned? Are there any plans to use another 
content extractor instead of Boilerpipe?

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.3, 1.8

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
 NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
 extract boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.





Is it possible to run Nutch 2.x with httpclient 3 and 4 simultaneously?

2014-02-12 Thread d_k
I'm looking into upgrading the httpclient version used by
protocol-httpclient because there are some fixes in the 4.x branch that I
need. I realized it will be impossible to do without breaking gora 3
support of hbase 90.x, because the latter still uses httpclient 3.

So I was wondering how bad it would be if I upgraded the dependencies to
httpclient 4 and changed protocol-httpclient to use the version 4 API without
touching any gora/hbase code. Since the package name has changed, the new
library should not affect code that doesn't import the new package, so could
the two versions be loaded and live side by side?

From the looks of it, having two versions of the same library sounds like a
bad idea, but I'd be happy to hear an opinion on the subject.
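
One way to check the side-by-side idea is to ask the JVM where each client class is loaded from. The sketch below is only a probe under assumptions: the class ClasspathProbe is hypothetical, and it relies on HttpClient 3.x living in the org.apache.commons.httpclient package and HttpClient 4.x in org.apache.http. It runs without either jar and then simply reports that the class is missing.

```java
import java.security.CodeSource;

public class ClasspathProbe {

  /** Return the jar or directory a class was loaded from, or a note if it is absent. */
  public static String locate(String className) {
    try {
      Class<?> cls = Class.forName(className);
      CodeSource source = cls.getProtectionDomain().getCodeSource();
      if (source == null || source.getLocation() == null) {
        return "no code source (loaded by the JVM itself)";
      }
      return source.getLocation().toString();
    } catch (ClassNotFoundException e) {
      return "not on classpath";
    }
  }

  public static void main(String[] args) {
    // 3.x and 4.x use different package names, so both lookups can succeed
    // at the same time if both jars are present.
    System.out.println("3.x: " + locate("org.apache.commons.httpclient.HttpClient"));
    System.out.println("4.x: " + locate("org.apache.http.client.HttpClient"));
  }
}
```

If both lines print a jar location, the two libraries resolve independently. That says nothing about transitive dependencies they may share, which would still need checking.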


Re: [DISCUSS] Release Trunk

2014-02-12 Thread Julien Nioche
Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been
committed. What about releasing 1.8 from trunk? If so, any volunteers?

Julien


On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi,

 +1 to release soon (this year, or early next year)

  and probably a few others but they could also be done later.
 At least, these should be done before releasing:
 NUTCH-1646 IndexerMapReduce to consider DB status
 NUTCH-1413 Record response time

 Sebastian

 On 11/28/2013 05:49 PM, Julien Nioche wrote:
  Hi Lewis
 
  We've done quite a few things in 1.x since the previous release (e.g.
 generic deduplication,
  removing indexer.solr package, etc...)  and the next 2.x release will be
 after the changes to GORA
  have been made, tested and used on the Nutch side so that could be quite
 a while.
 
  I am neutral as to whether we should do a 1.x release now. There are
 some minor issues that we could
  do in 1.x before the next release like :
  * https://issues.apache.org/jira/browse/NUTCH-1360
  * https://issues.apache.org/jira/browse/NUTCH-1676
  and probably a few others but they could also be done later.
 
  Let's hear what others think.
 
  Thanks
 
  Julien
 
 
 
 
  On 28 November 2013 16:34, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com
  mailto:lewis.mcgibb...@gmail.com wrote:
 
  Hi Folks,
  Thread says it all.
  There are some hot tickets over in Gora right now so I think holding
 off the next while for a
  2.x release would be wise.
  I can spin the RC for trunk tonight/tomorrow/weekend if we get the
 thumbs up.
  Ta
  Lewis
 
  --
  /Lewis/
 
 
 
 
  --
  *
  *Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Created] (NUTCH-1727) Length of the Tlds

2014-02-12 Thread Sertac TURKEL (JIRA)
Sertac TURKEL created NUTCH-1727:


 Summary: Length of the Tlds
 Key: NUTCH-1727
 URL: https://issues.apache.org/jira/browse/NUTCH-1727
 Project: Nutch
  Issue Type: Bug
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.1


The maximum length of the TLD should be configurable: there are valid TLDs such as 
.travel, and the url-validator plugin filters out URLs of this type.





[jira] [Updated] (NUTCH-1727) Length of the Tlds

2014-02-12 Thread Sertac TURKEL (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sertac TURKEL updated NUTCH-1727:
-

Attachment: NUTCH-1727.patch

I had a look at domain-suffix.xml and saw that the longest domain suffix has 
8 characters (.internal). For this reason I picked 8 as the default value and 
prepared a patch. Could you review my patch?
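
For readers without the patch at hand, the idea of a configurable TLD length can be sketched as follows. This is not the code from NUTCH-1727.patch: the class, the method, the lower bound of 2 and the limit of 4 used for comparison in main are assumptions chosen for illustration. Only the default of 8 comes from the comment above.

```java
public class TldLengthCheck {

  /** Default taken from the comment above (".internal" has 8 characters). */
  public static final int DEFAULT_MAX_TLD_LENGTH = 8;

  /** Accept a host whose last label is 2..maxTldLength ASCII letters long. */
  public static boolean hasValidTld(String host, int maxTldLength) {
    int dot = host.lastIndexOf('.');
    if (dot < 0 || dot == host.length() - 1) {
      return false; // no TLD present
    }
    String tld = host.substring(dot + 1);
    if (tld.length() < 2 || tld.length() > maxTldLength) {
      return false;
    }
    for (int i = 0; i < tld.length(); i++) {
      char c = tld.charAt(i);
      boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
      if (!asciiLetter) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // With a hypothetical limit of 4, ".travel" (6 letters) is rejected.
    System.out.println(hasValidTld("example.travel", 4));
    // With the proposed default of 8 it is accepted.
    System.out.println(hasValidTld("example.travel", DEFAULT_MAX_TLD_LENGTH));
  }
}
```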

 Length of the Tlds
 --

 Key: NUTCH-1727
 URL: https://issues.apache.org/jira/browse/NUTCH-1727
 Project: Nutch
  Issue Type: Bug
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.1

 Attachments: NUTCH-1727.patch


 The maximum length of the TLD should be configurable: there are valid TLDs such as 
 .travel, and the url-validator plugin filters out URLs of this type.





Re: [DISCUSS] Release Trunk

2014-02-12 Thread Tejas Patil
Just saw the commits since 1.7 release. Apart from trivial bug fixes, we
have some significant patches since 1.7.
+1 for new release. I would be happy to volunteer / help.

Thanks,
Tejas



On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi guys,

 At least 2 of the issues that Seb and I had mentioned have now been
 committed. What about releasing 1.8 from trunk? If so, any volunteers?

 Julien





[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-12 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1726:
--

Attachment: NUTCH-1726-trunk-v2.patch

Added a test case to check the HeadingsFilter patch. :)

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.


