Re: [DISCUSS] Release Trunk

2014-02-13 Thread Lewis John Mcgibbney
Hi Folks,
@Tejasp

On Thu, Feb 13, 2014 at 6:30 AM, dev-digest-h...@nutch.apache.org wrote:

 Just saw the commits since 1.7 release. Apart from trivial bug fixes, we
 have some significant patches since 1.7.
 +1 for new release. I would be happy to volunteer / help.



If you're game for learning the release manager role then I'm +1 to support
you in that. We can do G+ hangout whilst you do it so that it all goes
smoothly.
If you change your mind just let me know and I'll push an RC today.
Great work on trunk folks... lots of fixes ;)
Lewis


[jira] [Commented] (NUTCH-1727) Length of the Tlds

2014-02-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900124#comment-13900124
 ] 

Lewis John McGibbney commented on NUTCH-1727:
-

The issue looks fine to me; however, some trivial unit tests would be nice.
The fix could also be applied to trunk. Any comments?

 Length of the Tlds
 --

 Key: NUTCH-1727
 URL: https://issues.apache.org/jira/browse/NUTCH-1727
 Project: Nutch
  Issue Type: Bug
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.1

 Attachments: NUTCH-1727.patch


 The maximum length of the TLD should be configurable: there are valid TLDs such as
 .travel, and the url-validator plugin currently filters out URLs of this type.
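
 To illustrate the request, here is a minimal sketch of a TLD check with a
 configurable maximum length. It is not the url-validator plugin code and not
 the attached patch: the class name TldLengthCheck, the accepts() method and
 the maxTldLength parameter are all hypothetical, and the letters-only pattern
 is a simplification.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the actual url-validator plugin code. */
class TldLengthCheck {
    private final Pattern tldPattern;

    /** @param maxTldLength assumed configurable upper bound on the TLD length */
    TldLengthCheck(int maxTldLength) {
        if (maxTldLength < 2) {
            throw new IllegalArgumentException("maxTldLength must be at least 2");
        }
        // Simplification: letters only, between 2 and maxTldLength characters.
        this.tldPattern = Pattern.compile("^[A-Za-z]{2," + maxTldLength + "}$");
    }

    /** Returns true if the last label of the host fits the configured TLD length. */
    boolean accepts(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return false; // no TLD present
        }
        return tldPattern.matcher(host.substring(dot + 1)).matches();
    }
}
```

 With a limit of 4, a host ending in .travel is rejected; raising the limit to
 6 lets it through while shorter TLDs such as .com remain accepted.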



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [DISCUSS] Release Trunk

2014-02-13 Thread Tejas Patil
Thanks Lewis. A G+ hangout sounds cool. Is this wiki page complete and
up to date enough to start from?
http://wiki.apache.org/nutch/Release_HOWTO

Thanks,
Tejas


On Thu, Feb 13, 2014 at 12:23 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 @Tejasp

 On Thu, Feb 13, 2014 at 6:30 AM, dev-digest-h...@nutch.apache.org wrote:

 Just saw the commits since 1.7 release. Apart from trivial bug fixes, we
 have some significant patches since 1.7.
 +1 for new release. I would be happy to volunteer / help.



 If you're game for learning the release manager role then I'm +1 to
 support you in that. We can do G+ hangout whilst you do it so that it all
 goes smoothly.
 If you change your mind just let me know and I'll push an RC today.
 Great work on trunk folks... lots of fixes ;)
 Lewis



RE: [DISCUSS] Release Trunk

2014-02-13 Thread Markus Jelsma
Seems some of my mails to the list are not coming through. I am -1 on release 
from trunk as is. The segment merger is still broken and in my opinion we 
cannot push yet another release with a broken segment merger.

Markus

-Original message-
From: Tejas Patil tejas.patil...@gmail.com
Sent: Thursday 13th February 2014 1:33
To: dev@nutch.apache.org
Subject: Re: [DISCUSS] Release Trunk

Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have 
some significant patches since 1.7.

+1 for new release. I would be happy to volunteer / help.

Thanks,

Tejas

On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been committed. 
What about releasing 1.8 from trunk? If so, any volunteers?

Julien

On 2 December 2013 21:02, Sebastian Nagel wastl.na...@googlemail.com wrote:

Hi,

+1 to release soon (this year, or early next year)

 and probably a few others but they could also be done later.

At least, these should be done before releasing:

NUTCH-1646 IndexerMapReduce to consider DB status

NUTCH-1413 Record response time

Sebastian

On 11/28/2013 05:49 PM, Julien Nioche wrote:

 Hi Lewis



 We've done quite a few things in 1.x since the previous release (e.g. generic
 deduplication, removing the indexer.solr package, etc.), and the next 2.x release
 will come after the changes to GORA have been made, tested and used on the Nutch
 side, so that could be quite a while.

 I am neutral as to whether we should do a 1.x release now. There are some minor
 issues that we could address in 1.x before the next release, such as:

 * https://issues.apache.org/jira/browse/NUTCH-1360
 * https://issues.apache.org/jira/browse/NUTCH-1676

 and probably a few others, but they could also be done later.

 Let's hear what others think.



 Thanks



 Julien









 On 28 November 2013 16:34, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:



     Hi Folks,

     Thread says it all.

     There are some hot tickets over in Gora right now so I think holding off 
 the next while for a

     2.x release would be wise.

     I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs 
 up.

     Ta

     Lewis



     --

     /Lewis/









 --
 Open Source Solutions for Text Engineering
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble




[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1726:
-

Attachment: NUTCH-1726-trunk.patch

Thanks Lufeng! Here's another patch with additional options. Because the filter
now finds nested nodes, it suddenly returned a lot of headings for a given URL.

* headings.limit similar to multivalued, but limits the number of headings
per element
* headings.maxlength max length of a heading
* headings.truncate what to do with headings that are too long: truncate or
skip them?
* headings.minlength min length of a heading
* headings.ignore.hyperlinks ignores headings inside anchors

The headings.ignore.hyperlinks option does not work yet, despite the
nodewalker.skipChildren() call. I haven't figured this out yet.
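
As a rough illustration of how the limit, minlength, maxlength and truncate
options listed above could interact, here is a minimal sketch. It is not the
code from the attached patch: the class HeadingOptionsSketch and its apply()
method are hypothetical, and the order in which the checks are applied is an
assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the heading options; not the code from the patch. */
class HeadingOptionsSketch {

    /**
     * Filters headings of one element.
     *
     * @param limit     maximum number of headings to keep (headings.limit)
     * @param minLength headings shorter than this are skipped (headings.minlength)
     * @param maxLength headings longer than this are truncated or skipped (headings.maxlength)
     * @param truncate  true to truncate too-long headings, false to skip them (headings.truncate)
     */
    static List<String> apply(List<String> headings, int limit, int minLength,
                              int maxLength, boolean truncate) {
        List<String> kept = new ArrayList<>();
        for (String raw : headings) {
            if (kept.size() >= limit) {
                break; // enough headings collected for this element
            }
            String heading = raw.trim();
            if (heading.length() < minLength) {
                continue; // too short
            }
            if (heading.length() > maxLength) {
                if (!truncate) {
                    continue; // too long and truncation disabled: skip
                }
                heading = heading.substring(0, maxLength);
            }
            kept.add(heading);
        }
        return kept;
    }
}
```

In this sketch the limit counts only headings that survive the length checks;
counting every heading seen would be an equally plausible reading of the option.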

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.
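
 The difference between reading a node value and traversing child nodes can be
 sketched with the standard org.w3c.dom API. In the DOM, getNodeValue() on an
 element returns null, so the text of a nested <span> is only reachable by
 walking down to the text nodes. The class NestedTextSketch and its methods are
 hypothetical and only illustrate the idea; this is not the HeadingsFilter code
 from the patch, and it parses the snippet as XML for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch; not the HeadingsFilter implementation. */
class NestedTextSketch {

    /** Collects the text of a node by recursively visiting all descendants. */
    static String collectText(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue()); // only text nodes carry a value
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), sb);
        }
    }

    /** Parses a small well-formed snippet and returns the text of its root element. */
    static String headingText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return collectText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }
}
```

 Calling headingText() on the snippet from this issue yields the nested text,
 whereas getNodeValue() on the h1 element itself would return null.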





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2014-02-13 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900180#comment-13900180
 ] 

Markus Jelsma commented on NUTCH-961:
-

I am sorry, I did not mean to speak for the Nutch PMC at all; "we are not using
BP" means I am not using BP. As I said before, I am happy to commit this issue if
the linked issues are resolved first.

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.3, 1.8

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
 NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler, which can be used to
 extract boilerplate content from HTML pages. We should see how we can expose
 Boilerpipe in the Nutch configuration.





[jira] [Reopened] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.

2014-02-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN reopened NUTCH-1725:
--


 CleaningJob's reducer does not commit deleted docs. 
 

 Key: NUTCH-1725
 URL: https://issues.apache.org/jira/browse/NUTCH-1725
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN

 In the cleanup(Context context) method, the if condition has a logic error.





[jira] [Updated] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.

2014-02-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN updated NUTCH-1725:
-

Attachment: NUTCH-1725.patch

I fixed the bug.

 CleaningJob's reducer does not commit deleted docs. 
 

 Key: NUTCH-1725
 URL: https://issues.apache.org/jira/browse/NUTCH-1725
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN
 Attachments: NUTCH-1725.patch


 In the cleanup(Context context) method, the if condition has a logic error.





[jira] [Created] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-02-13 Thread JIRA
İlhami KALKAN created NUTCH-1728:


 Summary: indexer-solr plugin is not delete docs from solr
 Key: NUTCH-1728
 URL: https://issues.apache.org/jira/browse/NUTCH-1728
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN


The delete variable used in the delete(String key) method is never set.





[jira] [Updated] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-02-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN updated NUTCH-1728:
-

Attachment: NUTCH-1728.patch

Fix bug.

 indexer-solr plugin is not delete docs from solr
 

 Key: NUTCH-1728
 URL: https://issues.apache.org/jira/browse/NUTCH-1728
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN
 Attachments: NUTCH-1728.patch


 The delete variable used in the delete(String key) method is never set.





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900352#comment-13900352
 ] 

Markus Jelsma commented on NUTCH-1726:
--

lufeng, it seems one of your unit tests fails. Is something wrong with the test,
or is my fix just not correct?  :)

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.





[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432
 ] 

lufeng commented on NUTCH-1726:
---

Hi Markus. 

But I didn't find any error using your newest patch. 

{code:xml}
test:
[echo] Testing plugin: headings
[junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec

BUILD SUCCESSFUL
Total time: 3 seconds
{code}

* Maybe you can simply truncate long headings whose size is larger than the value
of the maxlength option, so the headings.truncate option can be removed.




 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.


