[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432
 ] 

lufeng commented on NUTCH-1726:
---

Hi Markus. 

But I didn't find any error using your newest patch. 

{code:xml}
test:
[echo] Testing plugin: headings
[junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec

BUILD SUCCESSFUL
Total time: 3 seconds
{code}

* Maybe you can truncate long headings whenever their size is larger than the 
value of the maxlength option, so the headings.truncate option can be removed.




> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.
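
A minimal sketch of the traversal the description asks for, not the code from the attached patches. It assumes a plain org.w3c.dom tree; the class name NestedHeadingText, its helper methods and the sample markup are hypothetical, and the JDK XML parser stands in for Nutch's HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class NestedHeadingText {

    // Walk the whole subtree and append the value of every text node.
    // getNodeValue() is null for element nodes, so reading it on the
    // heading (or on an element child) misses text that is nested deeper.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Text of a heading element, including text inside nested elements.
    static String headingText(Node heading) {
        StringBuilder sb = new StringBuilder();
        collectText(heading, sb);
        return sb.toString().trim();
    }

    // Convenience wrapper: parse well-formed markup and read the root element.
    static String headingTextFromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return headingText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }
}
```

With a heading whose text sits inside an anchor and a nested inline element, the recursive walk still collects all of the text, which a one-level read of the children would not.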



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900352#comment-13900352
 ] 

Markus Jelsma commented on NUTCH-1726:
--

lufeng, it seems one of your unit tests fails. Is something wrong with the 
test, or is my fix just not correct?  :)

> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





[jira] [Updated] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-02-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN updated NUTCH-1728:
-

Attachment: NUTCH-1728.patch

Fix bug.

> indexer-solr plugin is not delete docs from solr
> 
>
> Key: NUTCH-1728
> URL: https://issues.apache.org/jira/browse/NUTCH-1728
> Project: Nutch
>  Issue Type: Bug
> Reporter: İlhami KALKAN
> Attachments: NUTCH-1728.patch
>
>
> The "delete" variable used in the delete(String key) method is never set.





[jira] [Created] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-02-13 Thread JIRA
İlhami KALKAN created NUTCH-1728:


 Summary: indexer-solr plugin is not delete docs from solr
 Key: NUTCH-1728
 URL: https://issues.apache.org/jira/browse/NUTCH-1728
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN


The "delete" variable used in the delete(String key) method is never set.





[jira] [Updated] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.

2014-02-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN updated NUTCH-1725:
-

Attachment: NUTCH-1725.patch

I fixed the bug.

> CleaningJob's reducer does not commit deleted docs. 
> 
>
> Key: NUTCH-1725
> URL: https://issues.apache.org/jira/browse/NUTCH-1725
> Project: Nutch
>  Issue Type: Bug
> Reporter: İlhami KALKAN
> Attachments: NUTCH-1725.patch
>
>
> In the cleanup(Context context) method, the "if" condition has a logic error.





[jira] [Reopened] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.

2014-02-13 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN reopened NUTCH-1725:
--


> CleaningJob's reducer does not commit deleted docs. 
> 
>
> Key: NUTCH-1725
> URL: https://issues.apache.org/jira/browse/NUTCH-1725
> Project: Nutch
>  Issue Type: Bug
> Reporter: İlhami KALKAN
>
> In the cleanup(Context context) method, the "if" condition has a logic error.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2014-02-13 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900180#comment-13900180
 ] 

Markus Jelsma commented on NUTCH-961:
-

I am sorry, I did not mean to speak for the Nutch PMC at all; "we are not using 
BP" means I am not using BP. As I said before, I am happy to commit this issue 
if the linked issues are resolved first.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 2.3, 1.8
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
> NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
> NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
> NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.





[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-13 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1726:
-

Attachment: NUTCH-1726-trunk.patch

Thanks Lufeng! Here's another patch with additional options. Because the filter 
now finds nested nodes, it suddenly returned a lot of headings for a given URL.

* headings.limit: similar to multivalued, but limits the number of headings 
per element
* headings.maxlength: maximum length of a heading
* headings.truncate: what to do with headings that are too long (truncate or 
skip them)
* headings.minlength: minimum length of a heading
* headings.ignore.hyperlinks: ignore headings inside anchors

The headings.ignore.hyperlinks option does not work despite the 
nodewalker.skipChildren() call. I haven't figured this out yet.
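
How the length and limit options listed in this comment could interact can be sketched as follows. This is an illustration under assumptions, not the patch itself: the class HeadingOptionsSketch, the method apply, its parameter order and the order in which the checks run are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the options, not the code from the patch.
class HeadingOptionsSketch {

    // limit     : maximum number of headings to keep (headings.limit)
    // minLength : shorter headings are skipped (headings.minlength)
    // maxLength : longest heading allowed (headings.maxlength)
    // truncate  : if true, cut long headings down to maxLength;
    //             if false, skip them (headings.truncate)
    static List<String> apply(List<String> headings, int limit,
                              int minLength, int maxLength, boolean truncate) {
        List<String> kept = new ArrayList<>();
        for (String heading : headings) {
            if (kept.size() >= limit) {
                break; // enough headings collected
            }
            String value = heading.trim();
            if (value.length() < minLength) {
                continue; // too short
            }
            if (value.length() > maxLength) {
                if (!truncate) {
                    continue; // too long and truncation is disabled
                }
                value = value.substring(0, maxLength);
            }
            kept.add(value);
        }
        return kept;
    }
}
```

In this sketch the limit counts only headings that survive the length checks; a real implementation may count differently.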

> HeadingsFilter does not find nested nodes
> -
>
> Key: NUTCH-1726
> URL: https://issues.apache.org/jira/browse/NUTCH-1726
> Project: Nutch
>  Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
> NUTCH-1726-trunk.patch
>
>
> Filter won't find:
> {code}
> apache nutch
> {code}
> The getNodeValue() tries to read data from children but should traverse nodes 
> instead.





RE: [DISCUSS] Release Trunk

2014-02-13 Thread Markus Jelsma
Seems some of my mails to the list are not coming through. I am -1 on release 
from trunk as is. The segment merger is still broken and in my opinion we 
cannot push yet another release with a broken segment merger.

Markus

-Original message-
From: Tejas Patil
Sent: Thursday 13th February 2014 1:33
To: dev@nutch.apache.org
Subject: Re: [DISCUSS] Release Trunk

Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have 
some significant patches since 1.7.

+1 for new release. I would be happy to volunteer / help.

Thanks,

Tejas

On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

Hi guys,

At least 2 of the issues that Seb and I had mentioned have now been committed. 
What about releasing 1.8 from trunk? If so, any volunteers?

Julien

On 2 December 2013 21:02, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

Hi,

+1 to release soon (this year, or early next year)

> and probably a few others but they could also be done later.

At least, these should be done before releasing:

NUTCH-1646 IndexerMapReduce to consider DB status

NUTCH-1413 Record response time

Sebastian

On 11/28/2013 05:49 PM, Julien Nioche wrote:

> Hi Lewis
>
> We've done quite a few things in 1.x since the previous release (e.g. generic
> deduplication, removing indexer.solr package, etc...) and the next 2.x release
> will be after the changes to GORA have been made, tested and used on the Nutch
> side so that could be quite a while.
>
> I am neutral as to whether we should do a 1.x release now. There are some
> minor issues that we could do in 1.x before the next release like:
> * https://issues.apache.org/jira/browse/NUTCH-1360
> * https://issues.apache.org/jira/browse/NUTCH-1676
> and probably a few others but they could also be done later.
>
> Let's hear what others think.
>
> Thanks
>
> Julien
>
> On 28 November 2013 16:34, Lewis John Mcgibbney wrote:
>
>     Hi Folks,
>     Thread says it all.
>     There are some hot tickets over in Gora right now so I think holding off
>     the next while for a 2.x release would be wise.
>     I can spin the RC for trunk tonight/tomorrow/weekend if we get the thumbs
>     up.
>     Ta
>     Lewis
>
>     --
>     /Lewis/
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/ 

http://www.digitalpebble.com 
http://twitter.com/digitalpebble 




Re: [DISCUSS] Release Trunk

2014-02-13 Thread Tejas Patil
Thanks Lewis. G+ hangout sounds cool. Is this wiki page complete and
up to date enough to start from?
http://wiki.apache.org/nutch/Release_HOWTO

Thanks,
Tejas


On Thu, Feb 13, 2014 at 12:23 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> @Tejasp
>
> On Thu, Feb 13, 2014 at 6:30 AM,  wrote:
>
>> Just saw the commits since 1.7 release. Apart from trivial bug fixes, we
>> have some significant patches since 1.7.
>> +1 for new release. I would be happy to volunteer / help.
>>
>>
>>
> If you're game for learning the release manager role then I'm +1 to
> support you in that. We can do G+ hangout whilst you do it so that it all
> goes smoothly.
> If you change your mind just let me know and I'll push an RC today.
> Great work on trunk folks... lots of fixes ;)
> Lewis
>


[jira] [Commented] (NUTCH-1727) Length of the Tlds

2014-02-13 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900124#comment-13900124
 ] 

Lewis John McGibbney commented on NUTCH-1727:
-

The issue looks fine to me; however, some trivial unit tests would be nice.
This issue could also be applied to trunk. Any comments? 

> Length of the Tlds
> --
>
> Key: NUTCH-1727
> URL: https://issues.apache.org/jira/browse/NUTCH-1727
> Project: Nutch
>  Issue Type: Bug
> Reporter: Sertac TURKEL
> Priority: Minor
> Fix For: 2.1
>
> Attachments: NUTCH-1727.patch
>
>
> The allowed length of the TLD should be configurable: there are valid TLDs 
> such as .travel, and the url-validator plugin filters out URLs of this type.
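
The request can be illustrated with a small sketch. This is not the attached patch or the url-validator plugin's code; the class TldLengthSketch, the method hasValidTld and its minLength/maxLength parameters are hypothetical and stand for whatever configuration the plugin would read.

```java
// Hypothetical illustration of a TLD check with configurable length bounds.
class TldLengthSketch {

    // Returns true if the last label of the host consists of ASCII letters
    // only and its length lies within [minLength, maxLength].
    static boolean hasValidTld(String host, int minLength, int maxLength) {
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return false; // no dot, or nothing after the last dot
        }
        String tld = host.substring(dot + 1);
        if (tld.length() < minLength || tld.length() > maxLength) {
            return false;
        }
        for (int i = 0; i < tld.length(); i++) {
            char c = tld.charAt(i);
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!asciiLetter) {
                return false;
            }
        }
        return true;
    }
}
```

With an upper bound of 4, a host ending in .travel is rejected; raising the bound to 6 or more accepts it, which is the behaviour the reporter wants to be able to choose.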





Re: [DISCUSS] Release Trunk

2014-02-13 Thread Lewis John Mcgibbney
Hi Folks,
@Tejasp

On Thu, Feb 13, 2014 at 6:30 AM,  wrote:

> Just saw the commits since 1.7 release. Apart from trivial bug fixes, we
> have some significant patches since 1.7.
> +1 for new release. I would be happy to volunteer / help.
>
>
>
If you're game for learning the release manager role then I'm +1 to support
you in that. We can do G+ hangout whilst you do it so that it all goes
smoothly.
If you change your mind just let me know and I'll push an RC today.
Great work on trunk folks... lots of fixes ;)
Lewis