[jira] [Updated] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true

2014-02-14 Thread Dmitry Cherniachenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Cherniachenko updated NUTCH-1525:


Attachment: nutch-logExternal.patch

Attached the patch for Nutch 1.7

With it applied you can add the following to log4j.properties
{code}
log4j.logger.org.apache.nutch.parse.ParseOutputFormat.externalLinks=INFO,extlinks

log4j.appender.extlinks=org.apache.log4j.DailyRollingFileAppender
log4j.appender.extlinks.File=${hadoop.log.dir}/external-links.log
log4j.appender.extlinks.DatePattern='.'yyyy-MM-dd
log4j.appender.extlinks.layout=org.apache.log4j.PatternLayout
log4j.appender.extlinks.layout.ConversionPattern=%m%n
{code}

And then all the ignored external links will be logged cleanly to 
external-links.log
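
For readers without the patch at hand, here is a minimal sketch of the kind of check that decides which outlinks count as "external" and would therefore be written to the dedicated logger. It assumes that db.ignore.external.links compares the host of the source page with the host of each outlink; the class and method names (ExternalLinkSketch, hostOf, ignoredExternalLinks) are hypothetical and are not taken from nutch-logExternal.patch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the code from the attached patch.
class ExternalLinkSketch {

    /** Returns the lower-cased host of a URL, or null if it has none or cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Collects the outlinks whose host differs from the host of the source page. */
    static List<String> ignoredExternalLinks(String fromUrl, List<String> outlinks) {
        String fromHost = hostOf(fromUrl);
        List<String> ignored = new ArrayList<>();
        for (String toUrl : outlinks) {
            String toHost = hostOf(toUrl);
            // Assumption: a link is external when the hosts are not identical.
            if (toHost == null || !toHost.equals(fromHost)) {
                ignored.add(toUrl);
            }
        }
        return ignored;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.org/about",
                "http://other.example.com/page");
        // In the patched code each of these would go to the externalLinks logger instead.
        for (String link : ignoredExternalLinks("http://example.org/", outlinks)) {
            System.out.println(link);
        }
    }
}
```

With the log4j configuration above, each such link would be emitted as a single %m%n line, which is what keeps external-links.log free of timestamps and log levels.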

 Generator to record external links even when  db.ignore.external.links set to 
 true
 --

 Key: NUTCH-1525
 URL: https://issues.apache.org/jira/browse/NUTCH-1525
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.4

 Attachments: nutch-logExternal.patch


 When fetching pages from specific domains we have various options e.g. use 
 urlfilters, set the above property to true before injecting urls into the 
 webdb etc. However with the former, it is recognised that complex regex can 
 slow down processing and with the latter it means we disregard a number of 
 urls which could potentially become useful in the future.
 Unfortunately there is no way to record external links encountered for future 
 processing, although the wiki suggests that a very small patch to the 
 generator code can allow you to log these links to hadoop.log. Although this 
 is better, a more robust storage mechanism would be preferred. This may tie 
 in with custom counters we've already specified or may require new counters 
 to be implemented.  



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1525) Generator to record external links even when db.ignore.external.links set to true

2014-02-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901313#comment-13901313
 ] 

Lewis John McGibbney commented on NUTCH-1525:
-

[~sabio], thank you for the patch. I totally forgot about this issue. 
Can we verify if we are able to derive Hadoop counters as well as/instead of 
simple logging?
If we can obtain counters then it is much easier to analyze the number of 
external links we filter.
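
To make the counter idea concrete, here is a rough sketch of the tallying that counters would provide. It uses a plain in-memory map as a stand-in; in a real Hadoop job the increment would go through the job's counter API rather than a map. The class name ExternalLinkCounterSketch and the "(unparseable)" bucket are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for Hadoop counters; it only tallies in memory.
class ExternalLinkCounterSketch {

    /** Counts ignored external links per target host. */
    static Map<String, Long> countByHost(List<String> ignoredLinks) {
        Map<String, Long> counts = new TreeMap<>();
        for (String link : ignoredLinks) {
            String host;
            try {
                host = new URI(link).getHost();
            } catch (URISyntaxException e) {
                host = null;
            }
            String key = host == null ? "(unparseable)" : host.toLowerCase(Locale.ROOT);
            // In a Hadoop task this line would be a counter increment instead.
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countByHost(Arrays.asList(
                "http://a.example/x", "http://a.example/y", "http://b.example/"));
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A single total counter would answer "how many external links did we filter"; the per-host breakdown is shown only because it is cheap to compute from the same data.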



[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes

2014-02-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13901363#comment-13901363
 ] 

Markus Jelsma commented on NUTCH-1726:
--

Hi lufeng!

I don't understand: I have a clean Apache Nutch headings plugin, and the same 
test fails for both my patch and your patch. 

{code}
Testcase: testIt took 1.489 sec
Testcase: testMultiValueMetatags took 0.185 sec
FAILED
One value of metatag with multiple values is missing: Test header h2 with span
junit.framework.AssertionFailedError: One value of metatag with multiple values 
is missing: Test header h2 with span
at 
org.apache.nutch.parse.headings.TestHeadingsParseFilter.testMultiValueMetatags(TestHeadingsParseFilter.java:97)

{code}

I added truncate because perhaps some users may want to ignore long headers 
instead of truncating them. If I get a header containing 2 KB of text, I think 
I would like to skip it, not truncate it.

Markus

 HeadingsFilter does not find nested nodes
 -

 Key: NUTCH-1726
 URL: https://issues.apache.org/jira/browse/NUTCH-1726
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.8

 Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, 
 NUTCH-1726-trunk.patch


 Filter won't find:
 {code}
 <h1><span>apache nutch</span></h1>
 {code}
 The getNodeValue() tries to read data from children but should traverse nodes 
 instead.
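
The traversal described above can be sketched with the standard org.w3c.dom API. This is a minimal illustration under stated assumptions, not the code from the attached patches: the class NestedTextSketch and its methods are hypothetical, and the input is parsed as well-formed XML for simplicity, whereas Nutch works on a DOM produced by its HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration of collecting text from nested nodes.
class NestedTextSketch {

    /** Returns the concatenated text of all text nodes below the given node. */
    static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
            return;
        }
        // Element nodes carry no value of their own, so descend into the children.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), sb);
        }
    }

    /** Parses a small well-formed snippet and returns its root element. */
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(parseRoot("<h1><span>apache nutch</span></h1>")));
    }
}
```

Reading only the direct children's values misses the text here because the only child of h1 is the span element, whose own node value is null; the text lives one level further down.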





Re: [DISCUSS] Release Trunk

2014-02-14 Thread Sebastian Nagel
Hi,

-1 also from me for now.

Beside SegmentMerger (NUTCH-1113) there is a problem in indexer 
(NUTCH-1706/NUTCH-1646)
which should be fixed. I hope to tackle both issues soon.

Sebastian


On 02/13/2014 10:19 AM, Markus Jelsma wrote:
 Seems some of my mails to the list are not coming through. I am -1 on release 
 from trunk as is. The segment merger is still broken and in my opinion we 
 cannot push yet another release with a broken segment merger.
 
 Markus
 
 -Original message-
 From: Tejas Patil <tejas.patil...@gmail.com>
 Sent: Thursday 13th February 2014 1:33
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Release Trunk
 
 Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have 
 some significant patches since 1.7.
 
 +1 for new release. I would be happy to volunteer / help.
 
 Thanks,
 
 Tejas
 
 On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
 
 Hi guys,
 
 At least 2 of the issues that Seb and I had mentioned have now been 
 committed. What about releasing 1.8 from trunk? If so, any volunteers?
 
 Julien
 
 On 2 December 2013 21:02, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
 
 Hi,
 
 +1 to release soon (this year, or early next year)
 
 and probably a few others but they could also be done later.
 
 At least, these should be done before releasing:
 
 NUTCH-1646 IndexerMapReduce to consider DB status
 
 NUTCH-1413 Record response time
 
 Sebastian
 
 On 11/28/2013 05:49 PM, Julien Nioche wrote:
 
 Hi Lewis
 

 
 We've done quite a few things in 1.x since the previous release (e.g. generic 
 deduplication, removing indexer.solr package, etc.) and the next 2.x release 
 will be after the changes to GORA have been made, tested and used on the Nutch 
 side, so that could be quite a while.
 
 I am neutral as to whether we should do a 1.x release now. There are some 
 minor issues that we could do in 1.x before the next release, like:
 
 * https://issues.apache.org/jira/browse/NUTCH-1360
 * https://issues.apache.org/jira/browse/NUTCH-1676
 
 and probably a few others, but they could also be done later.
 
 Let's hear what others think.
 
 Thanks
 
 Julien
 
 On 28 November 2013 16:34, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
 

 
 Hi Folks,
 
 Thread says it all.
 
 There are some hot tickets over in Gora right now so I think holding off 
 the next while for a 2.x release would be wise.
 
 I can spin the RC for trunk tonight/tomorrow/weekend if we get the 
 thumbs up.
 
 Ta
 
 Lewis
 
 --
 /Lewis/
 
 --
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 