[jira] [Commented] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297097#comment-14297097
 ] 

Lewis John McGibbney commented on NUTCH-1889:
-

+1

 Store all values from Tika metadata in Nutch metadata
 -

 Key: NUTCH-1889
 URL: https://issues.apache.org/jira/browse/NUTCH-1889
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Julien Nioche
Priority: Trivial
 Fix For: 1.10

 Attachments: NUTCH-1889.patch


 Tika metadata can be multivalued but we currently keep only the first value 
 in the TikaParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Release Apache Nutch 1.10

2015-01-29 Thread Mattmann, Chris A (3980)
Thanks Lewis.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, January 29, 2015 at 12:09 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: [DISCUSS] Release Apache Nutch 1.10

Hi Folks,

So I've moved all remaining issues to Nutch 1.11
https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jir
a.jira-projects-plugin:roadmap-panel

Not sure if this is what we want the next trunk versioning to look like,
however it is an OK placeholder for the time being I hope.

If folks would like to see some particular patch make it in to trunk
before 1.10 then by all means please re-assign it to 1.10 and we can
review and get them in.

Thanks very much folks.
Lewis


On Wed, Jan 28, 2015 at 9:41 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

Hi Folks,

https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jir
a.jira-projects-plugin:roadmap-panel


52 of 211 issues assigned against 1.10 looks pretty good to me and I
would be +1 for pushing a release.

Does anyone want to get any tickets in there? Does anyone have objections
to releasing?

Thanks
Lewis

-- 
Lewis












-- 
Lewis





Re: Option to disable Robots Rule checking

2015-01-29 Thread Mattmann, Chris A (3980)
Yay!

OK, I will go ahead and start work on it. Thank you all!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Markus Jelsma markus.jel...@openindex.io
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, January 29, 2015 at 3:35 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: RE: Option to disable Robots Rule checking

I am happy with this alternative! :)
 
-Original message-
 From:Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
 Sent: Thursday 29th January 2015 1:21
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Option to disable Robots Rule checking
 
 Seb, I like this idea. What do you think, Lewis and Markus? Thanks, that
would help me and my use case.
 
 Sent from my iPhone
 
  On Jan 28, 2015, at 3:17 PM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
  
  Hi Markus, hi Chris, hi Lewis,
  
  -1 from me
  
  A well-documented property is just an invitation to
  disable robots rules. A hidden property is also no
  alternative because it will be soon documented
  in our mailing lists or somewhere on the web.
  
  And shall we really remove or reformulate
  Our software obeys the robots.txt exclusion standard
  on http://nutch.apache.org/bot.html ?
  
  Since the agent string sent in the HTTP request always contains
/Nutch-x.x
  (it would require also a patch to change it) I wouldn't make
  it too easy to make Nutch ignore robots.txt.
  
  As you already stated too, we have properties in Nutch that can
  turn Nutch into a DDOS crawler with or without robots.txt rule
  parsing. We set these properties to *sensible defaults*.
  
  If the robots.txt is obeyed web masters can even prevent this
  by adding a Crawl-delay rule to their robots.txt.
  
  (from Chris):
  but there are also good [security research] uses of as well
  
  (from Lewis):
  I've met many web admins recently that want to search and index
  their entire DNS but do not wish to disable their robots.txt filter
  in order to do so.
  
  Ok, these are valid use cases. They have in common that
  the Nutch user owns the crawled servers or is (hopefully)
  explicitly allowed to perform the security research.
  
  
  What about an option (or config file) to exclude explicitly
  a list of hosts (or IPs) from robots.txt parsing?
  That would require more effort to configure than a boolean property
  but because it's explicit, it prevents users from disabling
  robots.txt in general and also guarantees that
  the security research is not accidentally extended.
  
  
  Cheers,
  Sebastian
  
  
  On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
  Hi Markus,
  
  Thanks for chiming in. I’m reading the below and I see you
  agree that it should be configurable, but you state that
  because Nutch is an Apache project, you dismiss the configuration
  option. What about it being an Apache project makes it any less
  ethical to simply have a configurable option that is turned off
  by default that allows the Robot rules to be disabled?
  
  For full disclosure I am looking into re-creating DDOS and other
  attacks doing some security research and so I have valid use cases
  here for wanting to do so. You state it’s easy to patch Nutch (you
  are correct for that matter, it’s a 2 line patch to Fetcher.java
  to disable the RobotRules check). However, how is it any less easy
  to have a 1 line patch that someone would have to apply to *override*
  the *default* behavior I’m suggesting of RobotRules being on in
  nutch-default.xml? So what I’m stating literally in code is:
  
  1. adding a property like nutch.robots.rules.parser and setting its
  default value to true, which enables the robot rules parser, putting
  this property say even at the bottom of nutch-default.xml and
  stating that improper use of this property in regular situations of
  whole web crawls can really hurt your crawling of a site.
  
  2. Having a check in Fetcher.java that checks for this property, if
it’s
  on, default behavior, if it’s off, skips the check.
  
  The benefit being you don’t encourage people like me (and lots of
  others that I’ve talked to) who would like to use Nutch for some
  security research for crawling to simply go fork it for a 1 line code
  change. Really? Is that what you want to encourage? The really
negative
  part about that is that it will encourage me to simply use that
forked
  version. I could maintain a patch file, and apply that, but 

[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

2015-01-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296621#comment-14296621
 ] 

Julien Nioche commented on NUTCH-1918:
--

Quite an important issue for those who extract data with Xpath and Tika, so I'd 
like to see it in 1.10
Will commit soon unless someone objects

 TikaParser specifies a default namespace when generating DOM
 

 Key: NUTCH-1918
 URL: https://issues.apache.org/jira/browse/NUTCH-1918
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
 Fix For: 1.10

 Attachments: NUTCH-1918.patch


 The DOM generated by parse-tika differs from the one done by parse-html. 
 Ideally we should be able to use either parsers with the same XPath 
 expressions.
 This is related to [NUTCH-1592], but this time instead of being a matter of 
 uppercases, the problem comes from the namespace used. 
 This issue has been investigated and fixed in storm-crawler 
 [https://github.com/DigitalPebble/storm-crawler/pull/58].
 Here is what Guillaume explained there :
 bq. When parsing the content, Tika creates a properly formatted XHTML 
 document: all elements are created within the namespace XHTML.
 bq. However in XPath 1.0, there's no concept of default namespace so XPath 
 expressions such as //BODY doesn't match anything. To make this work we 
 should use //ns1:BODY and define a NamespaceContext which associates ns1 with 
 http://www.w3.org/1999/xhtml;
 bq. To keep the XPathExpressions simpler, I modified the DOMBuilder which is 
 our SaxHandler used to convert the SAX Events into a DOM tree to ignore a 
 default name space and the ParserBolt initializes it with the XHTML 
 namespace. This way //BODY matches.
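To make the namespace problem concrete, here is a minimal, self-contained Java illustration (not Nutch or storm-crawler code; the class name and sample markup are invented for the demo): with a namespace-aware DOM, the unprefixed XPath 1.0 expression matches nothing, while a prefixed expression bound to the XHTML namespace through a NamespaceContext does.

{code:java}
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlXPathDemo {
  public static void main(String[] args) throws Exception {
    String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<body><p>hello</p></body></html>";
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    Document doc = dbf.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));

    XPath xpath = XPathFactory.newInstance().newXPath();
    // No namespace context: the unprefixed expression matches nothing
    System.out.println(xpath.evaluate("count(//body)", doc));     // prints 0

    // Bind a prefix to the XHTML namespace and use it in the expression
    xpath.setNamespaceContext(new NamespaceContext() {
      public String getNamespaceURI(String prefix) {
        return "ns1".equals(prefix) ? "http://www.w3.org/1999/xhtml"
                                    : XMLConstants.NULL_NS_URI;
      }
      public String getPrefix(String uri) { return null; }
      public Iterator<String> getPrefixes(String uri) {
        return Collections.<String>emptyList().iterator(); // not needed here
      }
    });
    System.out.println(xpath.evaluate("count(//ns1:body)", doc)); // prints 1
  }
}
{code}

The DOMBuilder change described above takes the other route: by dropping the default namespace while building the DOM tree, the unprefixed expressions such as //BODY keep working.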



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-01-29 Thread Michiel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michiel updated NUTCH-1922:
---
Attachment: NUTCH-1922.patch

Patch from NUTCH-1679 adapted for implementation into 2.3

 DbUpdater overwrites fetch status for URLs from previous batches, causes 
 repeated re-fetches
 

 Key: NUTCH-1922
 URL: https://issues.apache.org/jira/browse/NUTCH-1922
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Gerhard Gossen
 Attachments: NUTCH-1922.patch


 When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
 resets the fetch status of that URL to {{unfetched}}. This makes this URL 
 available for a re-fetch, even if its crawl interval is not yet over.
 To reproduce, using version 2.3:
 {code}
 # Nutch configuration
 ant runtime
 cd runtime/local
 mkdir seeds
 echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
 bin/crawl seeds test 2
 {code}
 This uses two files {{a.html}} and {{b.html}} that link to each other.
 In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
 In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
 This should update the score and link fields of {{a.html}}, but not the fetch 
 status. However, when I run {{bin/nutch readdb -crawlId test -url 
 http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
 {{status: 1 (status_unfetched)}}.
 Expected would be {{status: 2 (status_fetched)}}.
 The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
 processed in the same batch always belong to new 
 pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
 job, but that change skipped all pages with a different batch ID, so I assume 
 that this introduced this behavior.
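As a purely illustrative sketch of the expected behaviour (this is neither the DbUpdateReducer code nor the attached patch; names and status codes are simplified), the idea is that an inlink discovered in a later batch should leave an existing page's fetch status untouched, and only a genuinely new page should start out as unfetched:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class StatusUpdateSketch {
  static final int STATUS_UNFETCHED = 1;
  static final int STATUS_FETCHED = 2;

  /** Status a page should end up with after an inlink from another batch is processed. */
  static int statusAfterInlink(Integer existingStatus) {
    // Only a page that is not yet in the crawl DB starts out as unfetched;
    // a page from an earlier batch keeps whatever status it already had.
    return existingStatus == null ? STATUS_UNFETCHED : existingStatus;
  }

  public static void main(String[] args) {
    Map<String, Integer> crawlDb = new HashMap<>();
    crawlDb.put("http://www.l3s.de/~gossen/nutch/a.html", STATUS_FETCHED); // fetched in batch 1

    // the inlink found in batch 2 must not reset a.html to unfetched
    System.out.println(statusAfterInlink(crawlDb.get("http://www.l3s.de/~gossen/nutch/a.html"))); // 2
    // b.html is new, so it starts as unfetched
    System.out.println(statusAfterInlink(crawlDb.get("http://www.l3s.de/~gossen/nutch/b.html"))); // 1
  }
}
{code}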



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-01-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296602#comment-14296602
 ] 

Lewis John McGibbney commented on NUTCH-1922:
-

[~Michiel] what are your thoughts on the [last 
comment|https://issues.apache.org/jira/browse/NUTCH-1679?focusedCommentId=14069567page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14069567]
 made by [~alxksn] on NUTCH-1679?

 DbUpdater overwrites fetch status for URLs from previous batches, causes 
 repeated re-fetches
 

 Key: NUTCH-1922
 URL: https://issues.apache.org/jira/browse/NUTCH-1922
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Gerhard Gossen
 Fix For: 2.4

 Attachments: NUTCH-1922.patch


 When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
 resets the fetch status of that URL to {{unfetched}}. This makes this URL 
 available for a re-fetch, even if its crawl interval is not yet over.
 To reproduce, using version 2.3:
 {code}
 # Nutch configuration
 ant runtime
 cd runtime/local
 mkdir seeds
 echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
 bin/crawl seeds test 2
 {code}
 This uses two files {{a.html}} and {{b.html}} that link to each other.
 In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
 In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
 This should update the score and link fields of {{a.html}}, but not the fetch 
 status. However, when I run {{bin/nutch readdb -crawlId test -url 
 http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
 {{status: 1 (status_unfetched)}}.
 Expected would be {{status: 2 (status_fetched)}}.
 The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
 processed in the same batch always belong to new 
 pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
 job, but that change skipped all pages with a different batch ID, so I assume 
 that this introduced this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296619#comment-14296619
 ] 

Julien Nioche commented on NUTCH-1889:
--

This one is quite trivial, I'd like to see it in 1.10
Will commit soon unless someone objects

 Store all values from Tika metadata in Nutch metadata
 -

 Key: NUTCH-1889
 URL: https://issues.apache.org/jira/browse/NUTCH-1889
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Julien Nioche
Priority: Trivial
 Fix For: 1.10

 Attachments: NUTCH-1889.patch


 Tika metadata can be multivalued but we currently keep only the first value 
 in the TikaParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1889:
-
Fix Version/s: (was: 1.11)
   1.10

 Store all values from Tika metadata in Nutch metadata
 -

 Key: NUTCH-1889
 URL: https://issues.apache.org/jira/browse/NUTCH-1889
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Julien Nioche
Priority: Trivial
 Fix For: 1.10

 Attachments: NUTCH-1889.patch


 Tika metadata can be multivalued but we currently keep only the first value 
 in the TikaParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1918:
-
Fix Version/s: (was: 1.11)
   1.10

 TikaParser specifies a default namespace when generating DOM
 

 Key: NUTCH-1918
 URL: https://issues.apache.org/jira/browse/NUTCH-1918
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
 Fix For: 1.10

 Attachments: NUTCH-1918.patch


 The DOM generated by parse-tika differs from the one done by parse-html. 
 Ideally we should be able to use either parsers with the same XPath 
 expressions.
 This is related to [NUTCH-1592], but this time instead of being a matter of 
 uppercases, the problem comes from the namespace used. 
 This issue has been investigated and fixed in storm-crawler 
 [https://github.com/DigitalPebble/storm-crawler/pull/58].
 Here is what Guillaume explained there :
 bq. When parsing the content, Tika creates a properly formatted XHTML 
 document: all elements are created within the namespace XHTML.
 bq. However in XPath 1.0, there's no concept of default namespace so XPath 
 expressions such as //BODY doesn't match anything. To make this work we 
 should use //ns1:BODY and define a NamespaceContext which associates ns1 with 
 http://www.w3.org/1999/xhtml;
 bq. To keep the XPathExpressions simpler, I modified the DOMBuilder which is 
 our SaxHandler used to convert the SAX Events into a DOM tree to ignore a 
 default name space and the ParserBolt initializes it with the XHTML 
 namespace. This way //BODY matches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-01-29 Thread Michiel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296613#comment-14296613
 ] 

Michiel commented on NUTCH-1922:


I'm afraid my current knowledge of Nutch does not extend far enough to say 
anything useful about that. Perhaps one of the original commentators on 
NUTCH-1679 can be approached to make further improvements on this patch / issue.

 DbUpdater overwrites fetch status for URLs from previous batches, causes 
 repeated re-fetches
 

 Key: NUTCH-1922
 URL: https://issues.apache.org/jira/browse/NUTCH-1922
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Gerhard Gossen
 Fix For: 2.4

 Attachments: NUTCH-1922.patch


 When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
 resets the fetch status of that URL to {{unfetched}}. This makes this URL 
 available for a re-fetch, even if its crawl interval is not yet over.
 To reproduce, using version 2.3:
 {code}
 # Nutch configuration
 ant runtime
 cd runtime/local
 mkdir seeds
 echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
 bin/crawl seeds test 2
 {code}
 This uses two files {{a.html}} and {{b.html}} that link to each other.
 In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
 In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
 This should update the score and link fields of {{a.html}}, but not the fetch 
 status. However, when I run {{bin/nutch readdb -crawlId test -url 
 http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
 {{status: 1 (status_unfetched)}}.
 Expected would be {{status: 2 (status_fetched)}}.
 The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
 processed in the same batch always belong to new 
 pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
 job, but that change skipped all pages with a different batch ID, so I assume 
 that this introduced this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1746) OutOfMemoryError in Mappers

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1746:
-
Assignee: (was: Julien Nioche)

 OutOfMemoryError in Mappers
 ---

 Key: NUTCH-1746
 URL: https://issues.apache.org/jira/browse/NUTCH-1746
 Project: Nutch
  Issue Type: Bug
  Components: generator, injector
Affects Versions: 1.7
 Environment: Nutch running in local mode with 4M+ domains in 
 domain-urlfilter.txt
Reporter: Greg Padiasek
 Fix For: 1.11

 Attachments: Generator.patch, Injector.patch, ObjectCache.patch, 
 domain-urlfilter-aa, domain-urlfilter-ab, domain-urlfilter-ac


 Initially I found that the Generator was throwing an OutOfMemoryError no 
 matter how much RAM I allocated to the JVM. I fixed the problem by moving 
 URLFilters, URLNormalizers and ScoringFilters to the top-level class as 
 singletons and re-using them in all Generator mapper instances.
 Then I found the same problem in the Injector and applied an analogous fix.
 It now seems that this issue may be common to all Nutch Mapper 
 implementations.
 I was wondering whether it would be possible to integrate this kind of change 
 into the upstream code base and potentially update all vulnerable Mapper 
 classes.
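A minimal standalone sketch of the singleton approach described above (illustrative only; the class and method names are invented and this is not the attached patch): the expensive object, for example a filter chain built from a multi-million-entry domain file, is created once per JVM and shared by every mapper instance running in it.

{code:java}
import java.util.function.Supplier;

public final class SharedInstance<T> {
  private volatile T instance;

  /** Build the value on first use; every later call returns the same object. */
  public T get(Supplier<T> factory) {
    T result = instance;
    if (result == null) {
      synchronized (this) {
        result = instance;
        if (result == null) {
          instance = result = factory.get(); // created once, reused by all mappers in this JVM
        }
      }
    }
    return result;
  }
}
{code}

Each mapper's setup would then fetch its filters from a static SharedInstance field, e.g. FILTERS.get(() -> new URLFilters(conf)), instead of constructing them per task.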



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1687:
-
Assignee: (was: Julien Nioche)

 Pick queue in Round Robin
 -

 Key: NUTCH-1687
 URL: https://issues.apache.org/jira/browse/NUTCH-1687
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 1.11

 Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch


 Currently we choose the queue to pick a URL from starting at the beginning of the 
 queues list, so queues at the start of the list have a better chance of being 
 picked first. That can cause a long-tail problem: only a few queues, each holding 
 many URLs, remain available at the end.
 public synchronized FetchItem getFetchItem() {
   final Iterator<Map.Entry<String, FetchItemQueue>> it =
       queues.entrySet().iterator(); // <== always reset to find queue from start
   while (it.hasNext()) {
 
 I think it is better to pick queues in round robin: that reduces the time needed 
 to find an available queue, ensures every queue gets picked in turn, and if we use 
 TopN during the generator there is no long-tail queue at the end.
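A standalone sketch of the round-robin idea (illustrative only; not the attached patch, and the class and field names are invented): remember where the previous scan stopped and start the next scan from the following queue, so the queues at the head of the list are not always drained first.

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinQueues {
  private final List<String> queueIds = new ArrayList<>();
  private final Map<String, Deque<String>> queues = new HashMap<>();
  private int nextQueue = 0; // where the next scan starts, instead of always index 0

  public synchronized void add(String queueId, String url) {
    Deque<String> q = queues.get(queueId);
    if (q == null) {
      q = new ArrayDeque<>();
      queues.put(queueId, q);
      queueIds.add(queueId);
    }
    q.add(url);
  }

  /** Return the next URL, scanning the queues round robin from the last position. */
  public synchronized String getFetchItem() {
    for (int i = 0; i < queueIds.size(); i++) {
      int idx = (nextQueue + i) % queueIds.size();
      Deque<String> q = queues.get(queueIds.get(idx));
      if (!q.isEmpty()) {
        nextQueue = (idx + 1) % queueIds.size(); // resume after this queue next time
        return q.poll();
      }
    }
    return null; // all queues are empty
  }
}
{code}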



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-01-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296624#comment-14296624
 ] 

Lewis John McGibbney commented on NUTCH-1922:
-

No problems [~Michiel], I thought I would ask seeing as you seem to be testing 
and validating this patch and possibly even debugging the code. Thanks for your 
input, it is most valuable.

 DbUpdater overwrites fetch status for URLs from previous batches, causes 
 repeated re-fetches
 

 Key: NUTCH-1922
 URL: https://issues.apache.org/jira/browse/NUTCH-1922
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Gerhard Gossen
 Fix For: 2.4

 Attachments: NUTCH-1922.patch


 When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
 resets the fetch status of that URL to {{unfetched}}. This makes this URL 
 available for a re-fetch, even if its crawl interval is not yet over.
 To reproduce, using version 2.3:
 {code}
 # Nutch configuration
 ant runtime
 cd runtime/local
 mkdir seeds
 echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
 bin/crawl seeds test 2
 {code}
 This uses two files {{a.html}} and {{b.html}} that link to each other.
 In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
 In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
 This should update the score and link fields of {{a.html}}, but not the fetch 
 status. However, when I run {{bin/nutch readdb -crawlId test -url 
 http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
 {{status: 1 (status_unfetched)}}.
 Expected would be {{status: 2 (status_fetched)}}.
 The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
 processed in the same batch always belong to new 
 pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
 job, but that change skipped all pages with a different batch ID, so I assume 
 that this introduced this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1477:
-
Assignee: (was: Julien Nioche)

 NPE when injecting with DataFileAvroStore
 -

 Key: NUTCH-1477
 URL: https://issues.apache.org/jira/browse/NUTCH-1477
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
 Environment: Java 1.6.0_35
Reporter: Mike Baranczak
Priority: Critical
 Fix For: 2.4

 Attachments: NUTCH-1477.patch, gora-core-0.2.1.jar, webpage.avsc, 
 webpage.avsc, webpage.avsc, webpage.avsc


 Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. 
 Injection job throws NullPointerException, see below. No error when I switch 
 to MemStore.
 java.lang.NullPointerException
   at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133)
   at 
 org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
   at 
 org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
   at 
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
   at 
 org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89)
   at 
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
   at 
 org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55)
   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)
   at 
 org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54)
   at 
 org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
   at 
 org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
   at 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at 
 org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185)
   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:

Assignee: (was: Julien Nioche)

 Port tests from parse-html to parse-tika
 

 Key: NUTCH-840
 URL: https://issues.apache.org/jira/browse/NUTCH-840
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1, 1.6
Reporter: Julien Nioche
 Fix For: 2.4

 Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, 
 NUTCH-840v2.patch


 We don't have tests for HTML in parse-tika, so I'll copy them from the old 
 parse-html plugin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1197:
-
Assignee: (was: Julien Nioche)

 Add statically configured field values to solrindex-mapping.xml
 ---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
 Fix For: 1.11

 Attachments: NUTCH-1197.patch


 In some cases it's useful to be able to add to every document sent to Solr a 
 set of predefined fields with static values. This could be implemented on the 
 Solr side (with a custom UpdateRequestProcessor), but it may be less 
 cumbersome to add them on the Nutch side.
 Example: let's say I have several Nutch configurations all indexing to the 
 same Solr instance, and I want each of them to add its identifier as a field 
 in all documents, e.g. origin=web_crawl_1, origin=file_crawl, 
 origin=unlimited_crawl, etc...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1267:
-
Assignee: (was: Julien Nioche)

 urlmeta to delegate indexing to index-metadata
 --

 Key: NUTCH-1267
 URL: https://issues.apache.org/jira/browse/NUTCH-1267
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 1.6
Reporter: Julien Nioche
 Fix For: 1.11


 Ideally we should get rid of urlmeta altogether and add the transmission of 
 the meta to the outlinks in the core classes - not as a plugin. URLMeta is 
 also a terrible name :-(



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1269) Improve distribution of URLS with multi-segment generation

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1269:
-
Assignee: (was: Julien Nioche)

 Improve distribution of URLS with multi-segment generation
 --

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: Generate, MaxHostCount, MaxNumSegments
 Fix For: 1.11

 Attachments: NUTCH-1269-v.2.patch, NUTCH-1269.patch


 There are some problems with the current generate method when the maxNumSegments and 
 maxHostCount options are used:
 1. the sizes of the generated segments differ
 2. with the maxHostCount option, it is unclear whether it was applied or not
 3. URLs from one host are distributed non-uniformly between segments
 We change Generator.java as described below, in the Selector class:
private int maxNumSegments;
private int segmentSize;
private int maxHostCount;
public void config
...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
...
public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1) {
        currentsegmentnum++;
      } else {
        currentsegmentnum = 0;
      }
    }
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    // logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }

      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }

      hostordomain = hostordomain.toLowerCase();
      boolean countLimit = true;
      // only filter if we are counting hosts or domains
      int[] hostCount = hostCounts.get(hostordomain);
      // host count {a,b,c,d} means that from this host there are a urls
      // in segment 0, b urls in segment 1, and so on
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      if (hostCount[selectedSeg] <= maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }
    } catch (Exception e) {
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
      logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
      // continue;
    }
  }
}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1815) Metadata Parsed with parse-tika is Duplicated

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1815:
-
Assignee: (was: Julien Nioche)

 Metadata Parsed with parse-tika is Duplicated
 -

 Key: NUTCH-1815
 URL: https://issues.apache.org/jira/browse/NUTCH-1815
 Project: Nutch
  Issue Type: Bug
  Components: indexer, parser
Affects Versions: 1.8
Reporter: Jonathan Cooper-Ellis
Priority: Minor
 Fix For: 1.11

 Attachments: NUTCH-1815-1.9.patch


 When Nutch is configured to parse metatags and index metadata from HTML 
 documents, disabling parse-html (and using parse-tika instead) causes each 
 metadata field to be indexed twice with identical content.
 I only modified plugin.includes (description and keywords metatags are 
 included in nutch-site.xml by default, so I did not modify those):
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>...</description>
 </property>
 Sample output:
 $ bin/nutch indexchecker 
 http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
 fetching: 
 http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
 parsing: 
 http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
 contentType: text/html
 content : Commonwealth Fund survey: Obamacare helped 9.5 million 
 Americans get health insurance, thanks to exc
 title :   Commonwealth Fund survey: Obamacare helped 9.5 million 
 Americans get health insurance, thanks to exc
 host :www.bizjournals.com
 tstamp :  Thu Jul 10 17:34:56 UTC 2014
 metatag.description : A new survey by the Commonwealth Fund found that 9.5 
 million previously uninsured Americans got cove
 metatag.description : A new survey by the Commonwealth Fund found that 9.5 
 million previously uninsured Americans got cove
 url : 
 http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
 In this case, metatag.description appears twice. If parse-html is added back 
 to plugin.includes and the same command is run, metatag.description will only 
 appear once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] Release Apache Nutch 1.10

2015-01-29 Thread Lewis John Mcgibbney
Hi Folks,
So I've moved all remaining issues to Nutch 1.11
https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel
Not sure if this is what we want the next trunk versioning to look like,
however it is an OK placeholder for the time being I hope.
If folks would like to see some particular patch make it in to trunk before
1.10 then by all means please re-assign it to 1.10 and we can review and
get them in.
Thanks very much folks.
Lewis

On Wed, Jan 28, 2015 at 9:41 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,


 https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel

 52 of 211 issues assigned against 1.10 looks pretty good to me and I would
 be +1 for pushing a release.
 Does anyone want to get any tickets in there? Does anyone have objections
 to releasing?
 Thanks
 Lewis

 --
 *Lewis*




-- 
*Lewis*


[jira] [Updated] (NUTCH-477) Extend URLFilters to support different filtering chains

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-477:

Assignee: (was: Julien Nioche)

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.11

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
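An illustrative sketch of what such an extension could look like (assumptions only; this is not the attached urlfilters.patch, and the interface name and return codes are invented): each filter receives the scope it runs in and returns an int code so a chain can terminate early.

{code:java}
public interface ScopedURLFilter {
  int PASS = 0;    // keep the URL and continue with the next filter
  int REJECT = 1;  // drop the URL and stop the chain
  int ACCEPT = 2;  // keep the URL and skip the remaining filters

  /** Filter a URL in a given context, e.g. "inject", "fetch" or "outlink". */
  int filter(String url, String scope);
}

class ScopedURLFilterChain {
  private final ScopedURLFilter[] filters;

  ScopedURLFilterChain(ScopedURLFilter... filters) {
    this.filters = filters;
  }

  /** Returns the URL if it survives the chain, or null if some filter rejected it. */
  String filter(String url, String scope) {
    for (ScopedURLFilter f : filters) {
      int code = f.filter(url, scope);
      if (code == ScopedURLFilter.REJECT) return null; // early termination
      if (code == ScopedURLFilter.ACCEPT) return url;  // early acceptance
    }
    return url;
  }
}
{code}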



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Option to disable Robots Rule checking

2015-01-29 Thread Lewis John Mcgibbney
Hi,

On Thu, Jan 29, 2015 at 12:49 AM, dev-digest-h...@nutch.apache.org wrote:


 Ok, these are valid use cases. They have in common that
 the Nutch user owns the crawled servers or is (hopefully)
 explicitly allowed to perform the security research.


Another example would be a backend storage migration for all crawl data for
one or more DNS. I've done migrations for clients before and being able to
override robots.txt in order to get this done in a timely fashion has been
mutually beneficial. So you are absolutely right here Seb :)




 What about an option (or config file) to exclude explicitly
 a list of hosts (or IPs) from robots.txt parsing?


Like a whitelist. Say I know the IP(s) or hosts I want to override
robots.txt for (these can be easily obtained by turning on the store.ip.address
property); then I could write the hosts and IPs to a flat file, for which
robots.txt would then be overridden. Is this what you are suggesting?


 That would require more effort to configure than a boolean property
 but because it's explicit, it prevents users from disabling
 robots.txt in general and also guarantees that
 the security research is not accidentally extended


And possibly this would be activated by a boolean property, e.g.
use.robots.override.whitelist? In all honesty Sebb, I think that this
sounds like a better compromise; as you said, it is explicit. It is still pretty
easy to configure right enough. All you need to do is use parsechecker for
example, log the IP, add it to the new configuration file, then override. It
seems good.
It actually reminds me of one of the very first patches I tried to take
on... which is still open OMFG
https://issues.apache.org/jira/browse/NUTCH-208
I need to sort this patch out and commit... 4 years is a terrible duration
of time to have left that one hanging!


[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment

2015-01-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-685:

Assignee: (was: Julien Nioche)

 Content-level redirect status lost in ParseSegment
 --

 Key: NUTCH-685
 URL: https://issues.apache.org/jira/browse/NUTCH-685
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
 Fix For: 1.11


 When Fetcher runs in parsing mode, content-level redirects (HTML meta tag 
 Refresh) are properly discovered and recorded in crawl_fetch under source 
 URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is 
 run as a separate step, the content-level redirection data is used only to 
 add the new (target) URL, but the status of the original URL is not reset to 
 indicate a redirect. Consequently, status of the original URL will be 
 different depending on the way you run Fetcher, whereas it should be the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-01-29 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1922:

Fix Version/s: 2.4

 DbUpdater overwrites fetch status for URLs from previous batches, causes 
 repeated re-fetches
 

 Key: NUTCH-1922
 URL: https://issues.apache.org/jira/browse/NUTCH-1922
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: Gerhard Gossen
 Fix For: 2.4

 Attachments: NUTCH-1922.patch


 When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
 resets the fetch status of that URL to {{unfetched}}. This makes this URL 
 available for a re-fetch, even if its crawl interval is not yet over.
 To reproduce, using version 2.3:
 {code}
 # Nutch configuration
 ant runtime
 cd runtime/local
 mkdir seeds
 echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
 bin/crawl seeds test 2
 {code}
 This uses two files {{a.html}} and {{b.html}} that link to each other.
 In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
 In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
 This should update the score and link fields of {{a.html}}, but not the fetch 
 status. However, when I run {{bin/nutch readdb -crawlId test -url 
 http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
 {{status: 1 (status_unfetched)}}.
 Expected would be {{status: 2 (status_fetched)}}.
 The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
 processed in the same batch always belong to new 
 pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
 job, but that change skipped all pages with a different batch ID, so I assume 
 that this introduced this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


RE: Option to disable Robots Rule checking

2015-01-29 Thread Markus Jelsma
I am happy with this alternative! :)
 
-Original message-
 From:Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov
 Sent: Thursday 29th January 2015 1:21
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Option to disable Robots Rule checking
 
 Seb, I like this idea. What do you think, Lewis and Markus? Thanks, that would 
 help me and my use case.
 
 Sent from my iPhone
 
  On Jan 28, 2015, at 3:17 PM, Sebastian Nagel wastl.na...@googlemail.com 
  wrote:
  
  Hi Markus, hi Chris, hi Lewis,
  
  -1 from me
  
  A well-documented property is just an invitation to
  disable robots rules. A hidden property is also no
  alternative because it will be soon documented
  in our mailing lists or somewhere on the web.
  
  And shall we really remove or reformulate
  Our software obeys the robots.txt exclusion standard
  on http://nutch.apache.org/bot.html ?
  
  Since the agent string sent in the HTTP request always contains /Nutch-x.x
  (it would require also a patch to change it) I wouldn't make
  it too easy to make Nutch ignore robots.txt.
  
  As you already stated too, we have properties in Nutch that can
  turn Nutch into a DDOS crawler with or without robots.txt rule
  parsing. We set these properties to *sensible defaults*.
  
  If the robots.txt is obeyed web masters can even prevent this
  by adding a Crawl-delay rule to their robots.txt.
  
  (from Chris):
  but there are also good [security research] uses of as well
  
  (from Lewis):
  I've met many web admins recently that want to search and index
  their entire DNS but do not wish to disable their robots.txt filter
  in order to do so.
  
  Ok, these are valid use cases. They have in common that
  the Nutch user owns the crawled servers or is (hopefully)
  explicitly allowed to perform the security research.
  
  
  What about an option (or config file) to exclude explicitly
  a list of hosts (or IPs) from robots.txt parsing?
  That would require more effort to configure than a boolean property
  but because it's explicit, it prevents users from disabling
  robots.txt in general and also guarantees that
  the security research is not accidentally extended.
  
  
  Cheers,
  Sebastian
  
  
  On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote:
  Hi Markus,
  
  Thanks for chiming in. I’m reading the below and I see you
  agree that it should be configurable, but you state that
  because Nutch is an Apache project, you dismiss the configuration
  option. What about it being an Apache project makes it any less
  ethical to simply have a configurable option that is turned off
  by default that allows the Robot rules to be disabled?
  
  For full disclosure I am looking into re-creating DDOS and other
  attacks doing some security research and so I have valid use cases
  here for wanting to do so. You state it’s easy to patch Nutch (you
  are correct for that matter, it’s a 2 line patch to Fetcher.java
  to disable the RobotRules check). However, how is it any less easy
  to have a 1 line patch that someone would have to apply to *override*
  the *default* behavior I’m suggesting of RobotRules being on in
  nutch-default.xml? So what I’m stating literally in code is:
  
  1. adding a property like nutch.robots.rules.parser and setting its
  default value to true, which enables the robot rules parser, putting
  this property say even at the bottom of nutch-default.xml and
  stating that improper use of this property in regular situations of
  whole web crawls can really hurt your crawling of a site.
  
  2. Having a check in Fetcher.java that checks for this property, if it’s
  on, default behavior, if it’s off, skips the check.
  
  The benefit being you don’t encourage people like me (and lots of
  others that I’ve talked to) who would like to use Nutch for some
  security research for crawling to simply go fork it for a 1 line code
  change. Really? Is that what you want to encourage? The really negative
  part about that is that it will encourage me to simply use that forked
  version. I could maintain a patch file, and apply that, but it’s going
  to fall out of date with updates to Nutch, I’m going to have to update that
  patch file if nutch-default.xml changes (and so will other people, etc.)
  
  As you already stated too, we have properties in Nutch that can
  turn Nutch into a DDOS crawler with or without robots.txt rule
  parsing. We set these properties to *sensible defaults*. I’m proposing
  a compromise that helps people like me; encourages me to keep using
  Nutch through simplification; and is no less worse that the few other
  properties that we already expose in Nutch configuration to allow it
  to be turned into a DDOS bot (which by the way, there are bad uses of,
  but there are also good [security research] uses of as well, to prevent
  the bad guys).
  
  I appreciate it if you made it this far and hope you will reconsider.
  
  Cheers,
  Chris
  
  

Re: Blog topic: Maxmind's GeoIP2 API being used in Apache Nutch 1.10

2015-01-29 Thread Lewis John Mcgibbney
Hi Susan,
Just acknowledging this email. I will write this up during my lunch hour
today.
Thanks
lewis

On Thu, Jan 29, 2015 at 6:36 AM, Susan Fendrock sfendr...@maxmind.com
wrote:

 Hello Lewis!

 Thanks for getting in touch with us about potentially providing a
 contribution to our blog.

 Could you provide a brief summary of the blog post you are envisioning?

 Look forward to learning more about your project,

 Susan


 --
 Susan Fendrock
 Product Marketing
 MaxMind, Inc.

 617-500-4493 ext. 820




-- 
*Lewis*


[jira] [Assigned] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-01-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-1927:


Assignee: Chris A. Mattmann

 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: Bug
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann

 Based on discussion on the dev list, to use Nutch for some security research 
 valid use cases (DDoS; DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}
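A hypothetical sketch of how a fetcher could consult such a whitelist before applying robots.txt rules (the property value format is taken from the issue; the class and method names are invented and this is not the actual patch):

{code:java}
import java.util.HashSet;
import java.util.Set;

public class RobotsWhitelist {
  private final Set<String> whitelisted = new HashSet<>();

  /** Parse the comma separated robot.rules.whitelist property value. */
  public RobotsWhitelist(String commaSeparated) {
    if (commaSeparated != null) {
      for (String entry : commaSeparated.split(",")) {
        String trimmed = entry.trim().toLowerCase();
        if (!trimmed.isEmpty()) {
          whitelisted.add(trimmed);
        }
      }
    }
  }

  /** True if robots.txt parsing should be skipped for this host name or IP address. */
  public boolean isWhitelisted(String hostOrIp) {
    return hostOrIp != null && whitelisted.contains(hostOrIp.toLowerCase());
  }

  public static void main(String[] args) {
    RobotsWhitelist wl =
        new RobotsWhitelist("132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov");
    System.out.println(wl.isWhitelisted("hostname.apache.org")); // true
    System.out.println(wl.isWhitelisted("example.com"));         // false
  }
}
{code}

The fetcher would then check isWhitelisted(host) (or the stored IP) before fetching and applying robots.txt, and treat whitelisted hosts as if no rules were defined.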



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-01-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1927 started by Chris A. Mattmann.

 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: Bug
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann

 Based on discussion on the dev list, to use Nutch for some security research 
 valid use cases (DDoS; DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

2015-01-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297701#comment-14297701
 ] 

Sebastian Nagel commented on NUTCH-1918:


+1

 TikaParser specifies a default namespace when generating DOM
 

 Key: NUTCH-1918
 URL: https://issues.apache.org/jira/browse/NUTCH-1918
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
 Fix For: 1.10

 Attachments: NUTCH-1918.patch


 The DOM generated by parse-tika differs from the one done by parse-html. 
 Ideally we should be able to use either parsers with the same XPath 
 expressions.
 This is related to [NUTCH-1592], but this time instead of being a matter of 
 uppercases, the problem comes from the namespace used. 
 This issue has been investigated and fixed in storm-crawler 
 [https://github.com/DigitalPebble/storm-crawler/pull/58].
 Here is what Guillaume explained there :
 bq. When parsing the content, Tika creates a properly formatted XHTML 
 document: all elements are created within the namespace XHTML.
 bq. However in XPath 1.0, there's no concept of default namespace so XPath 
 expressions such as //BODY doesn't match anything. To make this work we 
 should use //ns1:BODY and define a NamespaceContext which associates ns1 with 
 http://www.w3.org/1999/xhtml;
 bq. To keep the XPathExpressions simpler, I modified the DOMBuilder which is 
 our SaxHandler used to convert the SAX Events into a DOM tree to ignore a 
 default name space and the ParserBolt initializes it with the XHTML 
 namespace. This way //BODY matches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-01-29 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created NUTCH-1927:


 Summary: Create a whitelist of IPs/hostnames to allow skipping of 
RobotRules parsing
 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: Bug
Reporter: Chris A. Mattmann


Based on discussion on the dev list, to use Nutch for some security research 
valid use cases (DDoS; DNS and other testing), I am going to create a patch 
that allows a whitelist:

{code:xml}
<property>
  <name>robot.rules.whitelist</name>
  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
robot rules parsing for.
  </description>
</property>
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work stopped] (NUTCH-1924) Nutch + HBase Docker

2015-01-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1924 stopped by Radosław Stankiewicz.
---
 Nutch + HBase Docker
 

 Key: NUTCH-1924
 URL: https://issues.apache.org/jira/browse/NUTCH-1924
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Reporter: Lewis John McGibbney
Assignee: Radosław Stankiewicz
 Fix For: 2.4


 ZooKeeper 3.4.5 Hadoop 0.20.204 HBase 0.90.4 Nutch 2.2.1
 https://registry.hub.docker.com/u/stankiewicz/hbase_hadoop_nutch/dockerfile/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1924) Nutch + HBase Docker

2015-01-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297888#comment-14297888
 ] 

Lewis John McGibbney commented on NUTCH-1924:
-

hi [~rrydziu] Nutch currently uses SVN for SCM.
You can checkout the source 
[here|http://svn.apache.org/repos/asf/nutch/branches/2.x/]
You can merely add your contribution there and then attach a patch here

{code}
svn co http://svn.apache.org/repos/asf/nutch/branches/2.x/
cd 2.x
cp -r /path/to/docker_container 2.x/docker/hbase
svn add 2.x/docker/hbase
svn diff > NUTCH-1924.patch
{code}

Great work so far. We are getting close to a 0.6 release of Gora which would 
mean that we can upgrade HBase to 0.98.X version.

 Nutch + HBase Docker
 

 Key: NUTCH-1924
 URL: https://issues.apache.org/jira/browse/NUTCH-1924
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Reporter: Lewis John McGibbney
Assignee: Radosław Stankiewicz
 Fix For: 2.4


 ZooKeeper 3.4.5 Hadoop 0.20.204 HBase 0.90.4 Nutch 2.2.1
 https://registry.hub.docker.com/u/stankiewicz/hbase_hadoop_nutch/dockerfile/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)