[jira] [Commented] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297097#comment-14297097 ] Lewis John McGibbney commented on NUTCH-1889: - +1 Store all values from Tika metadata in Nutch metadata - Key: NUTCH-1889 URL: https://issues.apache.org/jira/browse/NUTCH-1889 Project: Nutch Issue Type: Improvement Components: parser Reporter: Julien Nioche Priority: Trivial Fix For: 1.10 Attachments: NUTCH-1889.patch Tika metadata can be multivalued but we currently keep only the first value in the TikaParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
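The change is essentially a two-loop copy. A minimal sketch, assuming a Tika {{Metadata}} instance named {{tikaMeta}} and a Nutch {{Metadata}} instance named {{nutchMeta}} (the names are illustrative, not necessarily those used in NUTCH-1889.patch):
{code}
// Copy every value of each (possibly multi-valued) Tika field into the
// Nutch metadata; tikaMeta.get(name) would return only the first value.
for (String name : tikaMeta.names()) {
  for (String value : tikaMeta.getValues(name)) {
    nutchMeta.add(name, value);
  }
}
{code}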
Re: [DISCUSS] Release Apache Nutch 1.10
Thanks Lewis. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, January 29, 2015 at 12:09 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: [DISCUSS] Release Apache Nutch 1.10 Hi Folks, So I've moved all remaining issues to Nutch 1.11 https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel Not sure if this is what we want the next trunk versioning to look like, however it is an OK placeholder for the time being I hope. If folks would like to see some particular patch make it into trunk before 1.10 then by all means please re-assign it to 1.10 and we can review and get them in. Thanks very much folks. Lewis On Wed, Jan 28, 2015 at 9:41 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel 52 of 211 issues assigned against 1.10 looks pretty good to me and I would be +1 for pushing a release. Does anyone want to get any tickets in there? Does anyone have objections to releasing? Thanks Lewis -- Lewis -- Lewis
Re: Option to disable Robots Rule checking
Yay! OK, I will go ahead and start work on it. Thank you all! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Markus Jelsma markus.jel...@openindex.io Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, January 29, 2015 at 3:35 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: RE: Option to disable Robots Rule checking I am happy with this alternative! :) -Original message- From:Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov Sent: Thursday 29th January 2015 1:21 To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Option to disable Robots Rule checking Seb I like this idea what do you think Lewis and Markus. Thanks that would help me and my use case Sent from my iPhone On Jan 28, 2015, at 3:17 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Markus, hi Chris, hi Lewis, -1 from me A well-documented property is just an invitation to disable robots rules. A hidden property is also no alternative because it will soon be documented in our mailing lists or somewhere on the web. And shall we really remove or reformulate "Our software obeys the robots.txt exclusion standard" on http://nutch.apache.org/bot.html ? Since the agent string sent in the HTTP request always contains /Nutch-x.x (it would also require a patch to change it) I wouldn't make it too easy to make Nutch ignore robots.txt. As you already stated too, we have properties in Nutch that can turn Nutch into a DDOS crawler with or without robots.txt rule parsing. We set these properties to *sensible defaults*. If the robots.txt is obeyed, web masters can even prevent this by adding a Crawl-delay rule to their robots.txt. (from Chris): but there are also good [security research] uses of as well (from Lewis): I've met many web admins recently that want to search and index their entire DNS but do not wish to disable their robots.txt filter in order to do so. Ok, these are valid use cases. They have in common that the Nutch user owns the crawled servers or is (hopefully) explicitly allowed to perform the security research. What about an option (or config file) to exclude explicitly a list of hosts (or IPs) from robots.txt parsing? That would require more effort to configure than a boolean property but because it's explicit, it prevents users from disabling robots.txt in general and also guarantees that the security research is not accidentally extended. Cheers, Sebastian On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote: Hi Markus, Thanks for chiming in. I'm reading the below and I see you agree that it should be configurable, but you state that because Nutch is an Apache project, you dismiss the configuration option. What about it being an Apache project makes it any less ethical to simply have a configurable option that is turned off by default that allows the Robot rules to be disabled? For full disclosure I am looking into re-creating DDOS and other attacks doing some security research and so I have valid use cases here for wanting to do so. You state it's easy to patch Nutch (you are correct for that matter, it's a 2 line patch to Fetcher.java to disable the RobotRules check).
However, how is it any less easy to have a 1 line patch that someone would have to apply to *override* the *default* behavior I'm suggesting of RobotRules being on in nutch-default.xml? So what I'm stating literally in code is: 1. adding a property like nutch.robots.rules.parser and setting its default value to true, which enables the robot rules parser, putting this property say even at the bottom of nutch-default.xml and stating that improper use of this property in regular situations of whole web crawls can really hurt your crawling of a site. 2. Having a check in Fetcher.java that checks for this property; if it's on, default behavior; if it's off, skip the check. The benefit being you don't encourage people like me (and lots of others that I've talked to) who would like to use Nutch for some security research for crawling to simply go fork it for a 1 line code change. Really? Is that what you want to encourage? The really negative part about that is that it will encourage me to simply use that forked version. I could maintain a patch file, and apply that, but
[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM
[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296621#comment-14296621 ] Julien Nioche commented on NUTCH-1918: -- Quite an important issue for those who extract data with XPath and Tika, so I'd like to see it in 1.10 Will commit soon unless someone objects TikaParser specifies a default namespace when generating DOM Key: NUTCH-1918 URL: https://issues.apache.org/jira/browse/NUTCH-1918 Project: Nutch Issue Type: Bug Components: parser Reporter: Julien Nioche Fix For: 1.10 Attachments: NUTCH-1918.patch The DOM generated by parse-tika differs from the one produced by parse-html. Ideally we should be able to use either parser with the same XPath expressions. This is related to [NUTCH-1592], but this time instead of being a matter of uppercase, the problem comes from the namespace used. This issue has been investigated and fixed in storm-crawler [https://github.com/DigitalPebble/storm-crawler/pull/58]. Here is what Guillaume explained there: bq. When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the XHTML namespace. bq. However in XPath 1.0, there's no concept of default namespace, so XPath expressions such as //BODY don't match anything. To make this work we should use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml". bq. To keep the XPath expressions simpler, I modified the DOMBuilder, which is our SAX handler used to convert the SAX events into a DOM tree, to ignore a default namespace; the ParserBolt initializes it with the XHTML namespace. This way //BODY matches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
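To see the limitation Guillaume describes in plain JAXP, the prefix-based workaround looks roughly like this (a sketch independent of the attached patch, which instead strips the default namespace in DOMBuilder):
{code}
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

// XPath 1.0 has no default namespace, so "//BODY" matches nothing in a
// document whose elements live in the XHTML namespace. Binding a prefix
// to that namespace makes "//ns1:BODY" match.
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext() {
  public String getNamespaceURI(String prefix) {
    return "ns1".equals(prefix) ? "http://www.w3.org/1999/xhtml"
                                : XMLConstants.NULL_NS_URI;
  }
  public String getPrefix(String namespaceURI) { return null; }
  public Iterator<String> getPrefixes(String namespaceURI) {
    return Collections.<String>emptyList().iterator();
  }
});
XPathExpression body = xpath.compile("//ns1:BODY"); // "//BODY" alone would not match
{code}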
[jira] [Updated] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michiel updated NUTCH-1922: --- Attachment: NUTCH-1922.patch Patch from NUTCH-1679 adapted for implementation into 2.3 DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches Key: NUTCH-1922 URL: https://issues.apache.org/jira/browse/NUTCH-1922 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: Gerhard Gossen Attachments: NUTCH-1922.patch When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even if its crawl interval is not yet over. To reproduce, using version 2.3:
{code}
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
{code}
This uses two files {{a.html}} and {{b.html}} that link to each other. In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 1 (status_unfetched)}}. Expected would be {{status: 2 (status_fetched)}}. The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed in the same batch always belong to new pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109]. Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate job, but that change skipped all pages with a different batch ID, so I assume that this introduced this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
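For orientation, the shape of the fix is roughly the following hedged sketch of the reducer logic (illustrative only, not the attached NUTCH-1922.patch; the names follow Nutch 2.x conventions but are assumptions):
{code}
// Only initialize a row as unfetched when no row existed before; a page
// fetched in an earlier batch keeps its status, and only its score and
// inlink fields are updated.
if (page == null) {
  // genuinely new page, discovered through an inlink
  page = WebPage.newBuilder().build();
  schedule.initializeSchedule(url, page);
  page.setStatus((int) CrawlStatus.STATUS_UNFETCHED);
} else {
  // existing row from a previous batch: do not reset page.setStatus(...)
}
{code}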
[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296602#comment-14296602 ] Lewis John McGibbney commented on NUTCH-1922: - [~Michiel] what are your thoughts on the [last comment|https://issues.apache.org/jira/browse/NUTCH-1679?focusedCommentId=14069567&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14069567] made by [~alxksn] on NUTCH-1679? DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches Key: NUTCH-1922 URL: https://issues.apache.org/jira/browse/NUTCH-1922 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: Gerhard Gossen Fix For: 2.4 Attachments: NUTCH-1922.patch When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even if its crawl interval is not yet over. To reproduce, using version 2.3:
{code}
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
{code}
This uses two files {{a.html}} and {{b.html}} that link to each other. In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 1 (status_unfetched)}}. Expected would be {{status: 2 (status_fetched)}}. The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed in the same batch always belong to new pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109]. Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate job, but that change skipped all pages with a different batch ID, so I assume that this introduced this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296619#comment-14296619 ] Julien Nioche commented on NUTCH-1889: -- This one is quite trivial, I'd like to see it in 1.10 Will commit soon unless someone objects Store all values from Tika metadata in Nutch metadata - Key: NUTCH-1889 URL: https://issues.apache.org/jira/browse/NUTCH-1889 Project: Nutch Issue Type: Improvement Components: parser Reporter: Julien Nioche Priority: Trivial Fix For: 1.10 Attachments: NUTCH-1889.patch Tika metadata can be multivalued but we currently keep only the first value in the TikaParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1889: - Fix Version/s: (was: 1.11) 1.10 Store all values from Tika metadata in Nutch metadata - Key: NUTCH-1889 URL: https://issues.apache.org/jira/browse/NUTCH-1889 Project: Nutch Issue Type: Improvement Components: parser Reporter: Julien Nioche Priority: Trivial Fix For: 1.10 Attachments: NUTCH-1889.patch Tika metadata can be multivalued but we currently keep only the first value in the TikaParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM
[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1918: - Fix Version/s: (was: 1.11) 1.10 TikaParser specifies a default namespace when generating DOM Key: NUTCH-1918 URL: https://issues.apache.org/jira/browse/NUTCH-1918 Project: Nutch Issue Type: Bug Components: parser Reporter: Julien Nioche Fix For: 1.10 Attachments: NUTCH-1918.patch The DOM generated by parse-tika differs from the one produced by parse-html. Ideally we should be able to use either parser with the same XPath expressions. This is related to [NUTCH-1592], but this time instead of being a matter of uppercase, the problem comes from the namespace used. This issue has been investigated and fixed in storm-crawler [https://github.com/DigitalPebble/storm-crawler/pull/58]. Here is what Guillaume explained there: bq. When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the XHTML namespace. bq. However in XPath 1.0, there's no concept of default namespace, so XPath expressions such as //BODY don't match anything. To make this work we should use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml". bq. To keep the XPath expressions simpler, I modified the DOMBuilder, which is our SAX handler used to convert the SAX events into a DOM tree, to ignore a default namespace; the ParserBolt initializes it with the XHTML namespace. This way //BODY matches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296613#comment-14296613 ] Michiel commented on NUTCH-1922: I'm afraid my current knowledge of Nutch does not extend far enough to say anything useful about that. Perhaps one of the original commentators on NUTCH-1679 can be approached to make further improvements on this patch / issue. DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches Key: NUTCH-1922 URL: https://issues.apache.org/jira/browse/NUTCH-1922 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: Gerhard Gossen Fix For: 2.4 Attachments: NUTCH-1922.patch When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even if its crawl interval is not yet over. To reproduce, using version 2.3:
{code}
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
{code}
This uses two files {{a.html}} and {{b.html}} that link to each other. In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 1 (status_unfetched)}}. Expected would be {{status: 2 (status_fetched)}}. The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed in the same batch always belong to new pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109]. Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate job, but that change skipped all pages with a different batch ID, so I assume that this introduced this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1746) OutOfMemoryError in Mappers
[ https://issues.apache.org/jira/browse/NUTCH-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1746: - Assignee: (was: Julien Nioche) OutOfMemoryError in Mappers --- Key: NUTCH-1746 URL: https://issues.apache.org/jira/browse/NUTCH-1746 Project: Nutch Issue Type: Bug Components: generator, injector Affects Versions: 1.7 Environment: Nutch running in local mode with 4M+ domains in domain-urlfilter.txt Reporter: Greg Padiasek Fix For: 1.11 Attachments: Generator.patch, Injector.patch, ObjectCache.patch, domain-urlfilter-aa, domain-urlfilter-ab, domain-urlfilter-ac Initially I found that Generator was throwing an OutOfMemoryError no matter how much RAM I allocated to the JVM. I fixed the problem by moving URLFilters, URLNormalizers and ScoringFilters to the top-level class as singletons and re-using them in all Generator mapper instances. Then I found the same problem in Injector and applied an analogous fix. Now it seems that this issue may be common to all Nutch Mapper implementations. I was wondering if it would be possible to integrate this kind of change in the upstream code base and potentially update all vulnerable Mapper classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
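The singleton approach described above looks roughly like this (a sketch using the current MapReduce API for brevity; the class and field names are illustrative, and the attached Generator.patch/Injector.patch may differ in detail):
{code}
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.net.URLFilters;

// Share one URLFilters instance per JVM across all mapper instances,
// instead of rebuilding the filter chain (and re-reading a 4M+ entry
// domain-urlfilter.txt) for every mapper.
public class SelectorMapper extends Mapper<Text, CrawlDatum, FloatWritable, Text> {

  private static URLFilters filters;

  @Override
  protected void setup(Context context) {
    synchronized (SelectorMapper.class) {
      if (filters == null) {
        filters = new URLFilters(context.getConfiguration());
      }
    }
  }
}
{code}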
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1687: - Assignee: (was: Julien Nioche) Pick queue in Round Robin - Key: NUTCH-1687 URL: https://issues.apache.org/jira/browse/NUTCH-1687 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Tien Nguyen Manh Priority: Minor Fix For: 1.11 Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch Currently we choose the queue to pick a URL from starting at the head of the queues list, so queues at the start of the list have a greater chance of being picked first. That can cause a long-tail problem, where only a few queues, each holding many URLs, remain available at the end.
{code}
public synchronized FetchItem getFetchItem() {
  final Iterator<Map.Entry<String, FetchItemQueue>> it =
      queues.entrySet().iterator(); // <== always resets the scan to the start
  while (it.hasNext()) {
{code}
I think it is better to pick queues in round robin; that reduces the time needed to find an available queue, ensures every queue is picked in turn, and if we use topN during generation there is no long-tail queue at the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
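One possible round-robin variant, sketched under the assumption that the iteration order of {{queues}} is stable between calls (e.g. by making it a LinkedHashMap); the field name is illustrative and this is not necessarily what the attached patches do:
{code}
private Iterator<Map.Entry<String, FetchItemQueue>> cursor;

public synchronized FetchItem getFetchItem() {
  // Resume scanning where the previous call stopped instead of always
  // starting from the head, so head-of-list queues are not favoured.
  for (int scanned = 0; scanned < queues.size(); scanned++) {
    if (cursor == null || !cursor.hasNext()) {
      cursor = queues.entrySet().iterator(); // wrap around
    }
    FetchItem fit = cursor.next().getValue().getFetchItem();
    if (fit != null) {
      return fit;
    }
  }
  return null; // no queue currently has an eligible item
}
{code}
A real implementation would also need to invalidate the saved iterator whenever queues are added or removed, otherwise the next call would hit a ConcurrentModificationException.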
[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296624#comment-14296624 ] Lewis John McGibbney commented on NUTCH-1922: - No problems [~Michiel], I thought I would ask seeing as you seem to be testing and validating this patch and possibly even debugging the code. Thanks for your input, this is most valuable. DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches Key: NUTCH-1922 URL: https://issues.apache.org/jira/browse/NUTCH-1922 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: Gerhard Gossen Fix For: 2.4 Attachments: NUTCH-1922.patch When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even if its crawl interval is not yet over. To reproduce, using version 2.3:
{code}
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
{code}
This uses two files {{a.html}} and {{b.html}} that link to each other. In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 1 (status_unfetched)}}. Expected would be {{status: 2 (status_fetched)}}. The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed in the same batch always belong to new pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109]. Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate job, but that change skipped all pages with a different batch ID, so I assume that this introduced this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1477: - Assignee: (was: Julien Nioche) NPE when injecting with DataFileAvroStore - Key: NUTCH-1477 URL: https://issues.apache.org/jira/browse/NUTCH-1477 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.1 Environment: Java 1.6.0_35 Reporter: Mike Baranczak Priority: Critical Fix For: 2.4 Attachments: NUTCH-1477.patch, gora-core-0.2.1.jar, webpage.avsc, webpage.avsc, webpage.avsc, webpage.avsc Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. The injection job throws a NullPointerException, see below. No error when I switch to MemStore.
java.lang.NullPointerException
 at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133)
 at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
 at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
 at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
 at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89)
 at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
 at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55)
 at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)
 at org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54)
 at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
 at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
 at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
 at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185)
 at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-840: Assignee: (was: Julien Nioche) Port tests from parse-html to parse-tika Key: NUTCH-840 URL: https://issues.apache.org/jira/browse/NUTCH-840 Project: Nutch Issue Type: Task Components: parser Affects Versions: 1.1, 1.6 Reporter: Julien Nioche Fix For: 2.4 Attachments: NUTCH-840-trunk.patch, NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch We don't have tests for HTML in parse-tika so I'll copy them from the old parse-html plugin -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml
[ https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1197: - Assignee: (was: Julien Nioche) Add statically configured field values to solrindex-mapping.xml --- Key: NUTCH-1197 URL: https://issues.apache.org/jira/browse/NUTCH-1197 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Fix For: 1.11 Attachments: NUTCH-1197.patch In some cases it's useful to be able to add to every document sent to Solr a set of predefined fields with static values. This could be implemented on the Solr side (with a custom UpdateRequestProcessor), but it may be less cumbersome to add them on the Nutch side. Example: let's say I have several Nutch configurations all indexing to the same Solr instance, and I want each of them to add its identifier as a field in all documents, e.g. origin=web_crawl_1, origin=file_crawl, origin=unlimited_crawl, etc... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
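One hypothetical way to express this in solrindex-mapping.xml (the {{static}} element is invented for illustration; the file does not support fixed values today):
{code:xml}
<mapping>
  <fields>
    <field dest="title" source="title"/>
    <!-- hypothetical: a constant value added to every document -->
    <static dest="origin" value="web_crawl_1"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
{code}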
[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1267: - Assignee: (was: Julien Nioche) urlmeta to delegate indexing to index-metadata -- Key: NUTCH-1267 URL: https://issues.apache.org/jira/browse/NUTCH-1267 Project: Nutch Issue Type: Sub-task Components: indexer Affects Versions: 1.6 Reporter: Julien Nioche Fix For: 1.11 Ideally we should get rid of urlmeta altogether and add the transmission of the meta to the outlinks in the core classes - not as a plugin. URLMeta is also a terrible name :-( -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1269) Improve distribution of URLS with multi-segment generation
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1269: - Assignee: (was: Julien Nioche) Improve distribution of URLS with multi-segment generation -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht Labels: Generate, MaxHostCount, MaxNumSegments Fix For: 1.11 Attachments: NUTCH-1269-v.2.patch, NUTCH-1269.patch There are some problems with the current Generate method when the maxNumSegments and maxHostCount options are used: 1. the sizes of the generated segments differ; 2. with the maxHostCount option, it is unclear whether it was applied or not; 3. URLs from one host are distributed non-uniformly between segments. We changed Generator.java as described below, in the Selector class:
{code}
private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void config ...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
...

public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1) {
        currentsegmentnum++;
      } else
        currentsegmentnum = 0;
    }
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    // logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
      hostordomain = hostordomain.toLowerCase();
      boolean countLimit = true;
      // only filter if we are counting hosts or domains
      // hostCount {a,b,c,d} means this host has a urls in segment 0,
      // b urls in segment 1, and so on
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }
      // pick the segment holding the fewest urls from this host so far
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      if (hostCount[selectedSeg] <= maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }
    } catch (Exception e) {
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
      logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
      // continue;
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1815) Metadata Parsed with parse-tika is Duplicated
[ https://issues.apache.org/jira/browse/NUTCH-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1815: - Assignee: (was: Julien Nioche) Metadata Parsed with parse-tika is Duplicated - Key: NUTCH-1815 URL: https://issues.apache.org/jira/browse/NUTCH-1815 Project: Nutch Issue Type: Bug Components: indexer, parser Affects Versions: 1.8 Reporter: Jonathan Cooper-Ellis Priority: Minor Fix For: 1.11 Attachments: NUTCH-1815-1.9.patch When Nutch is configured to parse metatags and index metadata from HTML documents, disabling parse-html (and using parse-tika instead) causes each metadata field to be indexed twice with identical content. I only modified plugin.includes (description and keywords metatags are included in nutch-site.xml by default, so I did not modify those):
{code:xml}
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>
{code}
Sample output:
{code}
$ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
contentType: text/html
content : Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
title : Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
host : www.bizjournals.com
tstamp : Thu Jul 10 17:34:56 UTC 2014
metatag.description : A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
metatag.description : A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
url : http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
{code}
In this case, metatag.description appears twice. If parse-html is added back to plugin.includes and the same command is run, metatag.description will only appear once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [DISCUSS] Release Apache Nutch 1.10
Hi Folks, So I've moved all remaining issues to Nutch 1.11 https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel Not sure if this is what we want the next trunk versioning to look like, however it is an OK placeholder for the time being I hope. If folks would like to see some particular patch make it in to trunk before 1.10 then by all means please re-assign it to 1.10 and we can review and get them in. Thanks very much folks. Lewis On Wed, Jan 28, 2015 at 9:41 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:roadmap-panel 52 of 211 issues assigned against 1.10 looks pretty good to me and I would be +1 for pushing a release. Does anyone want to get any tickets in there? Does anyone have objections to releasing? Thanks Lewis -- *Lewis* -- *Lewis*
[jira] [Updated] (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-477: Assignee: (was: Julien Nioche) Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.1 Reporter: Andrzej Bialecki Priority: Minor Fix For: 1.11 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Option to disable Robots Rule checking
Hi, On Thu, Jan 29, 2015 at 12:49 AM, dev-digest-h...@nutch.apache.org wrote: Ok, these are valid use cases. They have in common that the Nutch user owns the crawled servers or is (hopefully) explicitly allowed to perform the security research. Another example would be a backend storage migration for all crawl data for one or more DNS. I've done migrations for clients before and being able to override robots.txt in order to get this done in a timely fashion has been mutually beneficial. So you are absolutely right here Seb :) What about an option (or config file) to exclude explicitly a list of hosts (or IPs) from robots.txt parsing? Like a whitelist. Say if I know the IP(s) or hosts I want to override robots.txt for (this can be easily obtained by turning on the store.ip.address property) then I could write the hosts and IPs to a flat file for which robots.txt would then be overridden. Is this what you are suggesting? That would require more effort to configure than a boolean property but because it's explicit, it prevents users from disabling robots.txt in general and also guarantees that the security research is not accidentally extended And possibly this would be activated by a boolean property e.g. use.robots.override.whitelist? In all honesty Sebb, I think that this sounds a better compromise as you said, it is explicit. It is still pretty easy to configure right enough. All you need to do is use parsechecker for example, log the IP, add it to the new configuration file, then override. It seems good. It actually reminds me of one of the very first patches I tried to take on... which is still open OMFG https://issues.apache.org/jira/browse/NUTCH-208 I need to sort this patch out and commit... 4 years is a terrible duration of time to have left that one hanging!
[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment
[ https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-685: Assignee: (was: Julien Nioche) Content-level redirect status lost in ParseSegment -- Key: NUTCH-685 URL: https://issues.apache.org/jira/browse/NUTCH-685 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Fix For: 1.11 When Fetcher runs in parsing mode, content-level redirects (HTML meta tag Refresh) are properly discovered and recorded in crawl_fetch under source URL and target URL. If Fetcher runs in non-parsing mode, and ParseSegment is run as a separate step, the content-level redirection data is used only to add the new (target) URL, but the status of the original URL is not reset to indicate a redirect. Consequently, status of the original URL will be different depending on the way you run Fetcher, whereas it should be the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1922: Fix Version/s: 2.4 DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches Key: NUTCH-1922 URL: https://issues.apache.org/jira/browse/NUTCH-1922 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: Gerhard Gossen Fix For: 2.4 Attachments: NUTCH-1922.patch When Nutch 2 finds a link to a URL that was crawled in a previous batch, it resets the fetch status of that URL to {{unfetched}}. This makes this URL available for a re-fetch, even if its crawl interval is not yet over. To reproduce, using version 2.3:
{code}
# Nutch configuration
ant runtime
cd runtime/local
mkdir seeds
echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
bin/crawl seeds test 2
{code}
This uses two files {{a.html}} and {{b.html}} that link to each other. In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. This should update the score and link fields of {{a.html}}, but not the fetch status. However, when I run {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns {{status: 1 (status_unfetched)}}. Expected would be {{status: 2 (status_fetched)}}. The reason seems to be that DbUpdateReducer assumes that [links to a URL not processed in the same batch always belong to new pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109]. Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate job, but that change skipped all pages with a different batch ID, so I assume that this introduced this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
RE: Option to disable Robots Rule checking
I am happy with this alternative! :) -Original message- From:Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov Sent: Thursday 29th January 2015 1:21 To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Option to disable Robots Rule checking Seb I like this idea what do you think Lewis and Markus. Thanks that would help me and my use case Sent from my iPhone On Jan 28, 2015, at 3:17 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi Markus, hi Chris, hi Lewis, -1 from me A well-documented property is just an invitation to disable robots rules. A hidden property is also no alternative because it will soon be documented in our mailing lists or somewhere on the web. And shall we really remove or reformulate "Our software obeys the robots.txt exclusion standard" on http://nutch.apache.org/bot.html ? Since the agent string sent in the HTTP request always contains /Nutch-x.x (it would also require a patch to change it) I wouldn't make it too easy to make Nutch ignore robots.txt. As you already stated too, we have properties in Nutch that can turn Nutch into a DDOS crawler with or without robots.txt rule parsing. We set these properties to *sensible defaults*. If the robots.txt is obeyed, web masters can even prevent this by adding a Crawl-delay rule to their robots.txt. (from Chris): but there are also good [security research] uses of as well (from Lewis): I've met many web admins recently that want to search and index their entire DNS but do not wish to disable their robots.txt filter in order to do so. Ok, these are valid use cases. They have in common that the Nutch user owns the crawled servers or is (hopefully) explicitly allowed to perform the security research. What about an option (or config file) to exclude explicitly a list of hosts (or IPs) from robots.txt parsing? That would require more effort to configure than a boolean property but because it's explicit, it prevents users from disabling robots.txt in general and also guarantees that the security research is not accidentally extended. Cheers, Sebastian On 01/28/2015 01:54 AM, Mattmann, Chris A (3980) wrote: Hi Markus, Thanks for chiming in. I'm reading the below and I see you agree that it should be configurable, but you state that because Nutch is an Apache project, you dismiss the configuration option. What about it being an Apache project makes it any less ethical to simply have a configurable option that is turned off by default that allows the Robot rules to be disabled? For full disclosure I am looking into re-creating DDOS and other attacks doing some security research and so I have valid use cases here for wanting to do so. You state it's easy to patch Nutch (you are correct for that matter, it's a 2 line patch to Fetcher.java to disable the RobotRules check). However, how is it any less easy to have a 1 line patch that someone would have to apply to *override* the *default* behavior I'm suggesting of RobotRules being on in nutch-default.xml? So what I'm stating literally in code is: 1. adding a property like nutch.robots.rules.parser and setting its default value to true, which enables the robot rules parser, putting this property say even at the bottom of nutch-default.xml and stating that improper use of this property in regular situations of whole web crawls can really hurt your crawling of a site. 2. Having a check in Fetcher.java that checks for this property; if it's on, default behavior; if it's off, skip the check.
The benefit being you don't encourage people like me (and lots of others that I've talked to) who would like to use Nutch for some security research for crawling to simply go fork it for a 1 line code change. Really? Is that what you want to encourage? The really negative part about that is that it will encourage me to simply use that forked version. I could maintain a patch file, and apply that, but it's going to fall out of date with updates to Nutch, I'm going to have to update that patch file if nutch-default.xml changes (and so will other people, etc.) As you already stated too, we have properties in Nutch that can turn Nutch into a DDOS crawler with or without robots.txt rule parsing. We set these properties to *sensible defaults*. I'm proposing a compromise that helps people like me; encourages me to keep using Nutch through simplification; and is no less worse than the few other properties that we already expose in Nutch configuration to allow it to be turned into a DDOS bot (which by the way, there are bad uses of, but there are also good [security research] uses of as well, to prevent the bad guys). I appreciate it if you made it this far and hope you will reconsider. Cheers, Chris
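In nutch-default.xml terms, step 1 of the proposal would read something like this (a sketch assembled from the message above, not from a committed patch):
{code:xml}
<property>
  <name>nutch.robots.rules.parser</name>
  <value>true</value>
  <description>Whether the fetcher checks robots.txt rules before fetching
  a URL. Leave this at true; improper use of this property in regular
  whole-web crawl situations can really hurt your crawling of a site.
  </description>
</property>
{code}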
Re: Blog topic: Maxmind's GeoIP2 API being used in Apache Nutch 1.10
Hi Susan, Just acknowledging this email. I will write this up during my lunch hour today. Thanks lewis On Thu, Jan 29, 2015 at 6:36 AM, Susan Fendrock sfendr...@maxmind.com wrote: Hello Lewis! Thanks for getting in touch with us about potentially providing a contribution to our blog. Could you provide a brief summary of the blog post you are envisioning? Look forward to learning more about your project, Susan -- Susan Fendrock Product Marketing MaxMind, Inc. 617-500-4493 ext. 820 -- *Lewis*
[jira] [Assigned] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-1927: Assignee: Chris A. Mattmann Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: Bug Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Based on discussion on the dev list, to use Nutch for some valid security research use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist:
{code:xml}
<property>
  <name>robot.rules.whitelist</name>
  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for.</description>
</property>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1927 started by Chris A. Mattmann. Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: Bug Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Based on discussion on the dev list, to use Nutch for some valid security research use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist:
{code:xml}
<property>
  <name>robot.rules.whitelist</name>
  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for.</description>
</property>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM
[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297701#comment-14297701 ] Sebastian Nagel commented on NUTCH-1918: +1 TikaParser specifies a default namespace when generating DOM Key: NUTCH-1918 URL: https://issues.apache.org/jira/browse/NUTCH-1918 Project: Nutch Issue Type: Bug Components: parser Reporter: Julien Nioche Fix For: 1.10 Attachments: NUTCH-1918.patch The DOM generated by parse-tika differs from the one produced by parse-html. Ideally we should be able to use either parser with the same XPath expressions. This is related to [NUTCH-1592], but this time instead of being a matter of uppercase, the problem comes from the namespace used. This issue has been investigated and fixed in storm-crawler [https://github.com/DigitalPebble/storm-crawler/pull/58]. Here is what Guillaume explained there: bq. When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the XHTML namespace. bq. However in XPath 1.0, there's no concept of default namespace, so XPath expressions such as //BODY don't match anything. To make this work we should use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml". bq. To keep the XPath expressions simpler, I modified the DOMBuilder, which is our SAX handler used to convert the SAX events into a DOM tree, to ignore a default namespace; the ParserBolt initializes it with the XHTML namespace. This way //BODY matches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
Chris A. Mattmann created NUTCH-1927: Summary: Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: Bug Reporter: Chris A. Mattmann Based on discussion on the dev list, to use Nutch for some valid security research use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist:
{code:xml}
<property>
  <name>robot.rules.whitelist</name>
  <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for.</description>
</property>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
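On the Fetcher side, the corresponding check could plausibly take a shape like the following (a hedged sketch against the proposed robot.rules.whitelist property; {{fit}} is a Fetcher FetchItem, and the exact wiring in the eventual patch may differ):
{code}
// Skip the robots.txt check for whitelisted hosts/IPs (illustrative only;
// assumes java.util collections and crawler-commons BaseRobotRules).
Set<String> whitelist = new HashSet<String>(
    Arrays.asList(conf.getTrimmedStrings("robot.rules.whitelist")));

if (!whitelist.contains(fit.u.getHost())) {
  BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
  if (!rules.isAllowed(fit.u.toString())) {
    // handle as ProtocolStatus.ROBOTS_DENIED, as the fetcher does today
    continue;
  }
}
// whitelisted or allowed by robots.txt: proceed to fetch
{code}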
[jira] [Work stopped] (NUTCH-1924) Nutch + HBase Docker
[ https://issues.apache.org/jira/browse/NUTCH-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1924 stopped by Radosław Stankiewicz. --- Nutch + HBase Docker Key: NUTCH-1924 URL: https://issues.apache.org/jira/browse/NUTCH-1924 Project: Nutch Issue Type: Sub-task Components: build Reporter: Lewis John McGibbney Assignee: Radosław Stankiewicz Fix For: 2.4 ZooKeeper 3.4.5 Hadoop 0.20.204 HBase 0.90.4 Nutch 2.2.1 https://registry.hub.docker.com/u/stankiewicz/hbase_hadoop_nutch/dockerfile/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1924) Nutch + HBase Docker
[ https://issues.apache.org/jira/browse/NUTCH-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297888#comment-14297888 ] Lewis John McGibbney commented on NUTCH-1924: - hi [~rrydziu] Nutch currently uses SVN for SCM. You can check out the source [here|http://svn.apache.org/repos/asf/nutch/branches/2.x/] You can merely add your contribution there and then attach a patch here
{code}
svn co http://svn.apache.org/repos/asf/nutch/branches/2.x/
cd 2.x
cp -r /path/to/docker_container docker/hbase
svn add docker/hbase
svn diff > NUTCH-1924.patch
{code}
Great work so far. We are getting close to a 0.6 release of Gora which would mean that we can upgrade HBase to a 0.98.X version. Nutch + HBase Docker Key: NUTCH-1924 URL: https://issues.apache.org/jira/browse/NUTCH-1924 Project: Nutch Issue Type: Sub-task Components: build Reporter: Lewis John McGibbney Assignee: Radosław Stankiewicz Fix For: 2.4 ZooKeeper 3.4.5 Hadoop 0.20.204 HBase 0.90.4 Nutch 2.2.1 https://registry.hub.docker.com/u/stankiewicz/hbase_hadoop_nutch/dockerfile/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)