[jira] [Resolved] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez resolved NUTCH-1962.
---
Resolution: Fixed

Need to have mimetype-filter.txt file available by default
--
Key: NUTCH-1962
URL: https://issues.apache.org/jira/browse/NUTCH-1962
Project: Nutch
Issue Type: Improvement
Components: plugin
Reporter: Lewis John McGibbney
Fix For: 1.10
Attachments: NUTCH-1962.patch

By default the mimetype-filter.txt file referenced in nutch-default.xml is not available. We need to provide it, as it is a PITA to constantly have to add it to new crawler configurations.

https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372572#comment-14372572 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1958:
---
+1

Remove scoring-opic from nutch-default.xml
--
Key: NUTCH-1958
URL: https://issues.apache.org/jira/browse/NUTCH-1958
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 2.4, 1.10

I propose we remove scoring-opic from nutch-default. We all know it is flawed for any kind of incremental crawl, which most of us do. It is also useless for a single crawl: if you must crawl all records of a domain, using OPIC to prioritize URLs makes no sense. It also confuses users, as we have seen in the past and recently [1]. What do you think?

[1]: http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html
[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's
[ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asitang Mishra updated NUTCH-1941:
--
Attachment: NUTCH-1941-ITR2.patch

Added: NUTCH-1941-ITR2.patch

This patch changes the HttpBase class. A single instance of this class is shared by the different fetcher threads, so I have made the getter of the agent name synchronized. The function rotateAgentName rotates the agent name every x URLs fetched. The value of x is chosen randomly between 1 and 50 (a different upper bound can be used here). The list of names to rotate through comes from a file agent.txt, which should be kept in the nutch/runtime/local folder of your Nutch installation. Each line in this file should contain one agent name.

Optional rolling http.agent.name's
--
Key: NUTCH-1941
URL: https://issues.apache.org/jira/browse/NUTCH-1941
Project: Nutch
Issue Type: New Feature
Components: fetcher, protocol
Reporter: Lewis John McGibbney
Priority: Trivial
Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-ver1.patch, agent.names.txt, nutch.patch

In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins can block your fetcher based merely on your crawler name. I propose the ability to implement rolling http.agent.name's, which could be substituted every 5 seconds, for example. This would mean that successive requests to the same domain would be sent with a different http.agent.name. This behavior should be off by default.
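The rotation described in the comment above can be sketched roughly as follows. This is not the code from NUTCH-1941-ITR2.patch: the class name RotatingAgentName and its methods are hypothetical, and the sketch assumes only what the comment states (one name per line in a file, a random budget of 1 to 50 URLs per name, and a synchronized accessor because fetcher threads share one instance). How it would be wired into HttpBase is left open.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper illustrating the rotation idea; not part of Nutch. */
final class RotatingAgentName {
    // Upper bound on URLs per agent name, taken from the comment ("between 1 and 50").
    private static final int MAX_URLS_PER_AGENT = 50;

    private final List<String> names;
    private final Random random;
    private int index = 0;   // position of the agent name currently in use
    private int remaining;   // URLs left before switching to the next name

    RotatingAgentName(List<String> names, Random random) {
        if (names == null || names.isEmpty()) {
            throw new IllegalArgumentException("at least one agent name is required");
        }
        this.names = new ArrayList<>(names);
        this.random = random;
        this.remaining = nextBudget();
    }

    /** Reads one agent name per line, skipping blank lines. */
    static RotatingAgentName fromFile(Path file) throws IOException {
        List<String> names = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return new RotatingAgentName(names, new Random());
    }

    // Picks a random budget x with 1 <= x <= MAX_URLS_PER_AGENT.
    private int nextBudget() {
        return 1 + random.nextInt(MAX_URLS_PER_AGENT);
    }

    /**
     * Returns the agent name to use for the next fetched URL.
     * Synchronized because several fetcher threads would share one instance.
     */
    synchronized String nextAgentName() {
        if (remaining == 0) {
            index = (index + 1) % names.size();
            remaining = nextBudget();
        }
        remaining--;
        return names.get(index);
    }
}
```

A caller would invoke nextAgentName() once per request and use the result as the agent name for that request. Counting URLs rather than seconds follows the patch comment; the issue description's "every 5 seconds" variant would need a timestamp check instead of a counter.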
TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors
Hi everyone,

While installing Tika, I am getting the following error:

Tests run: 3, Failures: 2, Errors: 0, Skipped: 1, Time elapsed: 0.209 sec FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser

testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.118 sec  FAILURE!
java.lang.AssertionError: null
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertNotNull(Assert.java:621)
    at org.junit.Assert.assertNotNull(Assert.java:631)
    *at org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(TestGDALParser.java:70)*

testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.062 sec  FAILURE!
java.lang.AssertionError: null
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertNotNull(Assert.java:621)
    at org.junit.Assert.assertNotNull(Assert.java:631)
    *at org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(TestGDALParser.java:111)*

Just to clarify, this error is not the same as

testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.206 sec  FAILURE!
java.lang.AssertionError
    at org.junit.Assert.fail(Assert.java:86)
    at org.junit.Assert.assertTrue(Assert.java:41)
    at org.junit.Assert.assertNotNull(Assert.java:621)
    at org.junit.Assert.assertNotNull(Assert.java:631)
    *at org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(TestGDALParser.java:153)*

which was rectified by tpalsulich in Revision 1647742.

Any guidance/help would be appreciated.

Thanks,
Anvesha

--
Graduate Student (MS in Computer Science)
University of Southern California
*Phone: (+1) 213-308-9002*