[jira] [Resolved] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez resolved NUTCH-1962.
---
Resolution: Fixed

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372572#comment-14372572
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1958:
---

+1 

 Remove scoring-opic from nutch-default.xml
 --

 Key: NUTCH-1958
 URL: https://issues.apache.org/jira/browse/NUTCH-1958
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.4, 1.10


 I propose we remove scoring-opic from nutch-default. We all know it is flawed 
 for any kind of incremental crawl, which most of us do. It is also useless 
 for a single crawl: if you must crawl all records of a domain, using OPIC to 
 prioritize URLs makes no sense. It also confuses users, as we have seen in 
 the past and recently [1].
 What do you think?
 [1]: 
 http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html





[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-21 Thread Asitang Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asitang Mishra updated NUTCH-1941:
--
Attachment: NUTCH-1941-ITR2.patch

Added: NUTCH-1941-ITR2.patch
This patch changes the HttpBase class. A single instance of this class is 
shared by the different fetcher threads, so the getter for the agent name is 
now synchronized.
The function rotateAgentName rotates the agent name every x URLs fetched. 
The value of x is chosen randomly between 1 and 50 (a different value can be 
used here).
The list of names to rotate through comes from a file, agent.txt, which should 
be kept in the nutch/runtime/local folder of your Nutch installation.
Each line in this file should contain one agent name.
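
For readers without the patch at hand, the rotation described above might look roughly like the sketch below. This is not the code in NUTCH-1941-ITR2.patch: the class name RotatingAgentNames, the methods fromFile and nextAgentName, and the constant MAX_INTERVAL are all hypothetical names chosen for illustration. It only assumes what the comment states: one name per line in a file, a random interval x between 1 and 50, and a synchronized accessor because fetcher threads share one instance.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper illustrating the idea; not part of Nutch or of the patch. */
class RotatingAgentNames {
  // Upper bound for x, taken from the comment above; any other value would work.
  private static final int MAX_INTERVAL = 50;

  private final List<String> names;
  private final Random random;
  private int index = 0;   // position of the agent name currently in use
  private int remaining;   // fetches left before the next rotation

  RotatingAgentNames(List<String> names, Random random) {
    if (names.isEmpty()) {
      throw new IllegalArgumentException("need at least one agent name");
    }
    this.names = new ArrayList<>(names);
    this.random = random;
    this.remaining = nextInterval();
  }

  /** Reads one agent name per line, skipping blank lines. */
  static RotatingAgentNames fromFile(Path file) throws IOException {
    List<String> names = new ArrayList<>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
      if (!line.trim().isEmpty()) {
        names.add(line.trim());
      }
    }
    return new RotatingAgentNames(names, new Random());
  }

  // Picks x uniformly from 1..MAX_INTERVAL.
  private int nextInterval() {
    return 1 + random.nextInt(MAX_INTERVAL);
  }

  /**
   * Returns the agent name to use for one fetch. Synchronized because a
   * single instance is assumed to be shared by all fetcher threads.
   */
  synchronized String nextAgentName() {
    if (remaining == 0) {
      index = (index + 1) % names.size();  // move on to the next name
      remaining = nextInterval();          // draw a fresh x for that name
    }
    remaining--;
    return names.get(index);
  }
}
```

Each fetch would call nextAgentName() once, so a given name is used for x consecutive fetches before the next one takes over. Keeping the counter and the index behind one synchronized method avoids two threads rotating at the same time.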

 Optional rolling http.agent.name's
 --

 Key: NUTCH-1941
 URL: https://issues.apache.org/jira/browse/NUTCH-1941
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, protocol
Reporter: Lewis John McGibbney
Priority: Trivial
 Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-ver1.patch, 
 agent.names.txt, nutch.patch


 In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins 
 can block your fetcher based merely on your crawler name. 
 I propose the ability to implement rolling http.agent.name's which could be 
 substituted every 5 seconds for example. This would mean that successive 
 requests to the same domain would be sent with different http.agent.name. 
 This behavior should be off by default.





TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors

2015-03-21 Thread Anvesha Sinha
Hi everyone,

While installing TIKA, I am getting the following error:

Tests run: 3, Failures: 2, Errors: 0, Skipped: 1, Time elapsed: 0.209 sec FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.118 sec   FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
*at org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(TestGDALParser.java:70)*
testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.062 sec   FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
*at org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(TestGDALParser.java:111)*

Just to clarify, this error is not the same as

testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.206 sec   FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
*at org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(TestGDALParser.java:153)*

which was rectified by tpalsulich in Revision 1647742. Any guidance/help
would be appreciated.

Thanks,
Anvesha
-- 
Graduate Student (MS in Computer Science)
University of Southern California
*Phone: (+1) 213-308-9002*