date:20150321

[jira] [Resolved] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez resolved NUTCH-1962.
---
Resolution: Fixed

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372572#comment-14372572
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1958:
---

+1 

 Remove scoring-opic from nutch-default.xml
 --

 Key: NUTCH-1958
 URL: https://issues.apache.org/jira/browse/NUTCH-1958
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.4, 1.10


 I propose we remove scoring-opic from nutch-default. We all know it is flawed 
 for any kind of incremental crawl, which most of us do. It is also useless 
 for a single crawl: if you must crawl all records of a domain, using OPIC to 
 prioritize URLs makes no sense. It also confuses users, as we have seen in 
 the past and again recently [1].
 What do you think?
 [1]: 
 http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html




[jira] [Updated] (NUTCH-1941) Optional rolling http.agent.name's

2015-03-21 Thread Asitang Mishra (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Asitang Mishra updated NUTCH-1941:
--
Attachment: NUTCH-1941-ITR2.patch

Added: NUTCH-1941-ITR2.patch
This patch changes the HttpBase class. A single instance of this class is
shared by the different fetcher threads, so the getter for the agent name is
now synchronized.
The function rotateAgentName rotates the agent name every x URLs fetched. The
value of x is chosen randomly between 1 and 50 (a different range can be used
here).
The list of names to rotate through comes from a file agent.txt, which should
be kept in the nutch/runtime/local folder of your Nutch installation.
Each line in this file should contain one agent name.
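
For readers without the patch at hand, the rotation described above could
look roughly like the sketch below. This is not the code from
NUTCH-1941-ITR2.patch: the class name RotatingAgentName, the fromFile helper
and the constant MAX_URLS_PER_NAME are hypothetical names chosen for
illustration. The only details taken from the description are the
synchronized getter, the random interval between 1 and 50, and the file with
one agent name per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration; not the class changed by the patch. */
final class RotatingAgentName {
    // Upper bound for x, the number of URLs fetched before the name changes.
    private static final int MAX_URLS_PER_NAME = 50;

    private final List<String> names;
    private final Random random;
    private int index = 0;   // position of the name currently in use
    private int remaining;   // fetches left before the next rotation

    RotatingAgentName(List<String> names, Random random) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("need at least one agent name");
        }
        this.names = new ArrayList<>(names);
        this.random = random;
        this.remaining = nextInterval();
    }

    /** Reads one agent name per line, skipping blank lines. */
    static RotatingAgentName fromFile(Path file, Random random) throws IOException {
        List<String> names = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return new RotatingAgentName(names, random);
    }

    // Picks x uniformly from 1..MAX_URLS_PER_NAME.
    private int nextInterval() {
        return 1 + random.nextInt(MAX_URLS_PER_NAME);
    }

    /**
     * Synchronized because one instance is shared by all fetcher threads.
     * Each call counts as one fetched URL.
     */
    synchronized String getAgentName() {
        if (remaining == 0) {
            index = (index + 1) % names.size();
            remaining = nextInterval();
        }
        remaining--;
        return names.get(index);
    }
}
```

Counting fetches inside the synchronized getter keeps the counter and the
current index consistent when several threads ask for the name at once.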

Optional rolling http.agent.name's
--

Key: NUTCH-1941
URL: https://issues.apache.org/jira/browse/NUTCH-1941
Project: Nutch
Issue Type: New Feature
Components: fetcher, protocol
Reporter: Lewis John McGibbney
Priority: Trivial
Attachments: NUTCH-1941-ITR2.patch, NUTCH-1941-ver1.patch,
agent.names.txt, nutch.patch

In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins
can block your fetcher based merely on your crawler name.
I propose the ability to implement rolling http.agent.name's which could be
substituted every 5 seconds for example. This would mean that successive
requests to the same domain would be sent with different http.agent.name.
This behavior should be off by default.


TestGDALParser.testParseBasicInfo and TestGDALParser.testParseMetadata errors

2015-03-21 Thread Anvesha Sinha

Hi everyone,

While installing TIKA, I am getting the following error:

Tests run: 3, Failures: 2, Errors: 0, Skipped: 1, Time elapsed: 0.209 sec FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
testParseBasicInfo(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.118 sec FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
*at org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(TestGDALParser.java:70)*

testParseMetadata(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.062 sec FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
*at org.apache.tika.parser.gdal.TestGDALParser.testParseMetadata(TestGDALParser.java:111)*

Just to clarify, this error is not the same as

testParseFITS(org.apache.tika.parser.gdal.TestGDALParser)  Time elapsed: 0.206 sec FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
*at org.apache.tika.parser.gdal.TestGDALParser.testParseFITS(TestGDALParser.java:153)*

which was rectified by tpalsulich in Revision 1647742. Any guidance/help
would be appreciated.

Thanks,
Anvesha
-- 
Graduate Student (MS in Computer Science)
University of Southern California
*Phone: (+1) 213-308-9002*
