Julien Nioche created NUTCH-1729:
Summary: Upgrade to Tika 1.5
Key: NUTCH-1729
URL: https://issues.apache.org/jira/browse/NUTCH-1729
Project: Nutch
Issue Type: Task
Components:
[
https://issues.apache.org/jira/browse/NUTCH-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche updated NUTCH-1729:
-
Attachment: NUTCH-1729-2.x.patch
patch for 2.x
Upgrade to Tika 1.5
---
Hi Sebastian,
Developing a separate job is a good idea. With this approach, we can
also collect info about non-HTML documents. Moreover, a job approach
will also allow us to collect info about historically crawled pages. And
as you said, we do not have to store the info in WebPage.
However,
[
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907160#comment-13907160
]
Sebastian Nagel commented on NUTCH-1113:
Hi [~markus17], tried test data from
See https://builds.apache.org/job/Nutch-trunk/2536/
--
[...truncated 1226 lines...]
A src/plugin/lib-http/src/java/org/apache/nutch
A src/plugin/lib-http/src/java/org/apache/nutch/protocol
A