[jira] [Created] (NUTCH-1729) Upgrade to Tika 1.5

2014-02-20 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1729: Summary: Upgrade to Tika 1.5 Key: NUTCH-1729 URL: https://issues.apache.org/jira/browse/NUTCH-1729 Project: Nutch Issue Type: Task Components:

[jira] [Updated] (NUTCH-1729) Upgrade to Tika 1.5

2014-02-20 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1729: Attachment: NUTCH-1729-2.x.patch patch for 2.x Upgrade to Tika 1.5 ---

Re: Getting statistics about crawled pages

2014-02-20 Thread Alparslan Avcı
Hi Sebastian, Developing a separate job is a good idea. With this approach, we can also collect info about non-HTML documents. Moreover, a job approach will also allow us to collect info about historically crawled pages. And as you said, we do not have to store the info in WebPage. However,

[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-02-20 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907160#comment-13907160 ] Sebastian Nagel commented on NUTCH-1113: Hi [~markus17], tried test data from

Build failed in Jenkins: Nutch-trunk #2536

2014-02-20 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2536/ -- [...truncated 1226 lines...] A src/plugin/lib-http/src/java/org/apache/nutch A src/plugin/lib-http/src/java/org/apache/nutch/protocol A