[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176892#comment-14176892
 ] 

Andrew Jackson commented on TIKA-1302:
--------------------------------------

At the UK Web Archive we run Apache Tika over all our collections (about 4 
billion resources so far). We record the results in Apache Solr, where they 
serve as a search facet, and we also collect the exceptions thrown when Tika 
fails. We can't make the content available to you directly, but perhaps there 
are datasets we could produce that would be useful to you. For example, would 
a list of the exceptions we've seen (along with the URL of the resource that 
caused each exception) be of interest?

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



