[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054
 ] 

Tim Allison commented on TIKA-1302:
-----------------------------------

That would be a fantastic resource.  Thank you for sharing!  We could do a bit 
of munging to prioritize the most common exceptions in dependencies.

Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on 
the govdocs1 corpus, but in the same ballpark.  Interesting.

Do you know how many permanent hangs you had, and can you identify those files 
easily enough?  I had about 6 in the govdocs1 corpus.

Thank you!

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
