[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757
 ] 

Andrew Jackson commented on TIKA-1302:
--------------------------------------

[~talli...@apache.org] I've created a download folder on our own site and 
put a dump of about one eighth of the SAX errors there: 
http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/

Looking through the SAX exceptions, they do seem to come from resources that 
Tika identifies as XML (application/*xml). That is, the exceptions do *not* 
seem to be coming from malformed HTML, which is consistent with the standard 
Tika configuration you described above (and I can confirm that is what we ran 
with).

Unfortunately, I can't recover the full stack traces from that run. Because of 
the way we're doing the indexing, it's not clear whether we'll be able to 
capture them in future runs either, but we'll look into it and hopefully 
record the full error from then on. For now, you'll have to re-run the source 
item through Tika to reproduce the error. Sorry about that.
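
In case it helps with reproducing, here is a rough sketch of capturing the full 
stack trace as text when a single item fails to parse. It is not our indexing 
code and it does not call Tika: it uses the JDK's own SAX parser as a stand-in, 
on the assumption that this is close enough to what happens for application/*xml 
resources. The class and method names (SaxRepro, parseAndCaptureTrace) are made 
up for illustration. With Tika on the classpath, the same try/catch could wrap a 
call to Tika's AutoDetectParser instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parses one resource with the JDK SAX parser
// (a stand-in for Tika, not Tika itself) and keeps the whole stack trace.
class SaxRepro {

    /** Returns null if the stream parses cleanly, else the full stack trace. */
    static String parseAndCaptureTrace(InputStream in) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores content and rethrows fatal parse errors.
            factory.newSAXParser().parse(in, new DefaultHandler());
            return null;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Render the complete trace, not just e.getMessage().
            StringWriter out = new StringWriter();
            e.printStackTrace(new PrintWriter(out, true));
            return out.toString();
        }
    }

    // Usage: java SaxRepro path/to/source-item.xml
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String trace = parseAndCaptureTrace(in);
            System.out.println(trace == null ? "parsed cleanly" : trace);
        }
    }
}
```

Returning the trace as a string rather than printing it means it could be 
stored alongside the item in whatever record the indexer already writes.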

> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
>                 Key: TIKA-1302
>                 URL: https://issues.apache.org/jira/browse/TIKA-1302
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
