[
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209757#comment-14209757
]
Andrew Jackson edited comment on TIKA-1302 at 11/13/14 1:42 PM:
----------------------------------------------------------------
[[email protected]] I've created a download folder on our own site, and
included a dump of about 1/8th of the SAX errors, here:
http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/
The SAX exceptions do seem to come from resources that Tika identifies as XML
(application/*xml). In other words, the exceptions do *not* seem to be coming
from malformed HTML, which is consistent with the standard Tika configuration
you described above (which I can confirm is what we ran with).
Unfortunately, I can't recover the full stack traces from that run. Because of
the way we're doing the indexing, it's not clear whether we'll be able to
capture them in future runs, but we'll look into it and hopefully record the
full error from now on. For now, you'll have to re-run the source item through
Tika to reproduce the error. Sorry about that.
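As a quick triage step before re-running items through Tika itself, something
like the following might help separate resources that are simply not
well-formed XML from anything Tika-specific. This is only a rough sketch: it
uses the JDK's own SAX parser rather than Tika, and the class name SaxRepro and
the method saxErrorTrace are made up for illustration. It also assumes that
external DTDs can safely be replaced with empty content so that nothing is
fetched over the network.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: parses a resource with the JDK SAX
// parser and keeps the full stack trace if the parse fails.
class SaxRepro {

    /** Returns null if the stream is well-formed XML, else the full stack trace. */
    static String saxErrorTrace(InputStream in) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            factory.newSAXParser().parse(in, new DefaultHandler() {
                // Assumption: resolve every external entity to empty content
                // so that DTDs are never downloaded during the check.
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    return new InputSource(new StringReader(""));
                }
            });
            return null;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so they end up here.
            StringWriter trace = new StringWriter();
            e.printStackTrace(new PrintWriter(trace, true));
            return trace.toString();
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not run the SAX parser", e);
        }
    }

    // Usage: java SaxRepro file1.xml file2.xml ...
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            try (InputStream in = Files.newInputStream(Paths.get(path))) {
                String trace = saxErrorTrace(in);
                System.out.println(path + ": "
                        + (trace == null ? "well-formed" : "SAX error\n" + trace));
            }
        }
    }
}
```

A resource that fails here should be expected to fail in any SAX-based parser,
whereas one that passes here but still fails in Tika would be the more
interesting case to look at.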
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)