[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177054#comment-14177054 ]
Tim Allison edited comment on TIKA-1302 at 10/20/14 5:22 PM:
-------------------------------------------------------------
That would be a fantastic resource. Thank you for sharing! We could do a bit
of munging to prioritize the most common exceptions in dependencies.
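
That munging could be as simple as tallying exception class names and sorting
by frequency. A minimal sketch follows; the class name ExceptionTally and the
input shape (one exception class name per failed file) are assumptions for
illustration, not anything that exists in Tika:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika.
final class ExceptionTally {

    /**
     * Counts how often each exception class name occurs and returns the
     * entries ordered most frequent first. Ties stay in alphabetical order
     * because the counts are collected into a TreeMap and the sort is stable.
     */
    static List<Map.Entry<String, Long>> rank(List<String> exceptionClassNames) {
        Map<String, Long> counts = exceptionClassNames.stream()
                .collect(Collectors.groupingBy(name -> name, TreeMap::new, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample input: one exception class name per failed file.
        List<String> seen = Arrays.asList(
                "org.xml.sax.SAXParseException",
                "java.io.IOException",
                "org.xml.sax.SAXParseException");
        for (Map.Entry<String, Long> e : rank(seen)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Grouping by the top stack frame's package instead of the exception class would
point more directly at the dependency responsible, at the cost of needing the
full stack traces.
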
Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on
the govdocs1 corpus, but in the same ballpark. Interesting.
Do you know how many permanent hangs you had and can you identify those files
easily enough? I had about 6 in the govdocs1 corpus.
Thank you!
P.S. On the SAXParseExceptions...did those come from the XMLParser or from the
HtmlParser? I recently discovered that we hardcode an override in TikaResource
within tika-server:
{noformat}
parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
{noformat}
I'm not sure that we should hardcode that, but it does make sense to use that
configuration!
> Let's run Tika against a large batch of docs nightly
> ----------------------------------------------------
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
> Issue Type: Improvement
> Components: cli, general, server
> Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and
> running again, it might be fun to run Tika regularly against a large set of
> docs and report metrics.
> One excellent candidate corpus is govdocs1:
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?
> [~willp-bl], have anything handy you'd like to contribute?
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
> ;)