[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045219#comment-14045219
 ] 

Matthias Krueger commented on TIKA-1332:
----------------------------------------

It might be good to distinguish between the regression testing aspect of 
nightly runs and the "extraction gap discovery" aspect of running Tika against 
a large batch of previously untested docs.

For regression testing it would be good to generate stats on a run and compare 
them with the last known "good" stats. These stats could include:
* Number/distribution of detected MIME types
* Number of exceptions thrown per type of exception
* Frequencies of metadata key-value pairs
* Frequencies of different word lengths extracted from content (per file type)
This could be run unsupervised with the delta to the last known "good" run 
summarized in a daily report.
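
A minimal sketch of such a comparison, in Python only for brevity. The 
per-file record layout (keys "mime", "exception", "metadata", "content") and 
the function names are assumptions made up for illustration; none of this is 
an existing Tika API.

```python
from collections import Counter


def collect_stats(records):
    """Aggregate the four statistics over one run.

    `records` is an iterable of dicts, one per processed file, using the
    hypothetical keys 'mime', 'exception', 'metadata' and 'content'.
    """
    stats = {
        "mime_types": Counter(),
        "exceptions": Counter(),
        "metadata_pairs": Counter(),
        "word_lengths": Counter(),
    }
    for rec in records:
        mime = rec.get("mime", "unknown")
        stats["mime_types"][mime] += 1
        if rec.get("exception"):
            stats["exceptions"][rec["exception"]] += 1
        for key, value in (rec.get("metadata") or {}).items():
            stats["metadata_pairs"][(key, value)] += 1
        # Word lengths are tracked per file type, keyed by (mime, length).
        for word in (rec.get("content") or "").split():
            stats["word_lengths"][(mime, len(word))] += 1
    return stats


def diff_stats(baseline, current):
    """Return {stat_name: {key: (old_count, new_count)}} for changed counts."""
    delta = {}
    for name in baseline.keys() | current.keys():
        old = baseline.get(name, Counter())
        new = current.get(name, Counter())
        changed = {
            k: (old[k], new[k])
            for k in old.keys() | new.keys()
            if old[k] != new[k]
        }
        if changed:
            delta[name] = changed
    return delta
```

An empty result from diff_stats(last_good, tonight) would mean nothing moved; 
anything else is what the daily report would list. A real report would 
probably apply thresholds instead of flagging every changed count.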

Deeper analysis of extracted metadata and content (as in cases 2 and 3 of 
Tim's list) sounds more like "gap discovery", which I guess would always need 
some supervision.


> Create "eval" code
> ------------------
>
>                 Key: TIKA-1332
>                 URL: https://issues.apache.org/jira/browse/TIKA-1332
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run (# of 
> exceptions per file type, most common exceptions per file type, number of 
> metadata items, total text extracted, etc).  We should also be able to 
> compare one run against another.  Going forward, there's plenty of room to 
> improve.



--
This message was sent by Atlassian JIRA
(v6.2#6252)