[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045219#comment-14045219 ]

Matthias Krueger commented on TIKA-1332:
----------------------------------------

It might be good to distinguish between the regression-testing aspect of nightly runs and the "extraction gap discovery" aspect of running Tika against a large batch of previously untested docs.

For regression testing, it would be good to generate stats on a run and compare them with the last known "good" stats. These stats could include:
* Number/distribution of detected MIME types
* Number of exceptions thrown, per type of exception
* Frequencies of metadata key-value pairs
* Frequencies of different word lengths extracted from content (per file type)

This could run unsupervised, with the delta to the last known "good" run summarized in a daily report (a rough sketch of that idea follows the quoted issue text below). Deeper analysis of extracted metadata and content (as in cases 2 and 3 of Tim's list) sounds more like "gap discovery", which I guess will always need some supervision.

> Create "eval" code
> ------------------
>
>                 Key: TIKA-1332
>                 URL: https://issues.apache.org/jira/browse/TIKA-1332
>             Project: Tika
>          Issue Type: Sub-task
>          Components: cli, general, server
>            Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run
> (# of exceptions per file type, most common exceptions per file type,
> number of metadata items, total text extracted, etc.). We should also be
> able to compare one run against another. Going forward, there's plenty of
> room to improve.
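To make the per-run stats and delta comparison above concrete, here is a minimal sketch in Java. It is an illustration only, assuming a run is reduced to simple counters keyed by strings such as "mime:<type>" and "exception:<class>"; the RunStats class and its methods are hypothetical and not an existing Tika API.

// Hypothetical sketch only -- RunStats is not an existing Tika class.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class RunStats {

    // e.g. "mime:application/pdf" -> 1234,
    //      "exception:org.apache.tika.exception.TikaException" -> 7
    private final Map<String, Long> counts = new HashMap<>();

    public void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Non-zero differences between this run and a baseline ("last known good") run. */
    public Map<String, Long> deltaAgainst(RunStats baseline) {
        Map<String, Long> delta = new HashMap<>();
        TreeSet<String> allKeys = new TreeSet<>(counts.keySet());
        allKeys.addAll(baseline.counts.keySet());
        for (String key : allKeys) {
            long diff = counts.getOrDefault(key, 0L)
                    - baseline.counts.getOrDefault(key, 0L);
            if (diff != 0L) {
                delta.put(key, diff);
            }
        }
        return delta;
    }
}

In this sketch, a nightly harness would call increment("mime:" + detectedType) once per file and increment("exception:" + e.getClass().getName()) for each failure, persist the counters, and feed deltaAgainst(lastKnownGoodRun) into the daily report; metadata-key and word-length frequencies could be folded into the same key space.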