[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870408#comment-15870408
 ] 

Tim Allison commented on TIKA-1332:
---

Thank you for the feedback.  I agree.  Lucene is now downgraded to 5.x.  

Will wait for a clean build to resolve this time.

> Create "eval" code
> --
>
> Key: TIKA-1332
> URL: https://issues.apache.org/jira/browse/TIKA-1332
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 2.0, 1.15
>
> Attachments: comparison_reports.xml
>
>
> For this issue, we can start with code to gather statistics on each run (# of 
> exceptions per file type, most common exceptions per file type, number of 
> metadata items, total text extracted, etc).  We should also be able to 
> compare one run against another.  Going forward, there's plenty of room to 
> improve.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870398#comment-15870398
 ] 

Nick Burch commented on TIKA-1332:
--

Unless we really need a Lucene 6 feature, I'd suggest rolling back to 
Lucene 5.x for now to avoid surprises / confusion.



[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870392#comment-15870392
 ] 

Tim Allison commented on TIKA-1332:
---

Rolled back to Lucene 5.5.3 for now.



[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870351#comment-15870351
 ] 

Hudson commented on TIKA-1332:
--

UNSTABLE: Integrated in Jenkins build Tika-trunk #1198 (See [https://builds.apache.org/job/Tika-trunk/1198/])
TIKA-1332 -- initial commit for tika-eval module. More work remains. (tallison: rev aa7a0c353362d56cb1b8e77297f0807626b0246c)
* (add) tika-eval/src/test/java/org/apache/tika/eval/util/MimeUtilTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/ContrastStatistics.java
* (add) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file3_attachBNotA.doc
* (add) tika-eval/src/test/resources/log4j.properties
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file2_attachANotB.doc.json
* (add) tika-eval/pom.xml
* (add) tika-eval/src/test/resources/test-dirs/extractsB/file11_oom.txt.json
* (add) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenStatistics.java
* (add) tika-eval/src/test/resources/test-dirs/extractsB/file3_attachBNotA.doc.json
* (add) tika-eval/src/main/java/org/apache/tika/eval/batch/DBConsumersManager.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/AbstractDBBuffer.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/ColInfo.java
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file5_emptyA.pdf
* (add) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file8_IOEx.pdf
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file1.pdf.json
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/H2Util.java
* (add) tika-eval/src/test/resources/test-dirs/extractsB/file13_attachANotB.doc.txt
* (add) tika-eval/src/test/java/org/apache/tika/eval/reports/ResultsReporterTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCountPriorityQueue.java
* (add) tika-eval/src/test/resources/commontokens/zh-cn
* (add) tika-eval/src/main/resources/tika-eval-comparison-config.xml
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCounter.java
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file12_es.txt.json
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file10_permahang.txt.json
* (add) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XLSXHREFFormatter.java
* (add) tika-eval/src/main/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
* (add) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/DBUtil.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenIntPair.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogMsgHandler.java
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file7_badJson.pdf.json
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file2_attachANotB.doc
* (add) tika-eval/src/test/java/org/apache/tika/eval/ProfilerBatchTest.java
* (add) tika-eval/src/test/resources/commontokens/en
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file4_emptyB.pdf.json
* (add) tika-eval/src/test/java/org/apache/tika/eval/ComparerBatchTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
* (add) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (add) tika-eval/src/test/resources/test-dirs/batch-logs/batch-process-fatal.xml
* (add) tika-eval/src/main/resources/tika-eval-profiler-config.xml
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file1.pdf
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file11_oom.txt
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/TableInfo.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
* (add) tika-eval/src/test/resources/commontokens/zh-tw
* (add) tika-eval/src/test/resources/commontokens/es
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file9_noextract.txt
* (add) tika-eval/src/main/java/org/apache/tika/eval/EvalFilePaths.java
* (add) tika-eval/src/test/java/org/apache/tika/eval/db/AbstractBufferTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenContraster.java
* (add) tika-eval/

[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870321#comment-15870321
 ] 

Tim Allison commented on TIKA-1332:
---

Thank you, [~gagravarr]!



[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-15 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868028#comment-15868028
 ] 

Nick Burch commented on TIKA-1332:
--

Apache Ignite seems to use H2, and a Google search for H2 + apache.org shows 
quite a few other projects with connectors to it at least.

That said, there's also Apache Derby, which might cover the same use case.



[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867992#comment-15867992
 ] 

Tim Allison commented on TIKA-1332:
---

Are there any licensing objections to adding a dependency on the H2 database 
in the tika-eval module?  H2 is dual-licensed under MPL 2.0 and EPL 1.0.  These 
are both "weak copyleft" licenses and should be ok if we document them 
according to https://www.apache.org/legal/resolved#category-b.

As a side note, this dependency will only exist for the tika-eval module, not 
for any of the other modules.



[jira] [Commented] (TIKA-1332) Create "eval" code

2017-02-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861665#comment-15861665
 ] 

Tim Allison commented on TIKA-1332:
---

Some more work is required, but I think tika-eval is getting close to being 
ready to commit.  

If anyone has a chance to review, the code is on my 
[github fork|https://github.com/tballison/tika/tree/TIKA-1302] and the 
beginnings of the documentation are now up on our 
[wiki|https://wiki.apache.org/tika/TikaEval].

Thank you!



[jira] [Commented] (TIKA-1332) Create "eval" code

2016-04-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230287#comment-15230287
 ] 

Tim Allison commented on TIKA-1332:
---

I gave up on that, and we're now using httpd.

The eval code currently exists as commandline calls.  I'm using h2 as the 
backend database, which appears to be compatible with ASL 2.0.  As with all 
development cycles, I started with a flat file, moved to an unfortunately 
complex db structure and will probably have to move to nosql if we want this to 
scale...but not yet...

As above, there are two modes.
1) Profile a single run
   a) run tika-app on a directory of files, output with -J -t (JSON 
representation of List<Metadata> with text as the content)
   b) run the profiling code, which populates an h2 db
   c) run xml-configured reports against the db

2) Compare two runs
   a) run two versions of tika-app on a directory of files
   b) run the comparison code, which populates an h2 db
   c) run xml-configured reports against the db

I've pretty much given up on the notion of automatic testing.  A human has to 
look at the reports and make sense of them.

Given the feedback I received at ApacheCon (egads, a year ago), I think I'd 
like to transition this code into Tika for 1.14.

When the code is ready for review, I'll let y'all know.  Any and all feedback 
on the reports to date would be great.




[jira] [Commented] (TIKA-1332) Create "eval" code

2015-02-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326488#comment-14326488
 ] 

Tim Allison commented on TIKA-1332:
---

I got a simple Jetty ResourceHandler up and running on the vm today, but it 
kept failing on large archive files (~250MB).  I set {{idleTimeout}} and 
{{stopTimeout}} to large values, but still had no luck.

Has anyone had luck with Jetty's ResourceHandler and large files?  How about 
jax-rs for files of that size?


I notice that govdocs1 is using httpd.  Perhaps we'll want separate 
static/archive server ports vs. active jax-rs browsing?

I plan to start publishing static results of single runs and comparisons of 
runs over the next few weeks.





[jira] [Commented] (TIKA-1332) Create "eval" code

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222947#comment-14222947
 ] 

Tim Allison commented on TIKA-1332:
---

In a personal communication, I asked [~sergey_beryozkin] for recommendations 
for handling static content in the JAX-RS framework.  For the UI component of 
the eval code -- how the user interacts with the results of the eval -- is 
there an easy equivalent in JAX-RS that lets the user browse a directory of 
files and click on desired files for download as easily as one can with 
Jetty's ResourceHandler?

With permission, I'm posting/summarizing [~sergey_beryozkin]'s responses.  If 
anyone else has a recommendation for leveraging the JAX-RS framework for 
dynamic data while still using something as easy as Jetty's ResourceHandler 
for static content, please let us know.

Option 1: 
Handcode a JAX-RS handler that mimics Jetty's ResourceHandler
> That can be easy enough with JAX-RS though, if you'd like to explore
> this path, something like this I guess:
>
{noformat}
 import java.io.File;
 import java.net.URI;
 import java.util.LinkedList;
 import java.util.List;
 import javax.ws.rs.GET;
 import javax.ws.rs.Path;
 import javax.ws.rs.PathParam;
 import javax.ws.rs.Produces;
 import javax.ws.rs.core.Context;
 import javax.ws.rs.core.Response;
 import javax.ws.rs.core.UriInfo;

 @Path("eval")
 public class TikaEvaluation {
   @Context
   private UriInfo ui;

   @GET
   @Path("list")
   @Produces("text/html")
   public Response getListOfResultURIs() {
     List<URI> uris = new LinkedList<URI>();
     // getResultFiles() is not shown; it should return the result files
     for (File f : getResultFiles()) {
       // UriInfo offers getAbsolutePathBuilder(); each link is relative to "list"
       uris.add(ui.getAbsolutePathBuilder().path(f.getName()).build());
     }
     // uris list now has a list of links to individual files
     // next we need to decide how to convert that to HTML
     // one option is to return the list as is and redirect that to JSP,
     // another option is to build a basic HTML string right here in the
     // method, another option is to register a MessageBodyWriter that
     // will convert the list into HTML
     // the individual links will be managed by getFile() method

     return Response.ok(uris).build();
   }

   @GET
   @Path("list/{name}")
   // multiple media types must be passed as an array
   @Produces({"application/json", "multipart/mixed"})
   public Response getFile(@PathParam("name") String name) {
   ...
   }
 }
{noformat}
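
The "build a basic HTML string right here in the method" option mentioned in the comments above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class `UriListHtml` and its methods are hypothetical and not part of Tika or CXF, the sample URI is made up, and the link label is assumed to be the last path segment of each URI.

```java
import java.net.URI;
import java.util.List;

// Hypothetical helper, not part of Tika or CXF.
class UriListHtml {

    /** Escapes the characters that are significant in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders each URI as a list item linking to it, labelled with its last path segment. */
    static String toHtml(List<URI> uris) {
        StringBuilder sb = new StringBuilder("<html><body><ul>\n");
        for (URI uri : uris) {
            // Opaque URIs have no path; fall back to the full URI text in that case.
            String path = uri.getPath() == null ? uri.toString() : uri.getPath();
            String name = path.substring(path.lastIndexOf('/') + 1);
            sb.append("<li><a href=\"").append(escape(uri.toString())).append("\">")
              .append(escape(name)).append("</a></li>\n");
        }
        return sb.append("</ul></body></html>\n").toString();
    }

    public static void main(String[] args) {
        // Made-up URI, for illustration only.
        System.out.print(toHtml(List.of(URI.create("http://localhost/eval/list/file1.pdf.json"))));
    }
}
```

In the resource method, the resulting string could be returned with `Response.ok(html).build()` instead of the raw list; a MessageBodyWriter would keep the conversion out of the resource class, at the cost of more setup.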

Option 2:
Run Jetty's ResourceHandler from the same embedded Jetty server that is hosting 
the JAX-RS code.
> This link would probably be the best one: 
> [link|https://git-wip-us.apache.org/repos/asf?p=cxf.git;a=blob_plain;f=distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Server.java;hb=HEAD]

> Tika JAX-RS server actually runs on top of Jetty right now too, but in
> this case we have a direct Jetty server setup.
>
> The server registers a CXF servlet and Jetty handlers too. The CXF servlet
> can also redirect to default handlers, such as a default handler for serving
> static content. This is not needed if the result files are accessible
> over a URI that does not overlap with the CXF servlet URI pattern.
> In fact, I wonder if a Tika JAX-RS style of registration may also do?
> If you register a CXF endpoint at /eval and the results are accessible
> over /results, then it should work? Unless the Jetty ContentHandler is not
> installed by default - then the linked-to code would def do :-)

> The only possible downside here is that, as far as consistent URI 
> space management is concerned, we'd have one part of it (the static 
> resources) controlled natively by Jetty and the rest by JAX-RS, so it 
> can be trickier to support searching the results or 
> enforcing common security rules (when/if needed).
> That said, maybe it is not a real concern; it can always be removed 
> in the future if needed.
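
The non-overlapping URI layout suggested above (/eval for dynamic code, /results for static files) can be sketched as follows. Note the substitution: to keep the example self-contained, this uses the JDK's built-in `com.sun.net.httpserver.HttpServer` as a stand-in for both embedded Jetty with ResourceHandler and the CXF servlet, so it shows the layout rather than the Jetty or CXF APIs. The class name `EvalServerSketch`, the placeholder response text and the loopback binding are all assumptions for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in: one server, two non-overlapping URI spaces.
class EvalServerSketch {

    static HttpServer start(Path resultsDir) throws IOException {
        Path root = resultsDir.toAbsolutePath().normalize();
        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);

        // Dynamic side: placeholder for what the JAX-RS endpoint would serve.
        server.createContext("/eval", ex -> {
            byte[] body = "dynamic eval endpoint".getBytes(StandardCharsets.UTF_8);
            try {
                ex.sendResponseHeaders(200, body.length);
                ex.getResponseBody().write(body);
            } finally {
                ex.close();
            }
        });

        // Static side: streams files from resultsDir without buffering them in memory.
        server.createContext("/results", ex -> {
            try {
                String rel = ex.getRequestURI().getPath().substring("/results".length());
                Path file = root.resolve(rel.replaceFirst("^/+", "")).normalize();
                // Refuse anything that escapes the results directory or is not a file.
                if (!file.startsWith(root) || !Files.isRegularFile(file)) {
                    ex.sendResponseHeaders(404, -1);
                    return;
                }
                ex.sendResponseHeaders(200, Files.size(file));
                Files.copy(file, ex.getResponseBody());
            } finally {
                ex.close();
            }
        });

        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(Paths.get(args.length > 0 ? args[0] : "."));
        System.out.println("Listening on port " + server.getAddress().getPort());
    }
}
```

Unlike Jetty's ResourceHandler, this stand-in does not produce directory listings, and as noted in the quoted response, searching and security rules would have to be handled separately for each URI space.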


Other options?




[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Matthias Krueger (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045219#comment-14045219
 ] 

Matthias Krueger commented on TIKA-1332:


It might be good to distinguish between the regression testing aspect of 
nightly runs and the "extraction gap discovery" aspect of running Tika against 
a large batch of previously untested docs.

For regression testing it would be good to generate stats on a run and compare 
them with the last known "good" stats. These stats could include:
* Number/distribution of detected mime types
* Number of exceptions thrown per type of exception
* Frequencies of metadata key-value pairs
* Frequencies of different word lengths extracted from content (per file type)
This could be run unsupervised with the delta to the last known "good" run 
summarized in a daily report.
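
The delta against the last known "good" run could be computed along these lines. This is only a sketch: the class `RunDelta`, its method and the sample counts are hypothetical and not part of Tika, and it assumes each run has already been reduced to a map of counts (for example, detected mime type to number of files, or exception type to number of occurrences).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika.
class RunDelta {

    /** Returns (current - baseline) for every key seen in either run, omitting unchanged keys. */
    static Map<String, Long> delta(Map<String, Long> baseline, Map<String, Long> current) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Long::sum);
        }
        for (Map.Entry<String, Long> e : baseline.entrySet()) {
            out.merge(e.getKey(), -e.getValue(), Long::sum);
        }
        // Keys whose counts did not change are noise in a daily report.
        out.values().removeIf(v -> v == 0L);
        return out;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Long> good = Map.of("application/pdf", 10L, "text/plain", 5L);
        Map<String, Long> today = Map.of("application/pdf", 12L, "text/plain", 5L, "image/png", 1L);
        System.out.println(delta(good, today));
    }
}
```

A key that disappears entirely shows up with a negative delta, which is often the more interesting signal for regression purposes.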

Deeper analysis of extracted metadata and content (as in 2 and 3 of Tim's 
cases) sounds more like "gap discovery" which I guess would always need some 
supervision.




[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044706#comment-14044706
 ] 

Hong-Thai Nguyen commented on TIKA-1332:


What you are describing is something like _functional_ tests for Tika. Tools 
such as Cucumber or FitNesse may help make the tests more readable, and we 
could obtain a report as output?



[jira] [Commented] (TIKA-1332) Create "eval" code

2014-06-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044682#comment-14044682
 ] 

Tim Allison commented on TIKA-1332:
---

To my mind, there are three families of things that can go wrong:

1) Parser can fail
1a) throw an exception
1b) hang forever

2) Fail to extract text and/or metadata from documents
2a) nothing is extracted
2b) some document components or attachments are not extracted: TIKA-1317 and 
TIKA-1228

3) Extract junk (mojibake, too many spaces in PDFs, failure to add spaces 
between runs in .docx, etc.), in which case there are two options:
  3a) We can do better.
  3b) We can't...the document is just plain broken.

We can easily count and compare 1).   By easily, I mean that I haven't fully 
worked it out, but it should be fairly straightforward.

Without a truth set or a comparison parser, we cannot easily measure 2a or 2b.  
For 2a, if there is no text, maybe there really is no text (image only pdfs or 
just a docx that contains images).  For 2b, we're really out of luck without 
other resources.
  
For 3), there's lots of room for work.  In short, I'd think we'd want to 
calculate how "languagey" the extracted text is.  Some indicators that occur to 
me:

 a) Type/token ratio or token entropy
 b) Average word length (with an exception for non-whitespace languages)
 c) Ratio of alphanumerics to total string length
 d) Analysis of language id confidence scores...if the string is long enough, 
you'd expect a langid component to return a very high score for the best 
language and then far lower scores for the 2nd and 3rd best languages.  If the 
langid component returns flat scores, then that might be an indicator that 
something didn't go well.  
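
Indicators a) through c) can be sketched as below. This is a rough illustration, not the tika-eval implementation: the class `TextStats` and its methods are hypothetical, and it assumes plain whitespace tokenization, which is exactly the case that breaks down for non-whitespace languages. Indicator d) needs a language identifier and is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the tika-eval implementation.
class TextStats {

    /** Lower-cases and splits on whitespace; unsuitable for languages written without spaces. */
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0]
                : trimmed.toLowerCase(Locale.ROOT).split("\\s+");
    }

    private static Map<String, Integer> counts(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    /** a) Distinct tokens divided by total tokens (0 for empty input). */
    static double typeTokenRatio(String[] tokens) {
        return tokens.length == 0 ? 0.0 : (double) counts(tokens).size() / tokens.length;
    }

    /** a) Shannon entropy of the token distribution, in bits (0 for empty input). */
    static double tokenEntropy(String[] tokens) {
        double entropy = 0.0;
        for (int c : counts(tokens).values()) {
            double p = (double) c / tokens.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    /** b) Mean token length in Unicode code points (0 for empty input). */
    static double averageTokenLength(String[] tokens) {
        if (tokens.length == 0) {
            return 0.0;
        }
        long total = 0;
        for (String t : tokens) {
            total += t.codePointCount(0, t.length());
        }
        return (double) total / tokens.length;
    }

    /** c) Share of code points that are letters or digits (0 for empty input). */
    static double alphanumericRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long alnum = text.codePoints().filter(Character::isLetterOrDigit).count();
        return (double) alnum / text.codePointCount(0, text.length());
    }
}
```

None of these numbers means much in isolation; they are more useful when compared across runs or against typical values for the same file type.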

What do you think?  Are there other things that can go wrong?  What else should 
we try to measure, whether in a supervised (not ideal), semi-supervised 
(better) or unsupervised (best) manner?
