[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870408#comment-15870408 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

Thank you for the feedback. I agree. Lucene is now downgraded to 5.x. Will wait for a clean build to resolve this time.

> Create "eval" code
> ------------------
>
>          Key: TIKA-1332
>          URL: https://issues.apache.org/jira/browse/TIKA-1332
>      Project: Tika
>   Issue Type: Sub-task
>   Components: cli, general, server
>     Reporter: Tim Allison
>     Assignee: Tim Allison
>      Fix For: 2.0, 1.15
>
>  Attachments: comparison_reports.xml
>
>
> For this issue, we can start with code to gather statistics on each run (# of
> exceptions per file type, most common exceptions per file type, number of
> metadata items, total text extracted, etc). We should also be able to
> compare one run against another. Going forward, there's plenty of room to
> improve.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870398#comment-15870398 ]

Nick Burch commented on TIKA-1332:
----------------------------------

Unless we really need a Lucene 6 feature, for now to avoid surprises / confusion, I'd suggest rolling back to Lucene 5.x.
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870392#comment-15870392 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

Rolled back to Lucene 5.5.3 for now.
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870351#comment-15870351 ]

Hudson commented on TIKA-1332:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika-trunk #1198 (See [https://builds.apache.org/job/Tika-trunk/1198/])
TIKA-1332 -- initial commit for tika-eval module. More work remains. (tallison: rev aa7a0c353362d56cb1b8e77297f0807626b0246c)
* (add) tika-eval/src/test/java/org/apache/tika/eval/util/MimeUtilTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/ContrastStatistics.java
* (add) tika-eval/src/test/resources/single-file-profiler-crawl-input-config.xml
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file3_attachBNotA.doc
* (add) tika-eval/src/test/resources/log4j.properties
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file2_attachANotB.doc.json
* (add) tika-eval/pom.xml
* (add) tika-eval/src/test/resources/test-dirs/extractsB/file11_oom.txt.json
* (add) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenStatistics.java
* (add) tika-eval/src/test/resources/test-dirs/extractsB/file3_attachBNotA.doc.json
* (add) tika-eval/src/main/java/org/apache/tika/eval/batch/DBConsumersManager.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/AbstractDBBuffer.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/ColInfo.java
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file5_emptyA.pdf
* (add) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file8_IOEx.pdf
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file1.pdf.json
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/H2Util.java
* (add) tika-eval/src/test/resources/test-dirs/extractsB/file13_attachANotB.doc.txt
* (add) tika-eval/src/test/java/org/apache/tika/eval/reports/ResultsReporterTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/ExtractComparer.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CJKBigramAwareLengthFilterFactory.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCountPriorityQueue.java
* (add) tika-eval/src/test/resources/commontokens/zh-cn
* (add) tika-eval/src/main/resources/tika-eval-comparison-config.xml
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenCounter.java
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file12_es.txt.json
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file10_permahang.txt.json
* (add) tika-eval/src/main/java/org/apache/tika/eval/io/ExtractReader.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogReader.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/reports/XLSXHREFFormatter.java
* (add) tika-eval/src/main/resources/META-INF/services/org.apache.lucene.analysis.util.TokenFilterFactory
* (add) tika-eval/src/test/resources/single-file-profiler-crawl-extract-config.xml
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/DBUtil.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenIntPair.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/io/XMLLogMsgHandler.java
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file7_badJson.pdf.json
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file2_attachANotB.doc
* (add) tika-eval/src/test/java/org/apache/tika/eval/ProfilerBatchTest.java
* (add) tika-eval/src/test/resources/commontokens/en
* (add) tika-eval/src/test/resources/test-dirs/extractsA/file4_emptyB.pdf.json
* (add) tika-eval/src/test/java/org/apache/tika/eval/ComparerBatchTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerDeserializer.java
* (add) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (add) tika-eval/src/test/resources/test-dirs/batch-logs/batch-process-fatal.xml
* (add) tika-eval/src/main/resources/tika-eval-profiler-config.xml
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file1.pdf
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file11_oom.txt
* (add) tika-eval/src/main/java/org/apache/tika/eval/db/TableInfo.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/AnalyzerManager.java
* (add) tika-eval/src/test/resources/commontokens/zh-tw
* (add) tika-eval/src/test/resources/commontokens/es
* (add) tika-eval/src/test/resources/test-dirs/raw_input/file9_noextract.txt
* (add) tika-eval/src/main/java/org/apache/tika/eval/EvalFilePaths.java
* (add) tika-eval/src/test/java/org/apache/tika/eval/db/AbstractBufferTest.java
* (add) tika-eval/src/main/java/org/apache/tika/eval/tokens/TokenContraster.java
* (add) tika-eval/
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870321#comment-15870321 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

Thank you, [~gagravarr]!
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868028#comment-15868028 ]

Nick Burch commented on TIKA-1332:
----------------------------------

Apache Ignite seems to use H2, and a Google search for H2 + apache.org shows quite a few other projects with connectors to it, at least. That said, there's also Apache Derby, which might cover the same use case.
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867992#comment-15867992 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

Are there any licensing objections to adding a dependency in the tika-eval module for the H2 database? H2 is dual licensed under MPL 2.0 and EPL 1.0. These are both "weak copyleft" and should be ok if we document them according to https://www.apache.org/legal/resolved#category-b.

As a side note, this dependency will only exist for the tika-eval module, not for any of the other modules.
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861665#comment-15861665 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

Some more work is required, but I think tika-eval is getting close to being ready to commit. If anyone has a chance to review, code is on my [github fork|https://github.com/tballison/tika/tree/TIKA-1302] and the beginnings of wiki documentation are now up on our [wiki|https://wiki.apache.org/tika/TikaEval].

Thank you!
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230287#comment-15230287 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

I gave up on that, and we're now using httpd.

The eval code currently exists as command-line calls. I'm using h2 as the backend database, which appears to be compatible with ASL 2.0. As with all development cycles, I started with a flat file, moved to an unfortunately complex db structure and will probably have to move to nosql if we want this to scale...but not yet...

As above, there are two modes.

1) Profile a single run
   a) run tika-app on a directory of files, output with -J -t (JSON representation of List with text as the content)
   b) run the profiling code, which populates an h2 db
   c) run xml-configured reports against the db

2) Compare two runs
   a) run two versions of tika-app on a directory of files
   b) run the comparison code, which populates an h2 db
   c) run xml-configured reports against the db

I've pretty much given up on the notion of automatic testing. A human has to look at the reports and make sense of them.

Given the feedback I received at ApacheCon (egads, a year ago), I think I'd like to transition this code into Tika for 1.14. When the code is ready for review, I'll let y'all know.

Any and all feedback on the reports to date would be great.
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326488#comment-14326488 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

I got a simple Jetty ResourceHandler up and running on the vm today, but it kept failing on large archive files (~250MB). I set {{idleTimeout}} and {{stopTimeout}} to large values, but still had no luck.

Has anyone had luck with Jetty's ResourceHandler and large files? How about jax-rs for files of that size? I notice that govdocs1 is using httpd. Perhaps we'll want separate static/archive server ports vs. active jax-rs browsing?

I plan to start publishing static results of single runs and comparisons of runs over the next few weeks.
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222947#comment-14222947 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

In a personal communication, I asked [~sergey_beryozkin] for recommendations for handling static content in the jax-rs framework. For the UI component of the eval code (how the user interacts with the results of the eval): is there an easy equivalent in JAX-RS that allows the user to browse a directory of files and click on desired files for download as easily as one can with Jetty's ResourceHandler?

With permission, I'm posting/summarizing [~sergey_beryozkin]'s responses. If anyone else has a recommendation leveraging the JAX-RS framework for dynamic data and still using something as easy as Jetty's ResourceHandler for static content, please let us know.

Option 1: Handcode a JAX-RS handler that mimics Jetty's ResourceHandler

> That can be done easily enough with JAX-RS if you'd like to explore
> this path, something like this I guess:

{noformat}
import java.io.File;
import java.net.URI;
import java.util.LinkedList;
import java.util.List;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.UriInfo;

@Path("eval")
public class TikaEvaluation {

    @Context
    private UriInfo ui;

    @GET
    @Path("list")
    @Produces("text/html")
    public Response getListOfResultURIs() {
        List<URI> uris = new LinkedList<URI>();
        // getResultFiles() is not shown here; it returns the result files to list
        for (File f : getResultFiles()) {
            uris.add(ui.getAbsolutePathBuilder().path(f.getName()).build());
        }
        // uris list now has a list of links to individual files
        // next we need to decide how to convert that to HTML
        // one option is to return the list as is and redirect that to
        // JSP, another option is to build a basic HTML string right here in the
        // method, another option is to register a MessageBodyWriter that will
        // convert the list into HTML

        // the individual links will be managed by getFile() method
        return Response.ok(uris).build();
    }

    @GET
    @Path("list/{name}")
    @Produces({"application/json", "multipart/mixed"})
    public Response getFile(@PathParam("name") String name) {
        ...
    }
}
{noformat}

Option 2: Run Jetty's ResourceHandler from the same embedded Jetty server that is hosting the JAX-RS code.

> This link would probably be the best one: [link|https://git-wip-us.apache.org/repos/asf?p=cxf.git;a=blob_plain;f=distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Server.java;hb=HEAD]
> Tika JAX-RS server actually runs on top of Jetty right now too, but in
> this case we have a direct Jetty server setup.
>
> The server registers a CXF servlet and Jetty handlers too. The CXF servlet
> also redirects to default handlers, like a default handler for serving the
> static content. This is not needed if the result files are accessible
> over a URI that does not overlap with a CXF servlet URI pattern.
> In fact, I wonder if a Tika JAX-RS style of registration may also do?
> If you register a CXF endpoint at /eval and the results are accessible
> over /results then it should work? Unless Jetty ContentHandler is not
> installed by default - then the linked-to code would def do :-)
> The only possible downside here is that, as far as consistent URI
> space management is concerned, we'd have one part of it (the static
> resources) controlled natively by Jetty and the rest by JAX-RS, so it
> can be trickier to provide support for searching the results or
> enforcing common security rules (when/if needed).
> That said, maybe it is not a real concern; it can always be removed
> in the future if needed.

Other options?
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045219#comment-14045219 ]

Matthias Krueger commented on TIKA-1332:
----------------------------------------

It might be good to distinguish between the regression testing aspect of nightly runs and the "extraction gap discovery" aspect of running Tika against a large batch of previously untested docs.

For regression testing it would be good to generate stats on a run and compare them with the last known "good" stats. These stats could include:
* Number/distribution of detected mime types
* Number of exceptions thrown per type of exception
* Frequencies of metadata key-value pairs
* Frequencies of different word lengths extracted from content (per file type)

This could be run unsupervised, with the delta to the last known "good" run summarized in a daily report.

Deeper analysis of extracted metadata and content (as in 2 and 3 of Tim's cases) sounds more like "gap discovery", which I guess would always need some supervision.
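As a rough illustration of the "delta to the last known good run" idea above, the sketch below diffs two maps of per-key counts (for example, exceptions per mime type). The class name `RunDelta`, the method `delta`, and the sample counts are all hypothetical; this is not code from tika-eval, only a minimal sketch assuming each run has already been reduced to a map of counts.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of Tika): diffs per-key counts from two runs. */
public class RunDelta {

    /**
     * Returns (current - baseline) for every key seen in either run.
     * Keys whose count did not change are omitted from the result.
     */
    public static Map<String, Integer> delta(Map<String, Integer> baseline,
                                             Map<String, Integer> current) {
        Map<String, Integer> out = new TreeMap<>();
        // Keys present in the current run: new keys or changed counts.
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            int diff = e.getValue() - baseline.getOrDefault(e.getKey(), 0);
            if (diff != 0) {
                out.put(e.getKey(), diff);
            }
        }
        // Keys that disappeared entirely since the baseline run.
        for (Map.Entry<String, Integer> e : baseline.entrySet()) {
            if (!current.containsKey(e.getKey()) && e.getValue() != 0) {
                out.put(e.getKey(), -e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up exception counts per mime type, for illustration only.
        Map<String, Integer> lastGood = new TreeMap<>();
        lastGood.put("application/pdf", 12);
        lastGood.put("application/msword", 3);

        Map<String, Integer> tonight = new TreeMap<>();
        tonight.put("application/pdf", 15);
        tonight.put("text/html", 1);

        System.out.println(delta(lastGood, tonight));
    }
}
```

An empty result would mean the two runs agree on every count, which is the condition an unsupervised nightly report could check before flagging anything for a human.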
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044706#comment-14044706 ]

Hong-Thai Nguyen commented on TIKA-1332:
----------------------------------------

What you are describing is something like _functional_ tests for Tika. Tools such as Cucumber or FitNesse may help make the tests more readable, and we could obtain a report as output?
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044682#comment-14044682 ]

Tim Allison commented on TIKA-1332:
-----------------------------------

To my mind, there are three families of things that can go wrong:

1) Parser can fail
   1a) throw an exception
   1b) hang forever
2) Fail to extract text and/or metadata from documents
   2a) nothing is extracted
   2b) some document components or attachments are not extracted: TIKA-1317 and TIKA-1228
3) Extract junk (mojibake, too many spaces in pdfs, fail to add space between runs in .docx, etc), in which case there are two options:
   3a) We can do better.
   3b) We can't...the document is just plain broken.

We can easily count and compare 1). By easily, I mean that I haven't fully worked it out, but it should be fairly straightforward.

Without a truth set or a comparison parser, we cannot easily measure 2a or 2b. For 2a, if there is no text, maybe there really is no text (image-only pdfs or just a docx that contains images). For 2b, we're really out of luck without other resources.

For 3), there's lots of room for work. In short, I'd think we'd want to calculate how "languagey" the extracted text is. Some indicators that occur to me:
a) Type/token ratio or token entropy
b) Average word length (with an exception for non-whitespace languages)
c) Ratio of alphanumerics to total string length
d) Analysis of language id confidence scores...if the string is long enough, you'd expect a langid component to return a very high score for the best language and then far lower scores for the 2nd and 3rd best languages. If the langid component returns flat scores, then that might be an indicator that something didn't go well.

What do you think? Are there other things that can go wrong? What else should we try to measure, in a supervised (not ideal) or semi-supervised (better) or unsupervised (best) way?
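To make indicators a) to c) above a little more concrete, here is a minimal sketch of how they might be computed with the Java standard library only. The class name `TextStats` and its methods are hypothetical, and the whitespace tokenization is an assumption made for brevity; it will not behave sensibly for non-whitespace languages, and this is not the tika-eval implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not the tika-eval implementation): crude text indicators. */
public class TextStats {

    /** Splits on runs of whitespace (an assumption; real code would use an analyzer). */
    static String[] tokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Counts how often each distinct token occurs. */
    static Map<String, Integer> counts(String[] toks) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : toks) {
            m.merge(t, 1, Integer::sum);
        }
        return m;
    }

    /** Distinct tokens divided by total tokens; 0 for empty input. */
    public static double typeTokenRatio(String text) {
        String[] toks = tokens(text);
        return toks.length == 0 ? 0.0 : (double) counts(toks).size() / toks.length;
    }

    /** Shannon entropy, in bits, of the token frequency distribution. */
    public static double tokenEntropy(String text) {
        String[] toks = tokens(text);
        double h = 0.0;
        for (int c : counts(toks).values()) {
            double p = (double) c / toks.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Mean token length in UTF-16 code units; 0 for empty input. */
    public static double averageWordLength(String text) {
        String[] toks = tokens(text);
        if (toks.length == 0) {
            return 0.0;
        }
        long total = 0;
        for (String t : toks) {
            total += t.length();
        }
        return (double) total / toks.length;
    }

    /** Fraction of UTF-16 code units that are letters or digits; 0 for empty input. */
    public static double alphanumericRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int alnum = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                alnum++;
            }
        }
        return (double) alnum / text.length();
    }

    public static void main(String[] args) {
        String sample = "the quick brown fox jumps over the lazy dog";
        System.out.println("type/token ratio: " + typeTokenRatio(sample));
        System.out.println("token entropy:    " + tokenEntropy(sample));
        System.out.println("avg word length:  " + averageWordLength(sample));
        System.out.println("alnum ratio:      " + alphanumericRatio(sample));
    }
}
```

None of these numbers means much in isolation; the idea in the comment above is to compare them across runs or against what is typical for a file type, so thresholds are deliberately left out of the sketch.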