Sweet! Please feel free to make any use that you can out of [0].
Yes, I’m storing results in a db as well (h2) and using that to dump reports
along the lines of [1]…note I’m using POI to generate xlsx files now ☺.
Is there any way we could collaborate on the eval code? My active dev (when I
have a chance) is on the TIKA-1302 branch of my Tika github fork. The goal is
to eventually contribute that as a tika-eval module.
If you wanted access to our vm, I’d be more than happy to grant access so we
can collaborate on the corpus and the eval stuff.
Oh, as for Common Crawl, as you already know, in addition to the incorrect mime
types, etc., one of the big things to be aware of is that they truncate their
files at 1MB, which is a big problem for file formats that tend to be bigger
than that. Are you pulling only non-truncated files?
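One way to filter those out: the WARC format used by Common Crawl defines a
"WARC-Truncated" header (with reasons such as "length") that marks records whose
payload was cut off. A minimal sketch of such a check, assuming you have the raw
header block of each record available (older crawls may truncate silently, so
comparing the payload size against any declared length is a useful second
check):

```python
def parse_headers(raw: str) -> dict:
    """Parse a raw WARC header block ("Name: value" lines) into a dict."""
    headers = {}
    for line in raw.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip()] = value.strip()
    return headers

def is_truncated(warc_headers: dict) -> bool:
    """True if the crawler marked this record as truncated."""
    return "WARC-Truncated" in warc_headers

# Two illustrative records: one complete, one cut off at the 1MB limit.
complete = parse_headers("WARC-Type: response\nContent-Length: 4096")
cut_off = parse_headers(
    "WARC-Type: response\nContent-Length: 1048576\nWARC-Truncated: length"
)

print(is_truncated(complete))  # False
print(is_truncated(cut_off))   # True
```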
Again, this is fantastic! What can we share/collaborate on?
Cheers,
Tim
[0]
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1]
https://issues.apache.org/jira/secure/attachment/12782054/reports_pdfbox_1_8_11-rc1.zip
From: Dominik Stadler [mailto:[email protected]]
Sent: Wednesday, January 13, 2016 2:09 PM
To: POI Developers List <[email protected]>
Subject: Using CommonCrawl for POI regression-mass-testing
Hi,
FYI, I am playing with CommonCrawl data for some talk that I plan to do in
2016. As part of this I built a small framework that lets me run the POI
integration-test framework on a large number of documents that I extracted from
a number of CommonCrawl runs. This is somewhat similar to what Tim is doing for
Tika, but it focuses on POI-related documents.
I tried to use this as a huge regression check; in this case I compared
release 3.13 and 3.14-beta1. In the future I can fairly easily run this
against newer versions to check for any new regressions.
Some statistics:
* Overall I processed 829356 POI-related documents
* 687506 documents processed fine in both versions!
* 140699 documents caused parsing errors in both versions. Many of these are
actually invalid documents, wrong file types, incorrect mime types, ... so the
actual error rate would be much lower, but it is currently not overly useful
to look at these errors without first sorting out all the false positives.
* 845 documents failed in POI 3.13 and now work in 3.14-beta1, so we made more
documents succeed now, yay!
* And finally, 306 documents failed in POI 3.14-beta1 while they processed
fine with POI 3.13.
However, these potential regressions have the following causes:
** approx. 280 of these were caused because we now do more checks for HSLF
** 19 were OOMs that happen in my framework with large documents due to
parallel processing
** One document fails date-parsing where I don't see how it could have worked
before; maybe this is also caused by the additional testing now
** 5 documents failed due to the new support for multi-part formats and locale
id
** One document showed an NPE in HSLFTextParagraph
So only the last two look like actual regressions; I will commit fixes together
with reproducing files for these two shortly.
I store the results in a database, so I can query them in various ways:
e.g., attached is the list of the top 100 exception messages for the failed
files.
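The "top 100 exception messages" report is essentially a GROUP BY over the
results table. A minimal sketch of the idea, using sqlite3 and invented table
and column names for illustration (the actual schema and database here are
Dominik's, not shown in this thread):

```python
import sqlite3

# In-memory database with a toy results table: one row per processed file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (file TEXT, version TEXT, exception TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("a.ppt", "3.14-beta1", "NullPointerException in HSLFTextParagraph"),
        ("b.ppt", "3.14-beta1", "NullPointerException in HSLFTextParagraph"),
        ("c.doc", "3.14-beta1", "Unable to parse date"),
        ("d.xls", "3.14-beta1", None),  # processed fine, no exception
    ],
)

# Top exception messages across all failed files, most frequent first.
top = conn.execute(
    """SELECT exception, COUNT(*) AS n
       FROM results
       WHERE exception IS NOT NULL
       GROUP BY exception
       ORDER BY n DESC
       LIMIT 100"""
).fetchall()

for message, count in top:
    print(count, message)
```

The same query works unchanged against most SQL databases (H2 included), and
adding a `version` filter to the WHERE clause gives the per-release breakdown.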
Let me know if you would like to get a full stacktrace and document for any of
those or if you have suggestions for additional queries/checks that we could
add here!
Dominik.