Hi,

FYI, I am playing with CommonCrawl data for a talk that I plan to give in
2016. As part of this I built a small framework that lets me run the POI
integration-test framework on a large number of documents that I extracted
from a number of CommonCrawl runs. This is somewhat similar to what Tim is
doing for Tika, but it focuses on POI-related documents.

I used this as a huge regression check; in this case I compared
release 3.13 and 3.14-beta1. In the future I can fairly easily run this
against newer versions to check for any new regressions.


Some statistics:

* Overall I processed 829356 POI-related documents

* 687506 documents processed fine in both versions!

* 140699 documents caused parsing errors in both versions. Many of these
are actually invalid documents, wrong file types, incorrect mime-types, ...
so the actual error rate would be much lower, but it is currently not
overly useful to look at these errors without first sorting out all the
false positives.

* 845 documents failed in POI 3.13 and now work in 3.14-beta1, so more
documents succeed now, yay!

* And finally 306 documents failed in POI 3.14-beta1 while they processed
fine with POI 3.13.


However these potential regressions have the following causes:

** approx. 280 of these were caused because we do more checks for HSLF now
** 19 were OOMs that happen in my framework with large documents due to
parallel processing
** One document fails Date-parsing and I don't see how it worked
before, maybe this is also caused by more testing now
** 5 documents failed due to the new support for multi-part formats and
locale id
** One document showed an NPE in HSLFTextParagraph

So only the last two causes (6 documents) look like actual regressions. I
will commit fixes together with reproducing files for these two shortly.
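
For anyone who wants to do a similar comparison, the four buckets in the
statistics above can be sketched roughly as follows. This is not the code
of my framework, only a minimal illustration: it assumes each run yields a
mapping from document name to a success flag, and the names classify,
summarize and the sample file names are made up.

```python
from collections import Counter


def classify(ok_old: bool, ok_new: bool) -> str:
    """Put one document into one of the four buckets used in the statistics."""
    if ok_old and ok_new:
        return "ok_in_both"
    if not ok_old and not ok_new:
        return "failed_in_both"
    if ok_new:
        return "fixed"        # failed in the old version, works in the new one
    return "regression"       # worked in the old version, fails in the new one


def summarize(results_old: dict, results_new: dict) -> Counter:
    """Count buckets for all documents that were processed by both versions.

    results_old / results_new are hypothetical inputs: document name -> True
    if processing succeeded with that version.
    """
    common = results_old.keys() & results_new.keys()
    return Counter(classify(results_old[d], results_new[d]) for d in common)


# Made-up sample data, only to show the call
old = {"a.xls": True, "b.doc": False, "c.ppt": False, "d.xlsx": True}
new = {"a.xls": True, "b.doc": False, "c.ppt": True, "d.xlsx": False}
summary = summarize(old, new)
```

Only the "regression" bucket needs a manual look after each run, which is
what keeps such a large check manageable.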



I store the results in a database, so I can query them in
various ways:

E.g. attached is the list of top 100 exception-messages for the failed
files.
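
A query like this "top N exception messages" one could look roughly like
the sketch below. It uses an in-memory SQLite database and an assumed
schema; the table name results, its columns and the sample messages are
made up for illustration and do not reflect my actual database.

```python
import sqlite3


def top_exception_messages(conn: sqlite3.Connection, limit: int = 100):
    """Return (message, count) pairs for the most frequent exception messages."""
    return conn.execute(
        "SELECT exception_message, COUNT(*) AS n "
        "FROM results "
        "WHERE exception_message IS NOT NULL "
        "GROUP BY exception_message "
        "ORDER BY n DESC, exception_message "
        "LIMIT ?",
        (limit,),
    ).fetchall()


# Hypothetical schema: one row per processed document and POI version;
# exception_message is NULL when the document processed fine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (document TEXT, poi_version TEXT, exception_message TEXT)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("a.xls", "3.14-beta1", "Sample message A"),
        ("b.xls", "3.14-beta1", "Sample message A"),
        ("c.doc", "3.14-beta1", "Sample message A"),
        ("d.ppt", "3.14-beta1", "Sample message B"),
        ("e.ppt", "3.14-beta1", "Sample message B"),
        ("f.doc", "3.14-beta1", "Sample message C"),
        ("g.xlsx", "3.14-beta1", None),
    ],
)
top = top_exception_messages(conn, limit=2)
```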

Let me know if you would like to get a full stacktrace and document for any
of those or if you have suggestions for additional queries/checks that we
could add here!

Dominik.