Sweet!  Please feel free to make any use that you can out of [0].

Yes, I’m storing results in a db as well (H2) and using that to dump reports 
along the lines of [1]…note I’m using POI to generate xlsx files now ☺.
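In case it’s useful, here’s roughly what my dump loop looks like. It’s a 
minimal sketch only; the H2 URL and the results table/columns are made up 
for illustration:

    import java.io.FileOutputStream;
    import java.sql.*;
    import org.apache.poi.xssf.usermodel.*;

    public class ReportDumper {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./eval_results");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT file_path, exception_msg FROM results WHERE status = 'FAILED'");
                 XSSFWorkbook wb = new XSSFWorkbook()) {

                XSSFSheet sheet = wb.createSheet("failures");
                int rowNum = 0;
                while (rs.next()) {
                    XSSFRow row = sheet.createRow(rowNum++);
                    row.createCell(0).setCellValue(rs.getString("file_path"));
                    row.createCell(1).setCellValue(rs.getString("exception_msg"));
                }
                // write the report out as xlsx
                try (FileOutputStream out = new FileOutputStream("report.xlsx")) {
                    wb.write(out);
                }
            }
        }
    }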

Is there any way we could collaborate on the eval code?  My active dev (when I 
have a chance) is on the TIKA-1302 branch of my Tika GitHub fork.  The goal is 
to eventually contribute that as a tika-eval module.

If you wanted access to our vm, I’d be more than happy to grant access so we 
can collaborate on the corpus and the eval stuff.

Oh, as for Common Crawl, as you already know, in addition to the incorrect mime 
types, etc., one of the big things to be aware of is that they truncate their 
files at 1MB, which is a big problem for file formats that tend to be bigger 
than that. Are you pulling only non-truncated files?
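For what it’s worth, one way to detect truncation is to look at the WARC 
record headers: per the WARC spec, records whose payload was cut short carry 
a WARC-Truncated header. A rough sketch of the idea (header handling is 
simplified, and it’s worth double-checking against the actual crawl data):

    import java.io.*;
    import java.util.zip.GZIPInputStream;

    public class TruncationCheck {

        // Reads one WARC header block (up to the first blank line) and reports
        // whether it carries a WARC-Truncated header, which per the WARC spec
        // marks records whose payload was cut short.
        public static boolean isTruncated(BufferedReader headers) throws IOException {
            String line;
            while ((line = headers.readLine()) != null && !line.isEmpty()) {
                if (line.regionMatches(true, 0, "WARC-Truncated:", 0, 15)) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) throws IOException {
            // demo only: checks the first record of a .warc.gz file
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(args[0])), "ISO-8859-1"))) {
                System.out.println("first record truncated: " + isTruncated(r));
            }
        }
    }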

Again, this is fantastic!  What can we share/collaborate on?

Cheers,

           Tim


[0] 
http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
[1] 
https://issues.apache.org/jira/secure/attachment/12782054/reports_pdfbox_1_8_11-rc1.zip

From: Dominik Stadler [mailto:[email protected]]
Sent: Wednesday, January 13, 2016 2:09 PM
To: POI Developers List <[email protected]>
Subject: Using CommonCrawl for POI regression-mass-testing

Hi,
FYI, I am playing with CommonCrawl data for a talk that I plan to give in 
2016. As part of this I built a small framework that lets me run the POI 
integration-test framework on a large number of documents that I extracted 
from a number of CommonCrawl runs. This is somewhat similar to what Tim is 
doing for Tika, but it focuses on POI-related documents.
I tried to use this as a huge regression check; in this case I compared 
releases 3.13 and 3.14-beta1. In the future I can fairly easily run this 
against newer versions to check for any new regressions.
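In essence, the per-file loop looks like the sketch below (not the actual 
framework, just the idea): force a full parse via ExtractorFactory, record 
the outcome per file, run that once per POI version on separate classpaths, 
and diff the two result files afterwards.

    import java.io.File;
    import java.io.PrintWriter;
    import org.apache.poi.POITextExtractor;
    import org.apache.poi.extractor.ExtractorFactory;

    public class MassTest {
        public static void main(String[] args) throws Exception {
            // assumes args[0] is a directory of test files
            File[] files = new File(args[0]).listFiles();
            try (PrintWriter out = new PrintWriter(args[1], "UTF-8")) {
                for (File f : files) {
                    String status;
                    try (POITextExtractor extractor = ExtractorFactory.createExtractor(f)) {
                        extractor.getText();   // force a full parse
                        status = "OK";
                    } catch (Throwable e) {    // record Errors too, not only Exceptions
                        status = "FAIL," + e.getClass().getName();
                    }
                    out.println(f.getName() + "," + status);
                }
            }
        }
    }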


Some statistics:
* Overall I processed 829356 POI-related documents

* 687506 documents processed fine in both versions!
* 140699 documents caused parsing errors in both versions. Many of these are 
actually invalid documents, wrong file types, incorrect mime types, ... so the 
actual error rate would be much lower, but it is currently not overly useful 
to look at these errors without first sorting out all the false positives (a 
cheap pre-filter for these is sketched below).

* 845 documents failed in POI 3.13 and now work in 3.14-beta1, so we made more 
documents succeed now, yay!

* And finally, 306 documents failed in POI 3.14-beta1 while they processed 
fine with POI 3.13.


However, these potential regressions have the following causes:

** approx. 280 of these were caused by the additional checks we now do for HSLF
** 19 were OOMs that happen in my framework with large documents due to 
parallel processing
** One document fails date parsing; I don't see how it worked before, so maybe 
this is also caused by the additional checks
** 5 documents failed due to the new support for multi-part formats and locale 
IDs
** One document showed an NPE in HSLFTextParagraph

So only the last two look like actual regressions; I will commit fixes together 
with reproducing files for these two shortly.
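Regarding the false positives mentioned above: a cheap pre-filter is to sniff 
the container magic bytes before even handing a file to POI, since both OLE2 
and OOXML documents start with well-known signatures. A rough sketch:

    import java.io.*;
    import java.util.Arrays;

    public class MagicFilter {
        // standard signatures: the 8-byte OLE2/CFB header and the 4-byte zip
        // local-file header that OOXML packages start with
        private static final byte[] OLE2 = {(byte)0xD0, (byte)0xCF, 0x11, (byte)0xE0,
                                            (byte)0xA1, (byte)0xB1, 0x1A, (byte)0xE1};
        private static final byte[] ZIP  = {'P', 'K', 0x03, 0x04};

        // true if the file at least starts like an OLE2 or zip/OOXML container
        public static boolean looksLikePoiFormat(File f) throws IOException {
            byte[] head = new byte[8];
            try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
                in.readFully(head);
            } catch (EOFException e) {
                return false;   // shorter than 8 bytes, certainly not valid
            }
            return Arrays.equals(head, OLE2)
                || Arrays.equals(Arrays.copyOf(head, 4), ZIP);
        }
    }

Of course a matching signature does not make a document valid, it only weeds 
out the grossly mislabeled files.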

I store the results in a database, so I can query them in various ways:

E.g., attached is the list of the top 100 exception messages for the failed 
files.
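That list comes from a simple aggregation query; a sketch of the idea, with 
the database URL and schema made up for illustration (I use an embedded H2 
URL here just as an example):

    import java.sql.*;

    public class TopExceptions {
        public static void main(String[] args) throws Exception {
            // hypothetical schema: results(file_path, exception_msg)
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./masstest");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT exception_msg, COUNT(*) AS cnt FROM results " +
                     "WHERE exception_msg IS NOT NULL " +
                     "GROUP BY exception_msg ORDER BY cnt DESC LIMIT 100")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("cnt") + "\t" + rs.getString("exception_msg"));
                }
            }
        }
    }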

Let me know if you would like to get a full stacktrace and document for any of 
those or if you have suggestions for additional queries/checks that we could 
add here!

Dominik.
