Re: PDFBox regression tests?

Andreas Lehmkuehler Wed, 29 Jul 2020 08:33:29 -0700

On 28.07.20 at 23:51, Tim Allison wrote:

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz


Looks like extraction improved slightly.  I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.

There are some new PDFBox related exceptions. 3 of them are already in 2.0.21 (the AES and the CodespaceRange issue). I'm investigating the others.

I was able to turn back on our unit test that counted characters and
non-unicode mapped characters.

I'll look a bit tomorrow, but this looks good to me.

@Tim thanks again for running those tests. I've stumbled upon one minor glitch in your reports. There are two sheets about parse time. The overall report parse_time_millis_by_mime_compared.xlsx states that the PDF parsing time has decreased to 88%, but when I look at the details report parse_time_millis_details.xlsx, it looks like the parsing time has increased.


Am I mistaken or is there a glitch in your report (swapped columns)?


Again, many thanks to Maruan!  The processing speeds were, um, much, much
faster.

Best,

        Tim

On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <[email protected]>
wrote:

Yes, please

Thanks in advance!

On 28.07.20 at 12:45, Tim Allison wrote:

Y. I can run these today

On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <[email protected]>
wrote:

Hi,

is there any chance to run the PDFBox regression tests (2.0.20 vs.
SNAPSHOT) on our new box? Does anyone have the cycles to prepare
something ready to start?

If not, is there anything I can do to help? I'm planning to cut a new
PDFBox release soon.

Cheers
Andreas


