On 28.07.20 at 23:51, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Looks like extraction improved slightly. I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.
There are some new PDFBox-related exceptions. 3 of them are already in 2.0.21
(the AES and the CodespaceRange issue). I'm investigating the others.
I was able to turn back on our unit test that counted characters and
non-unicode mapped characters.
I'll look a bit tomorrow, but this looks good to me.
@Tim thanks again for running those tests. I've stumbled upon one minor glitch
in your reports. There are two sheets about parse time. The overall report
parse_time_millis_by_mime_compared.xlsx states that the pdf parsing time has
decreased to 88%, but if I have a look at the details report
parse_time_millis_details.xlsx, it looks like the parsing time increased.
Am I mistaken or is there a glitch in your report (swapped columns)?
Again, many thanks to Maruan! The processing speeds were, um, much, much
faster.
Best,
Tim
On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <[email protected]>
wrote:
Yes, please
Thanks in advance!
On 28.07.20 at 12:45, Tim Allison wrote:
Y. I can run these today
On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <[email protected]>
wrote:
Hi,
is there any chance to run the PDFBox regression tests (2.0.20 vs. SNAPSHOT) on
our new box? Does anyone have the cycles to prepare something ready to start?
If not, is there anything I can do to help? I'm planning to cut a new PDFBox
release soon.
Cheers
Andreas