Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Looks like extraction improved slightly. I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.
I was able to re-enable our unit test that counts characters and
non-Unicode-mapped characters.
I'll look a bit tomorrow, but this looks good to me.
Again, many thanks to Maruan! The processing speeds were, um, much, much
faster.
Best,
Tim
On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <[email protected]>
wrote:
> Yes, please
>
> Thanks in advance!
>
> Am 28.07.20 um 12:45 schrieb Tim Allison:
> > Y. I can run these today
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> >> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
> >> something ready to start?
> >>
> >> If not, is there anything I can do to help? I'm planning to cut a new
> >> PDFBox release soon.
> >>
> >> Cheers
> >> Andreas
> >>
> >
>
>