On 01.04.2023 11:41, Tilman Hausherr wrote:
On 30.03.2023 16:27, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-SNAPSHOT.tgz

Thank you Tim!

What I see is

1) Text missing in TOP_10_MORE_IN_B; some or all of these might be related to the issue that Andreas reopened

2) Different Arabic text => PDFBOX-4531; hopefully these are improvements

3) Misc. improvements; I'll add two of them to my own extraction regression tests

Tilman

Also some improved ligature text extraction; this might also be related to the PDFBOX-4531 changes. It can be seen in govdocs file 433525.pdf, on the first page, in "Neutron radiation offers" (the "ff" now appears correctly).

Tilman


