Re: How to search for / extract text of form field

2024-03-24 Thread Gilad Denneboom
Flatten the form fields before searching the file if you want PDFTextStripper to find the text in them.
On Thu, Mar 21, 2024 at 12:10 PM Paul Grütter wrote:
> Hello list,
>
> I want to search for words in a PDF document and get their positions. It
> seems that PDFBox ignores text which has

[ANNOUNCE] Apache PDFBox 2.0.31 released

2024-03-24 Thread Andreas Lehmkühler
The Apache PDFBox community is pleased to announce the release of Apache PDFBox version 2.0.31.
The release is available for download at: https://pdfbox.apache.org/download.html
See the full release notes below for details about this release.
Release Notes -- Apache PDFBox -- Version 2.0.31

Re: Type 0 font - Text extraction X PDF Debugger

2024-03-24 Thread Tilman Hausherr
Here they are, remove the XXX:
https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D
The

Re: Type 0 font - Text extraction X PDF Debugger

2024-03-24 Thread Andreas Lehmkühler
On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And some of the other tools you tried also mention this.) Just adding true results in text extraction of several files no longer being correct, 433525-p1.pdf