Hi,

The problem is in the ToUnicode stream: the log shows "Invalid ToUnicode CMap in font AvenirNextLTPro-Cn", and the CMap contains no Unicode mappings. PDFBox then tries a fallback, which turns out to be wrong for this file. This is related to PDFBOX-5540 and earlier issues of the same kind.

Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:
Hi Tilman!

     Thank you very much for your attention!

     You can find the file "p4_alt.pdf" in this folder
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and other
tools.

     I'm sorry: I sent the PDF file as an attachment in my first message
because I didn't know that attachments wouldn't work.



On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de>
wrote:

Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:
Hi everyone,

     I'm not sure if this is the same issue as the FAQ entry "How come I am
getting gibberish(G38G43G36G51G5) when extracting text?"...

     I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
(build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).

     I'm trying to understand how this PDF chunk (from p4_fix.pdf
attached)
   BT
   /G1F7 6.0 Tf
   94.871 773.806 Td
   <004200430044> Tj
   ET

     becomes "BCD" in PDFBox Debugger (and likewise in qpdfview, Adobe
Reader, Chrome, ...) but becomes "abc" in the PDFBox text extraction tool.
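
     For reference, the operand of Tj above can be decoded by hand. This is
only a minimal sketch: it assumes the font uses two-byte character codes
(the four-hex-digit groups suggest this, but the thread does not confirm
it), and the class and method names are made up for illustration; they are
not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

class TjHexDemo {
    // Split the body of a PDF hex string (without the angle brackets)
    // into character codes, ASSUMING two bytes (four hex digits) per code.
    static List<Integer> twoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a multiple of 4 hex digits");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Read each code directly as a Unicode code point. This is NOT what
    // the PDF specification prescribes; it only shows that doing so
    // happens to reproduce the text that the viewers display.
    static String asCodePoints(List<Integer> codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = twoByteCodes("004200430044");
        System.out.println(codes);               // [66, 67, 68]
        System.out.println(asCodePoints(codes)); // BCD
    }
}
```

     So the codes are 0x0042, 0x0043 and 0x0044. Which text they stand for
depends on the font's mapping data, not on the codes themselves.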

     Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.

     The renderers that let me copy text also give me "BCD".

     It seems that the PDFBox extraction tool follows section "9.10.2
Mapping character codes to Unicode values" (ISO 32000-2:2020), but all
the others take a different approach.
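
     As I understand that section, a ToUnicode CMap is consulted first and
other sources are tried only when it gives no answer. Below is a rough
sketch of that order, not PDFBox's actual code: the map contents, the "?"
placeholder fallback and all names are hypothetical.

```java
import java.util.Map;
import java.util.function.IntFunction;

class ToUnicodeLookupDemo {
    // Look each code up in the ToUnicode mapping first; only when there is
    // no entry, ask the fallback. Both arguments are stand-ins for whatever
    // a real extractor derives from the font.
    static String extract(int[] codes,
                          Map<Integer, String> toUnicode,
                          IntFunction<String> fallback) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : fallback.apply(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x0042, 0x0043, 0x0044};

        // Hypothetical fallback: a placeholder, so fallback use is visible.
        IntFunction<String> placeholder = code -> "?";

        // Hypothetical usable ToUnicode mapping for the three codes.
        Map<Integer, String> usable = Map.of(0x0042, "B", 0x0043, "C", 0x0044, "D");

        System.out.println(extract(codes, usable, placeholder));   // BCD
        System.out.println(extract(codes, Map.of(), placeholder)); // ???
    }
}
```

     With an empty mapping, every character comes from the fallback, so the
extracted text is only as good as the fallback's guess.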

      Could you help me understand whether the problem is with the
PDF file, with the renderers, or with the text extraction tool?

Thank you!



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

