MMG created PDFBOX-5719:
---------------------------

             Summary: PDFbox fix 
                 Key: PDFBOX-5719
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5719
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 2.0.26
         Environment: OS: Ubuntu
Java: 16
            Reporter: MMG
         Attachments: Kommunikationsbedingungen-Einlagen_FIDOR-Bank.pdf

Hello,

I am running into the "No Unicode Mapping" warning shown in the PDFBox 
debugger. Similar to Apache DebugBar, I save the font glyphs to disk as images 
and then use an AI to detect the characters. My goal is to update the font's 
Unicode map based on the AI results and save the PDF.

Here is my main idea: save the glyphs that have no Unicode mapping to disk as 
images, send each image to the AI for detection, and then update the font's 
Unicode mapping with the results. I found a helpful example on Stack Overflow 
(link: 
[https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0]),
where the solution creates a COSStream to update the font's ToUnicode mapping. 
This approach seems suitable for my needs.
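
As a rough illustration of the update step, the ToUnicode CMap text that would 
go into such a stream could be generated along these lines. This is only a 
sketch: the class ToUnicodeCMapSketch and its methods are hypothetical names 
(not part of PDFBox), it uses plain Java only, and it assumes the font uses 
2-byte character codes (for example a Type0 font with Identity-H encoding). 
Writing the resulting text into the font's ToUnicode stream is left to the 
approach from the Stack Overflow answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a PDFBox class: builds ToUnicode CMap text from
// a map of character code -> Unicode string (e.g. filled from AI results).
// Assumption: character codes are 2 bytes wide (codespace <0000>..<FFFF>).
class ToUnicodeCMapSketch {

    // Formats a string as UTF-16BE hex digits, e.g. "A" -> "0041".
    static String utf16Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    static String build(Map<Integer, String> codeToUnicode) {
        // Sort by character code so the output is deterministic.
        List<Map.Entry<Integer, String>> entries =
                new ArrayList<>(new TreeMap<>(codeToUnicode).entrySet());

        StringBuilder sb = new StringBuilder();
        sb.append("/CIDInit /ProcSet findresource begin\n");
        sb.append("12 dict begin\n");
        sb.append("begincmap\n");
        sb.append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) "
                + "/Supplement 0 >> def\n");
        sb.append("/CMapName /Adobe-Identity-UCS def\n");
        sb.append("/CMapType 2 def\n");
        sb.append("1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n");

        // Emit bfchar blocks of at most 100 entries each, which is the
        // usual per-block limit in CMap files.
        for (int start = 0; start < entries.size(); start += 100) {
            int end = Math.min(start + 100, entries.size());
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(start, end)) {
                sb.append(String.format("<%04X> <%s>\n",
                        e.getKey(), utf16Hex(e.getValue())));
            }
            sb.append("endbfchar\n");
        }

        sb.append("endcmap\n");
        sb.append("CMapName currentdict /CMap defineresource pop\n");
        sb.append("end\nend\n");
        return sb.toString();
    }
}
```

For example, calling ToUnicodeCMapSketch.build with the entries 3 -> "A" and 
0x24 -> "ä" yields one bfchar block containing the lines <0003> <0041> and 
<0024> <00E4>.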

I want to retrieve the font's existing ToUnicode text, as shown in that 
question, edit it to fix the Unicode mapping, and then write it back to the 
font. However, I am unsure how to obtain the ToUnicode text view (similar to 
what the PDF debugger shows).
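
To make the "modify the text" step concrete, here is a minimal sketch of how 
the bfchar entries could be read once the ToUnicode stream has been decoded to 
text. It deliberately does not show how to get that text out of PDFBox, since 
that is the open question here. The class BfCharParserSketch is a hypothetical 
name, it uses plain Java only, and it assumes that only bfchar blocks matter 
(bfrange blocks are ignored) and that character codes fit in a positive int.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class: extracts the code -> Unicode
// entries from the bfchar blocks of ToUnicode CMap text.
// Assumption: bfrange blocks are not handled in this sketch.
class BfCharParserSketch {

    // Matches the body of each "beginbfchar ... endbfchar" block.
    private static final Pattern BLOCK =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);

    // Matches one "<code> <unicode hex>" entry inside a block.
    private static final Pattern ENTRY =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]*)>");

    static Map<Integer, String> parse(String cmapText) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher block = BLOCK.matcher(cmapText);
        while (block.find()) {
            Matcher entry = ENTRY.matcher(block.group(1));
            while (entry.find()) {
                int code = Integer.parseInt(entry.group(1), 16);
                result.put(code, decodeUtf16Hex(entry.group(2)));
            }
        }
        return result;
    }

    // Decodes UTF-16BE hex digits, e.g. "0041" -> "A".
    static String decodeUtf16Hex(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }
}
```

The returned map could then be merged with the AI detection results (for 
example with put for each newly recognized glyph code) before the CMap text is 
regenerated and written back to the font.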

Can anyone provide assistance on how to address this issue? Any help would be 
greatly appreciated.

A sample PDF file is attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)