MMG created PDFBOX-5719:
---------------------------
Summary: PDFbox fix
Key: PDFBOX-5719
URL: https://issues.apache.org/jira/browse/PDFBOX-5719
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 2.0.26
Environment: OS: Ubuntu
Java: 16
Reporter: MMG
Attachments: Kommunikationsbedingungen-Einlagen_FIDOR-Bank.pdf
Hello,
I am experiencing an issue related to the "No Unicode Mapping" warning in the
PDFBox debugger. Similar to Apache DebugBar, I am saving font glyphs to disk
and then using an AI to detect the characters. My objective is to update the
font Unicode map based on the AI results and save the PDF.
Here is my plan: save an image of each glyph that has no Unicode mapping to
disk, send each image to the AI for detection, and then update the font's
Unicode mapping with the results. I found a helpful example on Stack Overflow
(https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0),
where the answer creates a COSStream to replace the font's ToUnicode mapping.
This approach seems suitable for my needs.
I want to retrieve the existing ToUnicode text as shown in that question, edit
it to add the missing mappings, and then update the font. However, I am unsure
how to obtain the ToUnicode text (the text view that the PDF debugger shows).
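To make the editing step concrete, here is a minimal sketch of what I have in
mind once the ToUnicode text is available as a string. It only covers the text
manipulation; reading the text from the font and writing it back to the PDF is
the part I am asking about and is not shown. The class and method names
(ToUnicodePatcher, addMappings, utf16Hex) are hypothetical and not part of
PDFBox, and the sketch assumes 2-byte character codes, as is typical for
Identity-H Type 0 fonts.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a PDFBox class: patches ToUnicode CMap text.
class ToUnicodePatcher {

    // Inserts new bfchar sections just before the final "endcmap" keyword.
    // Assumption: character codes are 2 bytes wide and are not mapped yet.
    static String addMappings(String cmap, Map<Integer, String> codeToText) {
        int end = cmap.lastIndexOf("endcmap");
        if (end < 0) {
            throw new IllegalArgumentException("no endcmap keyword found");
        }
        // Sort by character code so the output is deterministic.
        List<Map.Entry<Integer, String>> entries =
                new ArrayList<>(new TreeMap<>(codeToText).entrySet());
        StringBuilder block = new StringBuilder();
        // A single bfchar section may hold at most 100 entries.
        for (int i = 0; i < entries.size(); i += 100) {
            int n = Math.min(100, entries.size() - i);
            block.append(n).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> e : entries.subList(i, i + n)) {
                block.append(String.format("<%04X> <%s>\n",
                        e.getKey(), utf16Hex(e.getValue())));
            }
            block.append("endbfchar\n");
        }
        return cmap.substring(0, end) + block + cmap.substring(end);
    }

    // Encodes the detected text as UTF-16BE hex, the form used in bfchar entries.
    static String utf16Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Made-up CMap fragment and mapping, for illustration only.
        String cmap = "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\nendcmap\n";
        Map<Integer, String> detected = new HashMap<>();
        detected.put(0x0024, "A"); // character code 0x0024 recognised as "A"
        System.out.println(addMappings(cmap, detected));
    }
}
```

The map passed to addMappings would come from the AI results (character code
to recognised text). Codes that already have an entry in the existing CMap
would need to be edited in place instead, which this sketch does not handle.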
Can anyone provide assistance on how to address this issue? Any help would be
greatly appreciated.
A sample PDF file is attached.