[jira] [Created] (PDFBOX-5875) using font data to process ligatures

Manish S N (Jira) Fri, 30 Aug 2024 06:26:37 -0700

Manish S N created PDFBOX-5875:
----------------------------------

             Summary: using font data to process ligatures
                 Key: PDFBOX-5875
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5875
             Project: PDFBox
          Issue Type: New Feature
          Components: Parsing, PDModel, Text extraction
    Affects Versions: 3.0.3 PDFBox
            Reporter: Manish S N
             Fix For: 3.0.4 PDFBox
         Attachments: page.pdf

Process ligatures in Asian-language text (where a single glyph represents a
combination of two or more Unicode characters) using the data in embedded fonts.

*The problem:*

Modern PDF creators currently put these ligatures in the /ActualText field,
which we only recently considered supporting in this issue. That is not the
case in old PDFs with embedded CID fonts like [^page.pdf], where the ligature
glyphs lack a /ToUnicode mapping because no single Unicode code point exists
for them: each one is a combination of more than one Unicode character.

*The Potential Solution (if not perfect):*

I managed to extract the font files using PDFBox
([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java]),
and when I viewed the font files in FontForge I found the ligature data intact
in them. We could use this data to map each ligature glyph to the Unicode
characters of its constituent glyphs.
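
To illustrate the idea, here is a minimal sketch of the mapping step only. It
assumes the ligature data and the cmap have already been read from the embedded
font into plain maps. The class LigatureUnicodeResolver, its fields and the
glyph IDs are hypothetical and are not PDFBox or FontBox API.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not part of PDFBox): maps a glyph ID to its Unicode text. */
final class LigatureUnicodeResolver {

    // Guards against cyclic or pathologically nested ligature definitions.
    private static final int MAX_DEPTH = 8;

    // Ligature glyph ID -> ordered component glyph IDs (assumed to come from the font's ligature data).
    private final Map<Integer, List<Integer>> ligatureComponents;
    // Glyph ID -> Unicode code point (assumed to be the inverse of the font's cmap).
    private final Map<Integer, Integer> glyphToCodePoint;

    public LigatureUnicodeResolver(Map<Integer, List<Integer>> ligatureComponents,
                                   Map<Integer, Integer> glyphToCodePoint) {
        this.ligatureComponents = ligatureComponents;
        this.glyphToCodePoint = glyphToCodePoint;
    }

    /** Returns the text for a glyph, or empty if any constituent glyph cannot be mapped. */
    public Optional<String> resolve(int glyphId) {
        return resolve(glyphId, 0);
    }

    private Optional<String> resolve(int glyphId, int depth) {
        // A direct cmap entry wins: the glyph is an ordinary character.
        Integer codePoint = glyphToCodePoint.get(glyphId);
        if (codePoint != null) {
            return Optional.of(new String(Character.toChars(codePoint)));
        }
        // Otherwise try to expand the glyph as a ligature of other glyphs.
        List<Integer> components = ligatureComponents.get(glyphId);
        if (components == null || depth >= MAX_DEPTH) {
            return Optional.empty();
        }
        StringBuilder text = new StringBuilder();
        for (int component : components) {
            Optional<String> part = resolve(component, depth + 1);
            if (!part.isPresent()) {
                return Optional.empty();
            }
            text.append(part.get());
        }
        return Optional.of(text.toString());
    }
}
```

With a cmap containing glyph 10 for U+0B95 and glyph 20 for U+0BC1, and a
ligature entry 300 -> [10, 20], resolve(300) would yield the two-character
Tamil string U+0B95 U+0BC1. Reading the real ligature data out of the font is
not covered by this sketch.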

*Problems:*

In some cases the constituent glyphs may not be present in the cmap at all,
having been removed by a PDF optimiser because they are never used directly in
the PDF, only within ligatures. Such glyphs are empty, with only a glyph ID and
no /ToUnicode mapping, even if the glyph has a corresponding Unicode character.
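
One possible way to degrade gracefully in that case is to emit the Unicode
replacement character U+FFFD for every constituent glyph that has no mapping,
so the rest of the ligature still comes through. This is only a sketch under
that assumption. LenientLigatureText and its parameters are hypothetical names,
not PDFBox API.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): best-effort text for a ligature's components. */
final class LenientLigatureText {

    // U+FFFD REPLACEMENT CHARACTER marks a component glyph with no known code point.
    static final int REPLACEMENT = 0xFFFD;

    /**
     * Builds text from the ordered component glyph IDs of one ligature.
     * Components missing from the glyph-to-code-point map become U+FFFD.
     */
    public static String toText(List<Integer> componentGlyphIds,
                                Map<Integer, Integer> glyphToCodePoint) {
        StringBuilder text = new StringBuilder();
        for (int glyphId : componentGlyphIds) {
            text.appendCodePoint(glyphToCodePoint.getOrDefault(glyphId, REPLACEMENT));
        }
        return text.toString();
    }
}
```

A later pass, such as a spell checker, could then look for U+FFFD in the
extracted text and try to repair those positions.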

*The Hope:*

This is not a common problem in large PDFs, and basic spell checkers could
easily rectify it. Some comprehension is better than no comprehension when it
comes to dealing with data. This would greatly enhance the parsing of
non-Latin Asian languages.

(The PDF sample I attached is in the Tamil language.)
