[ https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-5875:
------------------------------------
    Fix Version/s:     (was: 3.0.4 PDFBox)

> using font data to process ligatures
> ------------------------------------
>
>                 Key: PDFBOX-5875
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5875
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing, PDModel, Text extraction
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Manish S N
>            Priority: Major
>              Labels: Asian, CIDFont, font, ligatures, unicodemapping
>         Attachments: page.pdf
>
>
> To process ligatures from Asian languages (where a glyph is the combination
> of two Unicode characters) using the data in embedded fonts.
>
> *The problem:*
> Currently, modern PDF creators put these ligatures in the /ActualText field,
> which we only recently considered supporting in this issue. But this is not
> the case in old PDFs with embedded CID fonts like [^page.pdf], where the
> ligature glyphs lack a /ToUnicode character mapping because there is no
> single Unicode code point for them: each is a combination of more than one
> Unicode character.
>
> *The Potential Solution (if not perfect):*
> I managed to extract the font files using PDFBox
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java])
> and when I viewed the font files in FontForge I found the ligature data
> intact in them. So we can use this data to map each ligature glyph to the
> Unicode code points of its constituent glyphs.
>
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all,
> having been removed by the PDF optimiser because they are never used
> directly in the PDF apart from in ligatures. Such glyphs are empty, with
> only a glyph ID and no /ToUnicode mapping, even if that particular glyph has
> a corresponding Unicode character.
>
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could
> easily correct the resulting errors.
> Some comprehension is better than no comprehension when it comes to dealing
> with data. This will greatly enhance the parsing of non-Latin Asian languages.
>
> (The PDF sample I attached is in the Tamil language.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
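
The mapping proposed in the description (ligature glyph to the Unicode of its constituent glyphs) could be sketched roughly as below. This is only an illustration under stated assumptions: the class `LigatureUnicodeSketch`, the method `toUnicode`, and both input maps are hypothetical and not part of PDFBox. It assumes the ligature data from the embedded font has already been read into a plain map from ligature glyph ID to component glyph IDs, and that the known glyph-to-Unicode mappings are available as a second map. When a component has no mapping (the case listed under *Problems:*), the sketch returns null rather than guessing.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not a PDFBox class. */
final class LigatureUnicodeSketch {

    // Guard against malformed or cyclic ligature data (arbitrary limit).
    private static final int MAX_DEPTH = 8;

    /**
     * Resolves a glyph ID to Unicode text.
     *
     * @param gid                glyph ID to resolve
     * @param glyphToUnicode     glyph ID -> Unicode for glyphs with a direct mapping
     * @param ligatureComponents ligature glyph ID -> component glyph IDs, assumed to
     *                           have been extracted from the embedded font beforehand
     * @return the Unicode text, or null if the glyph or any component is unmapped
     */
    static String toUnicode(int gid,
                            Map<Integer, String> glyphToUnicode,
                            Map<Integer, List<Integer>> ligatureComponents) {
        return resolve(gid, glyphToUnicode, ligatureComponents, 0);
    }

    private static String resolve(int gid,
                                  Map<Integer, String> glyphToUnicode,
                                  Map<Integer, List<Integer>> ligatureComponents,
                                  int depth) {
        // A direct mapping always wins.
        String direct = glyphToUnicode.get(gid);
        if (direct != null) {
            return direct;
        }
        // Otherwise fall back to the ligature's components, if any are known.
        List<Integer> components = ligatureComponents.get(gid);
        if (components == null || depth > MAX_DEPTH) {
            return null;
        }
        StringBuilder text = new StringBuilder();
        for (int component : components) {
            String part = resolve(component, glyphToUnicode, ligatureComponents, depth + 1);
            if (part == null) {
                // A component was stripped from the font or has no mapping.
                return null;
            }
            text.append(part);
        }
        return text.toString();
    }
}
```

With made-up glyph IDs, a Tamil ligature glyph 200 built from glyphs mapped to U+0B95, U+0BCD and U+0BB7 would resolve to the three-character sequence, while a ligature that references an unmapped component would resolve to null and leave the caller to decide on a fallback.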