[ 
https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5875:
------------------------------------
    Fix Version/s:     (was: 3.0.4 PDFBox)

> using font data to process ligatures
> ------------------------------------
>
>                 Key: PDFBOX-5875
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5875
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing, PDModel, Text extraction
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Manish S N
>            Priority: Major
>              Labels: Asian, CIDFont, font, ligatures, unicodemapping
>         Attachments: page.pdf
>
>
> Process ligatures in Asian scripts (where a single glyph represents a 
> combination of two or more Unicode characters) using the data in embedded 
> fonts.
>  
> *The problem:*
> Currently, modern PDF creators put these ligatures in the /ActualText 
> entry, which we only recently considered supporting in this issue. But this 
> is not the case in old PDFs with embedded CID fonts like [^page.pdf], where 
> the ligature glyphs lack a /ToUnicode mapping: no single Unicode code point 
> exists for them, because each is a combination of more than one Unicode 
> character. 
>  
> *The Potential Solution (if not perfect):* 
> I managed to extract the font files using PDFBox 
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java])
>  and, when I viewed them in FontForge, I found the ligature data intact. 
> So we can use this data to map each ligature glyph to the Unicode 
> characters of its constituent glyphs.
>  
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all, 
> having been removed by a PDF optimiser because they are never used directly 
> in the PDF, only within ligatures. Such glyphs are left empty, with only a 
> glyph ID and no /ToUnicode mapping, even if the glyph has a corresponding 
> Unicode character.
>  
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could 
> easily rectify the errors that remain. Some comprehension is better than 
> none when it comes to dealing with data. This would greatly improve the 
> parsing of non-Latin Asian languages.
>  
> (The PDF sample I attached is in Tamil.)
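
The mapping proposed in the description can be sketched roughly as follows. This is a minimal illustration in plain Java, not PDFBox code: the class name, the method name, the two lookup maps and the glyph IDs are all hypothetical. It assumes the glyph-to-code-point table has already been read from the font's cmap and the ligature-to-components table from the font's ligature data. A component that the optimiser stripped from the cmap falls back to U+FFFD, which is one possible policy for the case listed under "Problems". Nested ligatures are not handled.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class LigatureUnicodeSketch {

    // Emitted when a glyph or ligature component has no known code point.
    static final int REPLACEMENT = 0xFFFD;

    /**
     * Returns the text for a glyph ID.
     *
     * gidToCodePoint: glyph ID -> Unicode code point (assumed to come
     *                 from the embedded font's cmap).
     * ligatureComponents: ligature glyph ID -> ordered component glyph IDs
     *                 (assumed to come from the font's ligature data).
     */
    static String toUnicode(int gid,
                            Map<Integer, Integer> gidToCodePoint,
                            Map<Integer, List<Integer>> ligatureComponents) {
        // A direct mapping wins if the font provides one.
        Integer direct = gidToCodePoint.get(gid);
        if (direct != null) {
            return new String(Character.toChars(direct));
        }
        // Neither mapped nor a known ligature: nothing can be recovered.
        List<Integer> components = ligatureComponents.get(gid);
        if (components == null) {
            return new String(Character.toChars(REPLACEMENT));
        }
        // Concatenate the code points of the components, in order.
        StringBuilder text = new StringBuilder();
        for (int component : components) {
            Integer cp = gidToCodePoint.get(component);
            text.appendCodePoint(cp != null ? cp : REPLACEMENT);
        }
        return text.toString();
    }
}
```

For example, with made-up glyph IDs 10 (U+0B95, Tamil KA) and 11 (U+0BC1, Tamil vowel sign U), a ligature glyph 200 composed of glyphs 10 and 11 would resolve to the two-character string "\u0B95\u0BC1". If glyph 11 were missing from the cmap, the result would be U+0B95 followed by U+FFFD.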



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
