[ 
https://issues.apache.org/jira/browse/PDFBOX-5808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841093#comment-17841093
 ] 

Tilman Hausherr commented on PDFBOX-5808:
-----------------------------------------

I tested just the tokenizer changes (and the related ones), and "affine" now 
looks better. However, text extraction doesn't work for "in", regardless of 
whether it stands alone or is part of "affine". The cause is probably in the 
font itself: "ff" maps to unicode 0xfb00, but "in" maps to unicode 0xe0a2, 
which is a "private use" code point according to 
https://www.compart.com/de/unicode/U+E0A2 .
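
The private-use classification can also be checked locally. This is a minimal sketch that uses only java.lang.Character; the class and method names are hypothetical and not part of PDFBox or FontBox:

```java
public final class PrivateUseCheck {
    private PrivateUseCheck() {
    }

    /** Returns true if the code point has the Unicode general category "private use". */
    public static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        // 0xfb00 is the standard "ff" ligature; 0xe0a2 is what the font uses for "in".
        System.out.println(isPrivateUse(0xFB00));
        System.out.println(isPrivateUse(0xE0A2));
    }
}
```

A private use code point has no standardized meaning, so a text extractor cannot know that it stands for "in" unless the PDF or the font supplies that mapping.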

> Add support for GSUB Lookup Type 3
> ----------------------------------
>
>                 Key: PDFBOX-5808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5808
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: FontBox
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: Fabrice Calafat
>            Priority: Major
>
> Add support for lookup type 3 (Alternate Substitution) when handling GSUB:
> [https://learn.microsoft.com/en-us/typography/opentype/spec/gsub#AS]
> The first available substitution glyph can be used (as done in other 
> libraries).
>  
> Also, the current implementation of CompoundCharacterTokenizer doesn't 
> account for collisions between ligatures.
> For example, if a font supports ligatures for _att_ and _en_, the current 
> implementation will not tokenize the word _attention_ properly. This is 
> because the regex implementation doesn't allow for a proper split.
>  
> I'll open a proposed implementation for the above
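
The ligature collision described in the quoted issue can be illustrated with a small scan-based tokenizer. This is only a sketch of one possible approach (a greedy longest-match scan), not the patch proposed for CompoundCharacterTokenizer; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class LigatureTokenizerSketch {
    private LigatureTokenizerSketch() {
    }

    /**
     * Splits text into ligature tokens and runs of plain characters.
     * At each position the longest ligature that matches there wins.
     */
    public static List<String> tokenize(String text, List<String> ligatures) {
        // Try longer ligatures first so that a short one cannot shadow a long one.
        List<String> sorted = new ArrayList<>(ligatures);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        List<String> tokens = new ArrayList<>();
        StringBuilder plain = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String match = null;
            for (String ligature : sorted) {
                if (!ligature.isEmpty() && text.startsWith(ligature, i)) {
                    match = ligature;
                    break;
                }
            }
            if (match == null) {
                // No ligature starts here: collect the character into the plain run.
                plain.append(text.charAt(i));
                i++;
            } else {
                // Flush the pending plain run, then emit the ligature as its own token.
                if (plain.length() > 0) {
                    tokens.add(plain.toString());
                    plain.setLength(0);
                }
                tokens.add(match);
                i += match.length();
            }
        }
        if (plain.length() > 0) {
            tokens.add(plain.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("attention", Arrays.asList("att", "en")));
    }
}
```

Because the scan advances position by position, the adjacent ligatures _att_ and _en_ in _attention_ each become their own token, followed by the plain run "tion".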



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

