[ 
https://issues.apache.org/jira/browse/PDFBOX-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-755.
--------------------------------

    Resolution: Not a Problem

Update: Passing {{-encoding "UTF-8"}} to ExtractText gets me the combined 
characters as expected:

{code}
S. KALABUŠIĆ AND M. R. S. KULENOVIĆ
{code}

The same can be done programmatically via:

{code}
new PDFTextStripper("UTF-8")
{code}
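
For reference, the two forms compared in the report differ only in which code points are used and where they sit relative to the base letter. The following is a minimal sketch using only the JDK's java.text.Normalizer, not PDFBox itself; the class name DiacriticForms and the helper codePoints are hypothetical names chosen for illustration:

```java
import java.text.Normalizer;

public class DiacriticForms {
    // Formats the code points of s as space-separated "U+XXXX" strings.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unwanted form: a spacing acute accent placed before the letter.
        String spacingBefore = "\u00B4C";
        // Expected form: the letter followed by a combining acute accent.
        String combiningAfter = "C\u0301";
        // NFC composes the letter and combining mark into one code point.
        String composed = Normalizer.normalize(combiningAfter, Normalizer.Form.NFC);

        System.out.println(codePoints(spacingBefore));  // U+00B4 U+0043
        System.out.println(codePoints(combiningAfter)); // U+0043 U+0301
        System.out.println(codePoints(composed));       // U+0106
    }
}
```

The caron case behaves the same way: S followed by U+030C composes to U+0160 under NFC, whereas a spacing caron (U+02C7) placed before the letter does not combine with it.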

> Wrong translation of capital letters with combining diacritics
> --------------------------------------------------------------
>
>                 Key: PDFBOX-755
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-755
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0
>         Environment: Mac OS X 10.6.4
>            Reporter: Thomas Fischer
>         Attachments: 139-p.1+3.pdf, 139-p.1+3.txt
>
>
> S. KALABUˇSI ´C ANDM. R. S. KULENOVI ´C
> vs.
> S. KALABUŠIĆ AND M. R. S. KULENOVIĆ 
> 1.  ´ before vs.  ́ behind the letter (\x20 \xB4 vs. \x301)
> 2. ˇ before vs. ̌ behind the letter (\x2C7 vs. \x30C)
> 3. ANDM. : space missing
> Note:
> S. Kalabušić is translated correctly



--
This message was sent by Atlassian JIRA
(v6.2#6252)
