[ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--------------------------------------
    Environment: Windows7Professional JavaSE8 EclipseKepler  (was: Windows 
JavaSE8 EclipseKepler)

> problems with character extraction (OpenType, dense printed Text)
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Test
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows7Professional JavaSE8 EclipseKepler
>            Reporter: Matthias Bösinger
>            Priority: Minor
>              Labels: newbie
>         Attachments: test.pdf
>
>
> I have a PDF document whose font is OpenType (Garamond OpenType). As a result,
> the PDFBox text extraction can also extract special characters (for example
> small capital letters), which caused problems when the underlying font was a
> simple Type1 font.
> However, the text extraction now causes another type of problem. In my case,
> when the character sequences "fi" or "fl" occur in the text,
> PDFTextStripper#getText(PDDocument doc) extracts each of them as a single
> ligature character ('ﬁ' or 'ﬂ') and puts a space character to its right.
> (Surprisingly, if I access the list of characters of a page via the 
> charactersByArticle field of PDFTextStripper / via the 
> PDFTextStripper#processText(TextPosition pos) method, the same characters 
> show up as 'normal-single' characters f i / f l).
> My assumption is that the advantage of the underlying OpenType font turns
> into this particular disadvantage, because the PDFTextStripper recognizes the
> character sequences f i / f l as the special characters 'ﬁ' / 'ﬂ' (which might
> have to do with the fact that the getText() method calculates things like
> whitespace characters from distances / positional placements).
> Background: The given document is a dictionary with very densely printed
> text.
> See this link for code and output:
> http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
> My question: is there anything I can do to avoid this problem?
> Thanks in advance ...
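
One possible workaround, sketched here without any PDFBox-specific hook, is to
post-process the String returned by getText(). The class and method names
below (LigatureCleanup, expandLigatures, cleanup) are hypothetical and not
part of PDFBox. The sketch assumes the single characters in the output are the
Unicode ligatures U+FB01 and U+FB02, and that the space after them is always
spurious when a letter follows, which is only a heuristic.

```java
public class LigatureCleanup {

    // Replace the two ligature characters with their plain two-letter forms.
    static String expandLigatures(String text) {
        return text.replace("\uFB01", "fi").replace("\uFB02", "fl");
    }

    // Heuristic (assumption): a single space between a ligature and a
    // following letter is treated as an extraction artifact and dropped.
    // This would wrongly join words if a word really ends in "fi" or "fl".
    static String cleanup(String text) {
        String joined = text
                .replaceAll("\uFB01 (?=\\p{L})", "fi")
                .replaceAll("\uFB02 (?=\\p{L})", "fl");
        // Expand any ligatures that were not followed by a space and a letter.
        return expandLigatures(joined);
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper#getText.
        String extracted = "de\uFB01 nition of \uFB02 ow";
        System.out.println(cleanup(extracted));
    }
}
```

If the spurious space cannot be assumed safely for a given document, calling
only expandLigatures() leaves the spacing untouched.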



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
