[jira] [Updated] (PDFBOX-2548) problems with character extraction (OpenType, dense printed Text)

JIRA Mon, 08 Dec 2014 08:37:39 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matthias Bösinger updated PDFBOX-2548:
--------------------------------------
    Description: 

I have a PDF document whose font is OpenType (Garamond OpenType). As a result, 
PDFBox text extraction can also extract special characters (for example small 
capital letters), which caused problems when the underlying font was a simple 
Type1 font.

However, the text extraction now causes another type of problem. In my case, 
when the character sequences "fi" or "fl" occur in the text, 
PDFTextStripper#getText(PDDocument doc) extracts each of them as a single 
ligature character, 'ﬁ' or 'ﬂ', and inserts a space character to its right.

(Surprisingly, if I access the list of characters of a page via the 
charactersByArticle field of PDFTextStripper or via the 
PDFTextStripper#processText(TextPosition pos) method, the same characters show 
up as ordinary single characters: f i / f l.)

My assumption is that the advantage of the underlying OpenType font turns into 
this particular disadvantage, because PDFTextStripper recognizes the 
character sequences f i / f l as the special characters ﬁ / ﬂ. This might be 
related to the fact that the getText() method derives things like whitespace 
characters from distances and positional placement.
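
As a possible workaround, the extracted string can be post-processed. The 
following is only a sketch and not part of PDFBox: the class name 
LigatureCleanup and the method cleanup are hypothetical, and it assumes that 
the symptom always has the form "ligature, one space, lowercase letter". A real 
word that ends in "fi" or "fl" and is followed by a lowercase word would be 
joined by mistake, so the heuristic needs checking against the actual document.

```java
import java.util.regex.Pattern;

final class LigatureCleanup {
    // U+FB01 (fi ligature) or U+FB02 (fl ligature), followed by exactly one
    // space and then a lowercase letter. Assumption: this pattern is the
    // spurious space described above, not a genuine word boundary.
    private static final Pattern LIGATURE_GAP =
            Pattern.compile("([\\uFB01\\uFB02]) (?=\\p{Ll})");

    // Hypothetical helper: removes the space after a ligature, then expands
    // the ligature into its two plain letters.
    static String cleanup(String extracted) {
        String joined = LIGATURE_GAP.matcher(extracted).replaceAll("$1");
        return joined.replace("\uFB01", "fi").replace("\uFB02", "fl");
    }

    public static void main(String[] args) {
        // Simulated getText() output; prints "definition, Einfluss"
        System.out.println(cleanup("de\uFB01 nition, Ein\uFB02 uss"));
    }
}
```

java.text.Normalizer with Form.NFKC would also expand both ligatures, but it 
rewrites other compatibility characters as well, so the explicit replacement is 
used here to leave the rest of the text untouched.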

Background: The document is a dictionary with very densely printed text.

See this link for code and output:
http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox

My question: is there anything I can do to avoid this problem?

Thanks in advance ...



> problems with character extraction (OpenType, dense printed Text)
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Test
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows JavaSE8 Eclipse
>            Reporter: Matthias Bösinger
>            Priority: Minor
>              Labels: newbie
>         Attachments: test.pdf
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
