[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: (was: test2.pdf)

> Problems with character extraction (fi ligature)
> 
>
> Key: PDFBOX-2548
> URL: https://issues.apache.org/jira/browse/PDFBOX-2548
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
> Reporter: Matthias Bösinger
> Priority: Minor
> Attachments: preflight.png
>
>
> I have a PDF document whose font is OpenType (Garamond OpenType). Because of
> this, the PDFBox text extraction can also extract special characters (for
> example small capital letters) that caused problems when the underlying font
> was a simple Type 1 font.
> However, the text extraction now causes another type of problem. In my case,
> when the character sequences "fi" or "fl" occur in the text,
> PDFTextStripper#getText(PDDocument doc) extracts each of them as a single
> ligature character, 'ﬁ' (U+FB01) or 'ﬂ' (U+FB02), and inserts a space
> character to its right.
> (Surprisingly, if I access the list of characters of a page via the
> charactersByArticle field of PDFTextStripper, or via the
> PDFTextStripper#processTextPosition(TextPosition text) method, the same
> characters show up as ordinary separate characters: f i / f l.)
> My assumption is that the advantage of the underlying OpenType font turns
> into this particular disadvantage, because PDFTextStripper recognizes the
> character sequences f i / f l as the ligature characters ﬁ / ﬂ. This might be
> related to the fact that getText() derives things like whitespace
> characters from distances and positional placement.
> Background: The given document is a wordbook with very densely printed
> text.
> See this link for code and output:
> http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
> My question: is there anything I can do to avoid this problem?
> Thanks in advance ...
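
One possible workaround on the caller's side, independent of any fix in PDFBox itself, is to post-process the string returned by getText() and expand the ligature characters back into plain letters. The following is only a minimal sketch under that assumption: the class name LigatureCleanup and the method expandLigatures are hypothetical and not part of PDFBox; the only library call used is java.text.Normalizer from the JDK. It does not try to remove the extra space reported to the right of the ligature, since that space cannot be told apart from a real word gap without positional information.

```java
import java.text.Normalizer;

public class LigatureCleanup {

    /**
     * Hypothetical helper: expands the Latin ligature presentation forms
     * U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long-s-t, st) into plain letters.
     * All other characters are copied through unchanged.
     */
    static String expandLigatures(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length() + 8);
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // Compatibility normalization maps e.g. U+FB01 to "fi".
                out.append(Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by PDFTextStripper#getText(PDDocument).
        String extracted = "de\uFB01nition, re\uFB02ex";
        System.out.println(expandLigatures(extracted)); // prints "definition, reflex"
    }
}
```

Restricting the normalization to the ligature range, rather than normalizing the whole string, leaves other characters (accents, superscripts, and so on) exactly as extracted.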



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Bösinger updated PDFBOX-2548:
--
Attachment: (was: test.pdf)

> Problems with character extraction (fi ligature)
> 
>
> Key: PDFBOX-2548
> URL: https://issues.apache.org/jira/browse/PDFBOX-2548
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
> Reporter: Matthias Bösinger
> Priority: Minor
> Attachments: preflight.png





[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2548:

Attachment: preflight.png

> Problems with character extraction (fi ligature)
> 
>
> Key: PDFBOX-2548
> URL: https://issues.apache.org/jira/browse/PDFBOX-2548
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
> Reporter: Matthias Bösinger
> Priority: Minor
> Attachments: preflight.png, test.pdf, test2.pdf





[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)

2014-12-08 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson updated PDFBOX-2548:

Summary: Problems with character extraction (fi ligature) (was: problems
with character extraction (OpenType, dense printed Text))

> Problems with character extraction (fi ligature)
> 
>
> Key: PDFBOX-2548
> URL: https://issues.apache.org/jira/browse/PDFBOX-2548
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
> Reporter: Matthias Bösinger
> Priority: Minor
> Attachments: test.pdf, test2.pdf


