[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:
--------------------------------------
    Attachment:     (was: test2.pdf)

> Problems with character extraction (fi ligature)
> ------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
>            Reporter: Matthias Bösinger
>            Priority: Minor
>         Attachments: preflight.png
>
> I have a PDF document whose font is OpenType (Garamond OpenType). Because of
> this, the PDFBox text extraction can also extract special characters (for
> example small capital letters), which caused problems when the underlying
> font was a simple Type1 font.
>
> However, the text extraction now causes another kind of problem. In my case,
> when the character sequences "fi" or "fl" occur in the text,
> PDFTextStripper#getText(PDDocument doc) extracts them as single characters,
> 'ﬁ' and 'ﬂ', and puts a space character on their right side.
> (Surprisingly, if I access the list of characters of a page via the
> charactersByArticle field of PDFTextStripper, or via the
> PDFTextStripper#processText(TextPosition pos) method, the same characters
> show up as normal, separate characters f i / f l.)
>
> My assumption is that the advantage of the underlying OpenType font turns
> into this particular disadvantage, because PDFTextStripper recognizes the
> character sequences f i / f l as the special characters ﬁ / ﬂ (which might
> have to do with the fact that the getText() method calculates things like
> whitespace characters from distances / positional placement).
>
> Background: the given document is a dictionary with very densely printed
> text.
>
> See this link for code and output:
> http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
>
> My question: is there anything I can do to avoid this problem?
>
> Thanks in advance ...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Bösinger updated PDFBOX-2548:
--------------------------------------
    Attachment:     (was: test.pdf)

> Problems with character extraction (fi ligature)
> ------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
>            Reporter: Matthias Bösinger
>            Priority: Minor
>         Attachments: preflight.png
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2548:
--------------------------------
    Attachment: preflight.png

> Problems with character extraction (fi ligature)
> ------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
>            Reporter: Matthias Bösinger
>            Priority: Minor
>         Attachments: preflight.png, test.pdf, test2.pdf
[jira] [Updated] (PDFBOX-2548) Problems with character extraction (fi ligature)
[ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-2548:
--------------------------------
    Summary: Problems with character extraction (fi ligature)  (was: problems with character extraction (OpenType, dense printed Text))

> Problems with character extraction (fi ligature)
> ------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows 7 Professional, Java SE 8, Eclipse Kepler
>            Reporter: Matthias Bösinger
>            Priority: Minor
>         Attachments: test.pdf, test2.pdf
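
The reporter asks whether anything can be done about the ligature characters in the extracted text. The thread itself does not give an answer, so the following is only a sketch of one possible post-processing step. It assumes the string returned by getText() really contains the single ligature code points U+FB01 and U+FB02. The class and method names (LigatureExpander, expand) are made up for this illustration and are not part of PDFBox. The sketch does not touch the extra space that the reporter sees to the right of the ligature.

```java
public class LigatureExpander {

    // Replaces the two ligature code points named in the report with
    // their plain two-letter sequences. All other characters are kept.
    static String expand(String extracted) {
        return extracted
                .replace("\uFB01", "fi")   // LATIN SMALL LIGATURE FI
                .replace("\uFB02", "fl");  // LATIN SMALL LIGATURE FL
    }

    public static void main(String[] args) {
        // Stand-in for text returned by PDFTextStripper#getText (assumed input).
        String extracted = "de\uFB01nition and re\uFB02ex";
        System.out.println(expand(extracted));
    }
}
```

A broader alternative would be java.text.Normalizer with Normalizer.Form.NFKC, which also expands these ligatures, but it rewrites many other compatibility characters as well and may therefore change more of a dictionary text than intended.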