[ https://issues.apache.org/jira/browse/PDFBOX-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920140#action_12920140 ]
Johannes Koch commented on PDFBOX-860:
--------------------------------------

Hi Saurabh,

It looks like your document uses ligature characters, such as Unicode character U+FB01 (LATIN SMALL LIGATURE FI). I'm just guessing, but either:
1. PDFBox does not support this, or
2. Your output does not support this character (no mapping in the character encoding used, or no glyph in the font used).

> 'fi' getting converted to '?'
> -----------------------------
>
> Key: PDFBOX-860
> URL: https://issues.apache.org/jira/browse/PDFBOX-860
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Environment: Solaris 10
> Reporter: Saurabh Mehrotra
>
> Hi
> I am trying to use PDFBox version 1.2.1 to extract text from PDF files.
> The following issue is observed in the extracted text:
> 1. The character combination 'fi' is converted to a '?'
> example: first becomes ?rst
> classifier becomes classi?er
> find becomes ?nd
> Is this a known bug? Can some setting of PDFBox be turned off to prevent
> this?
> Thanks & Regards
> Saurabh

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
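
The second guess in the comment (no mapping for the ligature in the output encoding) can be checked without PDFBox. What follows is only a rough sketch under assumptions: that the extracted text really contains U+FB01, and that the output is written in ISO-8859-1. The report states neither the output encoding nor the font. The class and method names are made up for this illustration, and java.text.Normalizer is part of the JDK, not of PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class LigatureDemo {

    // Encode the text in the given charset and decode it again, as happens
    // when extracted text is written out in that encoding and read back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for any character it cannot map.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // NFKC normalization replaces compatibility characters such as
    // U+FB01 (LATIN SMALL LIGATURE FI) with their plain equivalents ("fi").
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed stand-in for the extracted text from the report.
        String extracted = "\uFB01rst classi\uFB01er \uFB01nd";

        // ISO-8859-1 has no code point for U+FB01.
        System.out.println(roundTrip(extracted, StandardCharsets.ISO_8859_1));

        // After normalization every character is mappable.
        System.out.println(
                roundTrip(expandLigatures(extracted), StandardCharsets.ISO_8859_1));
    }
}
```

If the first line printed shows the same '?' pattern as in the report, the output encoding is the likely culprit; writing the output as UTF-8, or normalizing the extracted text before writing it, would be the things to try. Note that NFKC also rewrites other compatibility characters (for example superscripts and full-width forms), so it may change more than the ligatures.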