[ https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919787#comment-13919787 ]
John Hewson edited comment on PDFBOX-1956 at 3/4/14 7:15 PM:
-------------------------------------------------------------

{quote}
Do you know how I can check problem in PDF (like this) ? Working with PDFBOX is possible to check it ?
{quote}

We see PDFs like this fairly often. The problem is that the text embedded in the PDF is perfectly valid; it's just that the font's encoding is meaningless to a human. The embedded font maps the character to a glyph which is obviously the letter "P", but we have no way to know this, as the glyph claims to be .

To detect PDFs with this problem, you could try https://code.google.com/p/language-detection/ and see if the language identified is what you were expecting. Let me know if you try this and it works.


was (Author: jahewson):
{quote}
Do you know how I can check problem in PDF (like this) ? Working with PDFBOX is possible to check it ?
{quote}

We see PDFs like this fairly often, the problem is that the text embedded in PDF is perfectly valid, it's just that the font's encoding is meaningless to a human. The embedded font maps the character to a glyph which is obviously the letter "P" but we have no way to know this, from our point of view the glyph claims to be .

To detect PDFs with this problem, you could try https://code.google.com/p/language-detection/ and see if the language identified is what you were expecting. Let me know if you try this and it works.

> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
>                 Key: PDFBOX-1956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1956
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.4
>         Environment: Windows
>            Reporter: Vicente
>            Priority: Minor
>              Labels: parser
>         Attachments: example b.pdf, itext_pdfabc-sample.pdf
>
>
> I am trying to convert PDFs to TXT, and for some PDFs the converted String
> contains wrong characters. Could this be a Unicode problem? Can somebody help me?
> I observed the problem when converting a PDF created by PDFCreator to
> text. The characters are wrong. Any suggestions?
> The code:
> import java.io.File;
> import java.io.FileInputStream;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDDocumentInformation;
> import org.apache.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser {
>
>     PDFParser parser;
>     String parsedText;
>     PDFTextStripper pdfStripper;
>     PDDocument pdDoc;
>     COSDocument cosDoc;
>     PDDocumentInformation pdDocInfo;
>
>     // PDFTextParser Constructor
>     public PDFTextParser() {
>     }
>
>     // Extract text from PDF Document
>     public String pdftoText(String fileName) {
>
>         System.out.println("Parsing text from PDF file " + fileName + "....");
>         File f = new File(fileName);
>
>         if (!f.isFile()) {
>             System.out.println("File " + fileName + " does not exist.");
>             return null;
>         }
>
>         try {
>             parser = new PDFParser(new FileInputStream(f));
>         } catch (Exception e) {
>             System.out.println("Unable to open PDF Parser.");
>             return null;
>         }
>
>         try {
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdfStripper = new PDFTextStripper();
>             pdDoc = new PDDocument(cosDoc);
>             parsedText = pdfStripper.getText(pdDoc);
>         } catch (Exception e) {
>             System.out.println("An exception occurred in parsing the PDF Document.");
>             e.printStackTrace();
>             try {
>                 if (cosDoc != null) cosDoc.close();
>                 if (pdDoc != null) pdDoc.close();
>             } catch (Exception e1) {
>                 // Report the failure to close, not the original parse error again
>                 e1.printStackTrace();
>             }
>             return null;
>         }
>         System.out.println("Done.");
>         return parsedText;
>     }
> }

--
This message was sent by Atlassian JIRA
(v6.2#6252)
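
The comment above suggests running a language detector over the extracted text and checking whether the result is the language you expected. As a rough, dependency-free stand-in for that idea (it does not use the language-detection library, and it is not part of PDFBox), the sketch below measures how many word tokens in the extracted text are common English stop words. The class name, the stop-word list and the 0.10 threshold are all hypothetical choices for illustration; text extracted through a meaningless font encoding would be expected to score near zero.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: a crude "does this look like English?" check
// for text returned by a text extractor.
class ExtractedTextCheck {

    // Deliberately tiny, illustrative stop-word list (an assumption, not a standard list).
    private static final Set<String> ENGLISH_STOP_WORDS = new HashSet<String>(Arrays.asList(
            "the", "and", "of", "to", "in", "is", "a", "that", "it", "for", "with", "this"));

    // Returns the fraction of word tokens that appear in stopWords (0.0 if there are no tokens).
    static double stopWordRatio(String text, Set<String> stopWords) {
        if (text == null) {
            return 0.0;
        }
        // Split on runs of non-letter characters so punctuation and digits are ignored.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (stopWords.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Threshold of 0.10 is an arbitrary assumption; tune it on real documents.
    static boolean looksLikeEnglish(String text) {
        return stopWordRatio(text, ENGLISH_STOP_WORDS) >= 0.10;
    }
}
```

One possible use is to pass the String returned by pdftoText to ExtractedTextCheck.looksLikeEnglish and flag the PDF for manual review when it returns false. A real language detector should be more reliable than this heuristic, especially on short documents or text in other languages.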