[ https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919514#comment-13919514 ]
Tilman Hausherr commented on PDFBOX-1956:
-----------------------------------------

The visual is just a bunch of glyph *images* that happen to make sense to you because you can read. To check whether the PDF is searchable, one solution could be to test whether the extracted words can be found in a dictionary of common words.

> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
>                 Key: PDFBOX-1956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1956
>             Project: PDFBox
>          Issue Type: Task
>          Components: Parsing
>    Affects Versions: 1.8.4
>         Environment: Windows
>            Reporter: Vicente
>              Labels: parser
>         Attachments: example b.pdf, itext_pdfabc-sample.pdf
>
>
> I am trying to convert PDF to TXT, and for some PDFs the resulting String
> contains wrong characters. Could it be a Unicode problem? Can somebody help me?
> I observed the problem when converting a PDF created by PDFCreator to
> text. The characters are wrong. Any suggestions?
> The code:
> import java.io.File;
> import java.io.FileInputStream;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDDocumentInformation;
> import org.apache.pdfbox.util.PDFTextStripper;
>
> public class PDFTextParser {
>
>     PDFParser parser;
>     String parsedText;
>     PDFTextStripper pdfStripper;
>     PDDocument pdDoc;
>     COSDocument cosDoc;
>     PDDocumentInformation pdDocInfo;
>
>     // PDFTextParser Constructor
>     public PDFTextParser() {
>     }
>
>     // Extract text from PDF Document
>     public String pdftoText(String fileName) {
>
>         System.out.println("Parsing text from PDF file " + fileName + "....");
>         File f = new File(fileName);
>
>         if (!f.isFile()) {
>             System.out.println("File " + fileName + " does not exist.");
>             return null;
>         }
>
>         try {
>             parser = new PDFParser(new FileInputStream(f));
>         } catch (Exception e) {
>             System.out.println("Unable to open PDF Parser.");
>             return null;
>         }
>
>         try {
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdfStripper = new PDFTextStripper();
>             pdDoc = new PDDocument(cosDoc);
>             parsedText = pdfStripper.getText(pdDoc);
>         } catch (Exception e) {
>             System.out.println("An exception occurred in parsing the PDF Document.");
>             e.printStackTrace();
>             try {
>                 if (cosDoc != null) cosDoc.close();
>                 if (pdDoc != null) pdDoc.close();
>             } catch (Exception e1) {
>                 // Report the failure from close(), not the outer exception again
>                 e1.printStackTrace();
>             }
>             return null;
>         }
>         System.out.println("Done.");
>         return parsedText;
>     }
> }

--
This message was sent by Atlassian JIRA
(v6.2#6252)
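
The dictionary check suggested in the comment above could be sketched roughly as follows. This is only an illustration, not part of PDFBox: the class name SearchableTextCheck, the method names, the tiny built-in word list and the 0.5 threshold are all assumptions made for the example. A real check would load a full word list for the document's language and tune the threshold. The input is the String returned by PDFTextStripper.getText().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
final class SearchableTextCheck {

    // Tiny illustrative word list (assumption); a real check would load a dictionary file.
    static final Set<String> COMMON_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "of", "and", "to", "in", "is", "this", "that", "text"));

    // Fraction of alphabetic tokens in the text that are found in the dictionary.
    static double dictionaryHitRatio(String text, Set<String> dictionary) {
        int total = 0;
        int hits = 0;
        // Split on any run of non-letter characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // produced by leading delimiters or empty input
            }
            total++;
            if (dictionary.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Treat the text as plausibly searchable if enough tokens are known words.
    static boolean looksSearchable(String text, Set<String> dictionary, double threshold) {
        return dictionaryHitRatio(text, dictionary) >= threshold;
    }

    public static void main(String[] args) {
        String extracted = args.length > 0 ? args[0] : "This is the text";
        System.out.println(looksSearchable(extracted, COMMON_WORDS, 0.5));
    }
}
```

A low ratio suggests that the glyphs do not map to meaningful Unicode, which matches the symptom reported in this issue. It cannot prove that, since text in another language or text full of proper names would also score low.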