Vicente created PDFBOX-1956:
-------------------------------

             Summary: Wrong character on conversion PDF to TXT
                 Key: PDFBOX-1956
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1956
             Project: PDFBox
          Issue Type: Task
          Components: Parsing
    Affects Versions: 1.8.4
         Environment: Windows
            Reporter: Vicente

I am trying to convert PDF files to TXT. For some PDFs, the extracted String contains wrong characters.
Could this be a Unicode problem? Can somebody help me?

The code:

    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {

        PDFParser parser;
        String parsedText;
        PDFTextStripper pdfStripper;
        PDDocument pdDoc;
        COSDocument cosDoc;
        PDDocumentInformation pdDocInfo;

        // PDFTextParser Constructor
        public PDFTextParser() {
        }

        // Extract text from PDF Document
        public String pdftoText(String fileName) {
            System.out.println("Parsing text from PDF file " + fileName + "....");
            File f = new File(fileName);
            if (!f.isFile()) {
                System.out.println("File " + fileName + " does not exist.");
                return null;
            }
            try {
                parser = new PDFParser(new FileInputStream(f));
            } catch (Exception e) {
                System.out.println("Unable to open PDF Parser.");
                return null;
            }
            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                System.out.println("An exception occurred in parsing the PDF Document.");
                e.printStackTrace();
                return null;
            } finally {
                // Close the document on success as well as on failure
                try {
                    if (cosDoc != null) cosDoc.close();
                    if (pdDoc != null) pdDoc.close();
                } catch (Exception e1) {
                    e1.printStackTrace();
                }
            }
            System.out.println("Done.");
            return parsedText;
        }
    }

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
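
One way to narrow down the Unicode question raised in the description is to look at the code points of the extracted String instead of how it is displayed, since a Windows console may render non-ASCII characters incorrectly even when the String itself is right. The following is only a minimal sketch using plain Java: `CodePointDump` and `dump` are hypothetical names, not part of PDFBox, and the sample String merely stands in for the value returned by `pdftoText(...)`.

```java
public class CodePointDump {

    // Hypothetical helper (not part of PDFBox): lists every code point of
    // the text as "U+XXXX", separated by single spaces.
    public static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 2 chars for supplementary characters, 1 otherwise
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by pdftoText(...)
        String parsedText = "A\u00e9\ufffd";
        System.out.println(dump(parsedText)); // prints "U+0041 U+00E9 U+FFFD"
    }
}
```

If the dump shows the expected code points, the problem is more likely in how the text is printed or written to the TXT file. If it shows unexpected values such as U+FFFD or private-use code points, the wrong characters are already present in the extracted String.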