[ https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vicente updated PDFBOX-1956:
----------------------------
    Description:
I am trying to convert PDFs to text, and for some PDFs the resulting String contains wrong characters. Could this be a Unicode problem? Can somebody help me?

I observed the problem when converting PDFs created by PDFCreator to text: the characters are wrong. Any suggestions?

the code
public class PDFTextParser {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor
    public PDFTextParser() {
    }

    // Extract text from PDF Document
    public String pdftoText(String fileName) {

        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);

        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }

        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }

        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e1) {
                e.printStackTrace();
            }
            return null;
        }
        System.out.println("Done.");
        return parsedText;
    }

  was:
I am trying to convert PDFs to text, and for some PDFs the resulting String contains wrong characters. Could this be a Unicode problem? Can somebody help me?

the code
public class PDFTextParser {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor
    public PDFTextParser() {
    }

    // Extract text from PDF Document
    public String pdftoText(String fileName) {

        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);

        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }

        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }

        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e1) {
                e.printStackTrace();
            }
            return null;
        }
        System.out.println("Done.");
        return parsedText;
    }


> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
>                 Key: PDFBOX-1956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1956
>             Project: PDFBox
>          Issue Type: Task
>          Components: Parsing
>    Affects Versions: 1.8.4
>         Environment: Windows
>            Reporter: Vicente
>              Labels: parser
>
> I am trying to convert PDFs to text, and for some PDFs the resulting String
> contains wrong characters. Could this be a Unicode problem? Can somebody help me?
> I observed the problem when converting PDFs created by PDFCreator to text:
> the characters are wrong. Any suggestions?
> the code
> public class PDFTextParser {
>
>     PDFParser parser;
>     String parsedText;
>     PDFTextStripper pdfStripper;
>     PDDocument pdDoc;
>     COSDocument cosDoc;
>     PDDocumentInformation pdDocInfo;
>
>     // PDFTextParser Constructor
>     public PDFTextParser() {
>     }
>
>     // Extract text from PDF Document
>     public String pdftoText(String fileName) {
>
>         System.out.println("Parsing text from PDF file " + fileName + "....");
>         File f = new File(fileName);
>
>         if (!f.isFile()) {
>             System.out.println("File " + fileName + " does not exist.");
>             return null;
>         }
>
>         try {
>             parser = new PDFParser(new FileInputStream(f));
>         } catch (Exception e) {
>             System.out.println("Unable to open PDF Parser.");
>             return null;
>         }
>
>         try {
>             parser.parse();
>             cosDoc = parser.getDocument();
>             pdfStripper = new PDFTextStripper();
>             pdDoc = new PDDocument(cosDoc);
>             parsedText = pdfStripper.getText(pdDoc);
>         } catch (Exception e) {
>             System.out.println("An exception occurred in parsing the PDF Document.");
>             e.printStackTrace();
>             try {
>                 if (cosDoc != null) cosDoc.close();
>                 if (pdDoc != null) pdDoc.close();
>             } catch (Exception e1) {
>                 e.printStackTrace();
>             }
>             return null;
>         }
>         System.out.println("Done.");
>         return parsedText;
>     }
>

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)