[ https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vicente updated PDFBOX-1956:
----------------------------
Description:
I am trying to convert PDF to TXT, and for some PDFs the converted String
contains wrong characters. Could this be a Unicode problem? Can somebody help me?
I observed the problem when converting PDFs created by PDFCreator to
text. The characters are wrong. Any suggestions?
The code:
// Package names below are those of PDFBox 1.8.x
import java.io.File;
import java.io.FileInputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {
    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor
    public PDFTextParser() {
    }

    // Extract text from PDF Document; returns null on any failure
    public String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the document on success as well as on failure
            try {
                if (pdDoc != null) pdDoc.close();
                else if (cosDoc != null) cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }
}
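A possible way to narrow down the "Unicode problem" question is sketched below. It does not use PDFBox at all: it only inspects the String returned by pdftoText and writes it out with an explicit encoding. The class and method names (ExtractedTextCheck, suspiciousCodePoints, writeUtf8) are hypothetical and exist only for this illustration. The assumption is that wrong characters come from one of two places: either the String already contains replacement or private-use code points (which would suggest the PDF's fonts do not map cleanly to Unicode), or the String is fine and the TXT file is written with the platform default charset.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ExtractedTextCheck {

    // Lists code points that often indicate a failed glyph-to-Unicode mapping:
    // U+FFFD, private-use characters, and control characters other than
    // tab, line feed and carriage return.
    static List<String> suspiciousCodePoints(String text) {
        List<String> found = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            boolean replacement = cp == 0xFFFD;
            boolean privateUse = Character.getType(cp) == Character.PRIVATE_USE;
            boolean control = Character.isISOControl(cp)
                    && cp != '\n' && cp != '\r' && cp != '\t';
            if (replacement || privateUse || control) {
                found.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    // Writes the text as UTF-8 instead of relying on the platform default
    // charset (often windows-1252 on Windows).
    static void writeUtf8(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Sample string with one private-use character standing in for a bad glyph
        String sample = "Pre\uE001o: 10 \u20AC";
        System.out.println(suspiciousCodePoints(sample));

        Path out = Files.createTempFile("extracted", ".txt");
        writeUtf8(sample, out);
        String readBack = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(readBack.equals(sample));
    }
}
```

If the list is empty for the real extracted text, the String itself is probably fine and the output encoding or the viewer is the more likely suspect. If it is not empty, the problem most likely sits in the PDF's font information rather than in this code.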
was:
I am trying to convert PDF to TXT, and for some PDFs the converted String
contains wrong characters. Could this be a Unicode problem? Can somebody help me?
The code: the same PDFTextParser class as in the updated description above.
> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
> Key: PDFBOX-1956
> URL: https://issues.apache.org/jira/browse/PDFBOX-1956
> Project: PDFBox
> Issue Type: Task
> Components: Parsing
> Affects Versions: 1.8.4
> Environment: Windows
> Reporter: Vicente
> Labels: parser
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)