[jira] [Updated] (PDFBOX-1956) Wrong character on conversion PDF to TXT

Vicente (JIRA) Sun, 02 Mar 2014 04:32:16 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vicente updated PDFBOX-1956:
----------------------------

    Description: 
I am trying to convert PDF to TXT, and for some PDFs the resulting String 
contains wrong characters. Could this be a Unicode problem? Can somebody help me?

I observed that the problem occurs when converting PDFs created by PDFCreator 
to text. The characters are wrong. Any suggestions?

The code:


import java.io.File;
import java.io.FileInputStream;

// Package names as in PDFBox 1.8.x
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {
    
    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;
    
    // PDFTextParser Constructor 
    public PDFTextParser() {
    }
    
    // Extract text from PDF Document; returns null on failure
    public String pdftoText(String fileName) {
        
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);
        
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc); 
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Close the documents on success as well as on failure
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }
}
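
To help answer the Unicode question, here is a minimal sketch that lists the Unicode code points of a String, so the extracted text can be compared with what the PDF shows on screen. It is not part of the reported code: the class name CodePointDump and the method codePoints are hypothetical, and the sample string in main is only an illustration. It uses plain Java with no PDFBox calls.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    // Returns one "U+XXXX" entry per Unicode code point in the text.
    // Surrogate pairs are reported as a single code point.
    public static List<String> codePoints(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample; in practice pass the String returned by pdftoText(...)
        System.out.println(codePoints("Ab\u00e9")); // prints [U+0041, U+0062, U+00E9]
    }
}
```

If the code points are already unexpected at this stage, the wrong characters were produced during extraction. If they look right here but the TXT file is wrong, the encoding used when writing the file is the more likely cause.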
    


> Wrong character on conversion PDF to TXT
> ----------------------------------------
>
>                 Key: PDFBOX-1956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1956
>             Project: PDFBox
>          Issue Type: Task
>          Components: Parsing
>    Affects Versions: 1.8.4
>         Environment: Windows
>            Reporter: Vicente
>              Labels: parser



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
