Re: Problem when extracting text from a pdf file

Maruan Sahyoun Fri, 16 May 2014 15:46:29 -0700

Hi Mehmet,

It could well be that text extraction works for one PDF and not for another: 
the second file might not contain real text, and what you see on screen is 
only drawn. The attachments didn’t make it through because of restrictions on 
the mailing list. Could you upload the files to a public location so we can 
take a look and give an answer that is more specific to your case?
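
In the meantime, one rough way to check whether an extraction produced real 
text is to measure how much of the returned string is usable. The sketch below 
is plain Java and makes no PDFBox or iText calls. The class ExtractionCheck, 
its method names, and the threshold are hypothetical and chosen only for 
illustration; a low ratio is a hint that the file has no usable text layer, 
not proof of it.

```java
// Hypothetical helper for illustration; not part of PDFBox or iText.
final class ExtractionCheck {

    // Returns the fraction of code points that look like usable text.
    // A code point is rejected if it is a non-whitespace control character
    // (such as NUL), the replacement character U+FFFD, or undefined in Unicode.
    static double printableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int printable = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean control = Character.isISOControl(cp) && !Character.isWhitespace(cp);
            boolean replacement = cp == 0xFFFD;
            if (!control && !replacement && Character.isDefined(cp)) {
                printable++;
            }
        }
        return (double) printable / total;
    }

    // The threshold is an assumption; tune it for your own documents.
    static boolean looksLikeRealText(String text, double threshold) {
        return printableRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String good = "Extracted paragraph of text.\n";
        String bad = "\u0000\u0000\u0000\u0000";
        System.out.println(looksLikeRealText(good, 0.9));
        System.out.println(looksLikeRealText(bad, 0.9));
    }
}
```

You could pass the string returned by your text stripper to 
looksLikeRealText and log the files that fall below the threshold, so they can 
be inspected by hand or routed to OCR.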


BR

Maruan Sahyoun

On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu 
<[email protected]> wrote:

> Dear all,
>  
> As part of my research, I am trying to convert PDF files to text files. I 
> have tried both iText and PDFBox, but I encounter the same issue with each.
>  
> When I extract text from dnm1.pdf (attached), both approaches work 
> well. However, when I apply them to dnm2.pdf, they fail.
>  
> I get a text file full of NULL values. Is this normal for such 
> differently structured PDFs, or am I missing something else?
>  
> Thanks in advance.
>  
> Regards,
> Mehmet
>  
>  
> My code:
>  
> package retrievingfulltetxsfromweb;
>  
> import connectingurl.PlacesApi;
>  
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import org.apache.pdfbox.cos.COSDocument;
> import org.apache.pdfbox.pdfparser.PDFParser;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
>  
> public class PdfBox {
>    
>     // Extract text from PDF Document
>             public PdfBox(String fileName) {
>                     String parsedText = null;
>                     PDFTextStripper pdfStripper = null;
>                     PDDocument pdDoc = null;
>                     COSDocument cosDoc = null;
>                     File file = new File(fileName);
>                     if (!file.isFile()) {
>                             System.err.println("File " + fileName + " does not exist.");
>                             return; // nothing to parse
>                     }
>                     try {
>                             PDFParser parser = new PDFParser(new FileInputStream(file));
>                             parser.parse();
>                             cosDoc = parser.getDocument();
>                             pdfStripper = new PDFTextStripper();
>                             pdDoc = new PDDocument(cosDoc);
>                             // Only extract the first five pages
>                             pdfStripper.setStartPage(1);
>                             pdfStripper.setEndPage(5);
>                             parsedText = pdfStripper.getText(pdDoc);
>                             System.out.println(parsedText);
>                     } catch (Exception e) {
>                             System.err.println("An exception occurred in parsing the PDF Document. "
>                                             + e.getMessage());
>                     } finally {
>                             try {
>                                     if (cosDoc != null)
>                                             cosDoc.close();
>                                     if (pdDoc != null)
>                                             pdDoc.close();
>                             } catch (Exception e) {
>                                     e.printStackTrace();
>                             }
>                     }
>                     //return parsedText;
>             }
>             public static void main(String args[]){
>                    
>                 PdfBox pdf = new PdfBox("C:/dnm1.pdf");
>                    // System.out.println(pdftoText("C:/dnm1.pdf"));
>             }
>  
> }
