Dear all,
As part of my research, I am trying to convert PDF files to text files. I have
tried both iText and PDFBox, but I run into the same issue with each.
When I extract text from dnm1.pdf (attached), both approaches work
well. However, when I apply them to dnm2.pdf, they fail:
I get a text file full of NULL values. Is this normal for such
differently shaped PDFs, or am I missing something?
Thanks in advance.
Regards,
Mehmet
My code:
package retrievingfulltetxsfromweb;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfBox {

    // Extract text from a PDF document and print it to standard output.
    public PdfBox(String fileName) {
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return;
        }

        FileInputStream input = null;
        COSDocument cosDoc = null;
        PDDocument pdDoc = null;
        try {
            // Parse the file once and wrap the result in a PDDocument.
            input = new FileInputStream(file);
            PDFParser parser = new PDFParser(input);
            parser.parse();
            cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);

            // Extract the text of pages 1 to 5 only.
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (Exception e) {
            System.err.println("An exception occurred while parsing the PDF document. "
                    + e.getMessage());
        } finally {
            try {
                if (pdDoc != null) {
                    pdDoc.close(); // also closes the underlying COSDocument
                } else if (cosDoc != null) {
                    cosDoc.close();
                }
                if (input != null) {
                    input.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        new PdfBox("C:/dnm1.pdf");
    }
}
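
To put a number on "full of NULL values", a small check along these lines could be run on the extracted string. This is only a sketch: the class name NullCheck and the method nulFraction are hypothetical helpers and are not part of PDFBox or iText. It assumes that the "NULL values" are U+0000 characters in the returned text.

```java
final class NullCheck {

    private NullCheck() {
    }

    // Returns the fraction of characters in text that are U+0000.
    // Returns 0.0 for null or empty input.
    static double nulFraction(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int nulCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\u0000') {
                nulCount++;
            }
        }
        return (double) nulCount / text.length();
    }

    public static void main(String[] args) {
        // Two of the four characters are NUL, so this prints 0.5.
        System.out.println(nulFraction("ab\u0000\u0000"));
    }
}
```

Calling NullCheck.nulFraction(parsedText) right after getText(...) for both dnm1.pdf and dnm2.pdf would show whether the second file really yields mostly NUL characters. If it does, that may point to how the fonts in that PDF are encoded rather than to a mistake in the extraction code.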