Hello,
I'm using PDFBox to go through a list of PDFs and extract a phrase
from each file. Everything worked well until I scaled up to a larger
list of files. With a list of 112 files, the first handful (maybe 20)
are processed successfully, but subsequent files cannot be opened.
When I run the same method on just the offending files, one at a time,
they open without any problem.
Here is the problematic method, which is called for each PDF file:
    PDDocument doc = null;
    String regExDoi = "[Dd][Oo][Ii]:[0-9\\s]*\\.[\\n\\r0-9]*/[A-Za-z0-9\\.\\-;\\(\\)/]*";
    String regExDoiSplit = "[Dd][Oo][Ii]:";
    Pattern findDoiString = Pattern.compile(regExDoi);
    try {
        try {
            doc = PDDocument.load(file);
            System.out.println("======= "+file.getName()+" loaded ========");
            decrypt(doc);
            if (!isFailedFile) {
                PDFTextStripper strip = new PDFTextStripper();
                int pageCount = doc.getNumberOfPages();
                System.out.println("Pages: "+pageCount);
                for (int page = 1; page < pageCount; page++) {
                    // restrict pdftextstripper to current page
                    strip.setStartPage(page);
                    strip.setEndPage(page);
                    // get text on page
                    String text = strip.getText(doc);
                    // try to find the doi string
                    Matcher m = findDoiString.matcher(text);
                    if (m.find()) {
                        String foundGroup = m.group();
                        String foundIt[] = foundGroup.split(regExDoiSplit);
                        // split at regexDoiSplit, should be String[] = {"", "the doi numbers"}
                        if (foundIt.length > 0) {
                            System.out.println("\tDOI: '"+foundIt[1]+"'");
                            if (doc != null) {
                                System.out.println("Closing document, found doi.");
                                doc.close();
                            }
                            // return the doi numbers, stripping any white space
                            return foundIt[1].replaceAll("[\\s]*", "");
                        }
                    }
                }
            } else System.out.println(isFailedFile + failedReason.toString());
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
    } catch (IOException e) {
        isFailedFile = true;
        failedReason = FailedReason.BADFILE;
        if (doc != null) {
            doc.close();
        }
    }
    if (doc != null) {
        doc.close();
        System.out.println("close it again");
    }
    return null;
}
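Stripped of the PDFBox specifics, what I am trying to achieve is
exactly one close() on every exit path, including the early return. A
minimal sketch of that structure follows. FakeDoc is a hypothetical
stand-in I made up for PDDocument (it is not a PDFBox class), and the
DOI pattern is a simplified version of the one above, so this only
illustrates the control flow and not the real library behaviour:

```java
import java.io.Closeable;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CloseOnceSketch {

    // Hypothetical stand-in for PDDocument: holds the text of each page
    // and counts how many times close() is called.
    static class FakeDoc implements Closeable {
        final String[] pages;
        int closeCount = 0;

        FakeDoc(String... pages) {
            this.pages = pages;
        }

        @Override
        public void close() {
            closeCount++;
        }
    }

    // Simplified DOI pattern (an assumption, not the regex from the method above).
    static final Pattern DOI = Pattern.compile("[Dd][Oo][Ii]:\\s*([0-9.]+/\\S+)");

    // Returns the first DOI found, or null. The single finally block is the
    // only place the document is closed, and it runs on every exit path.
    static String findDoi(FakeDoc doc) {
        try {
            for (String text : doc.pages) {
                Matcher m = DOI.matcher(text);
                if (m.find()) {
                    return m.group(1); // finally still runs before returning
                }
            }
            return null;
        } finally {
            doc.close();
        }
    }

    public static void main(String[] args) {
        FakeDoc hit = new FakeDoc("no match here", "doi:10.1000/xyz123");
        FakeDoc miss = new FakeDoc("nothing", "still nothing");
        System.out.println(findDoi(hit) + " closes=" + hit.closeCount);   // 10.1000/xyz123 closes=1
        System.out.println(findDoi(miss) + " closes=" + miss.closeCount); // null closes=1
    }
}
```

In this sketch the document is closed once whether or not a DOI is
found, which is the behaviour I expected from my method.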
I think the problem arises because I keep getting a "warning, you
did not close the pdf" message. Over such a long list, after that
warning has appeared many times, the files can no longer be opened. I
thought I closed the document at every point where it needed to be
closed. Did I miss something? Thank you.
-Sophia