Chris Bamford created PDFBOX-1744:
-------------------------------------
Summary: Be resilient to PDFs with missing version info
Key: PDFBOX-1744
URL: https://issues.apache.org/jira/browse/PDFBOX-1744
Project: PDFBox
Issue Type: Improvement
Components: Parsing
Affects Versions: 1.8.2
Environment: PDFBox 1.8.2, IntelliJ IDEA 12.1.6, Mac OS X 10.7.5, Java
1.7, Maven 2.2.1
Reporter: Chris Bamford
Priority: Minor
Fix For: 1.8.3
Proposed change to 1.8.2: in
pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java, have
parseHeader() default the PDF version to 1.4 when the header carries no
version number (yes, there really are docs out there like this!).
This prevents an exception caused by a negative substring offset calculation:
"String index out of range: -3"
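To illustrate the idea, here is a minimal sketch of the defaulting logic. It is not the attached patch and not the actual parseHeader() code: the class HeaderVersionSketch, the method parseVersion() and both constants are hypothetical names made up for this example, and it assumes the header line has already been read into a String.

```java
// Hypothetical stand-alone helper; NOT PDFBox's real parseHeader().
final class HeaderVersionSketch {
    static final String PDF_HEADER = "%PDF-";
    // Assumed fallback, as suggested on the mailing list.
    static final String DEFAULT_VERSION = "1.4";

    /**
     * Returns the "major.minor" version found after "%PDF-" in the
     * header line, or DEFAULT_VERSION when no usable version is present.
     */
    static String parseVersion(String headerLine) {
        String header = (headerLine == null) ? "" : headerLine.trim();
        int start = header.indexOf(PDF_HEADER);

        // Check the marker position before calling substring(), so a
        // missing or truncated header never yields a negative offset.
        String rest = (start >= 0)
                ? header.substring(start + PDF_HEADER.length())
                : "";

        // Accept only a leading "digit.digit"; anything else falls back.
        if (rest.matches("\\d\\.\\d.*")) {
            return rest.substring(0, 3);
        }
        return DEFAULT_VERSION;
    }
}
```

With this sketch, parseVersion("%PDF-1.6") gives "1.6", while parseVersion("%PDF-") and parseVersion("%PDF") both fall back to "1.4" instead of throwing.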
I have floated the question on the PDFBox mailing list (10th
October 2013) and it was suggested I default the PDF version to 1.4 in this
scenario. I have tested it locally and it works (apparently PDFBox doesn't
take the version number into account anyway).
Now over to you guys to decide if this is a good idea or not in the wider scope.
Should you give the green light, I attach:
# a sample file which causes the exception
# a patch file + instructions.
My goal is text extraction, even on broken files (if possible).