[ https://issues.apache.org/jira/browse/PDFBOX-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Bamford updated PDFBOX-1744:
----------------------------------
    Description:
Proposed addition to 1.8.2 -> pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java -> parseHeader() to default the PDF version to 1.4 in situations where it is missing (yes, there really are docs out there like this!).

This prevents an exception caused by a negative substring offset calculation: "String index out of range: -3"

I have floated the question on the us...@pdfbox.apache.org mailing list (10th October 2013) and it was suggested I default the PDF version to 1.4 in this scenario. I have tested it locally and it works (apparently PDFBox doesn't take the version number into account anyway).

Now over to you guys to decide if this is a good idea or not in the wider scope.

Should you give the green light, I attach:
1) a sample file which causes the exception
2) a patch file
3) patching instructions.

My goal is text extraction, even on broken files (if possible).

  was:
Proposed addition to 1.8.2 -> pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java -> parseHeader() to default the PDF version to 1.4 in situations where it is missing (yes, there really are docs out there like this!).

This prevents an exception caused by a negative substring offset calculation: "String index out of range: -3"

I have floated the question on the us...@pdfbox.apache.org mailing list (10th October 2013) and it was suggested I default the PDF version to 1.4 in this scenario. I have tested it locally and it works (apparently PDFBox doesn't take the version number into account anyway).

Now over to you guys to decide if this is a good idea or not in the wider scope.

Should you give the green light, I attach:
# a sample file which causes the exception
# a patch file + instructions.

My goal is text extraction, even on broken files (if possible).

> Be resilient to PDFs with missing version info
> ----------------------------------------------
>
>                 Key: PDFBOX-1744
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1744
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.8.2
>         Environment: PDFBox 1.8.2, IntelliJ IDEA 12.1.6, Mac OS X 10.7.5,
>                      Java 1.7, Maven 2.2.1
>            Reporter: Chris Bamford
>            Priority: Minor
>             Fix For: 1.8.3
>
>         Attachments: no_version.pdf, pdfbox.patch
>
>
> Proposed addition to 1.8.2 ->
> pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java ->
> parseHeader() to default the PDF version to 1.4 in situations where it is
> missing (yes, there really are docs out there like this!).
>
> This prevents an exception caused by a negative substring offset
> calculation: "String index out of range: -3"
>
> I have floated the question on the us...@pdfbox.apache.org mailing list (10th
> October 2013) and it was suggested I default the PDF version to 1.4 in this
> scenario. I have tested it locally and it works (apparently PDFBox doesn't
> take the version number into account anyway).
>
> Now over to you guys to decide if this is a good idea or not in the wider
> scope.
>
> Should you give the green light, I attach:
> 1) a sample file which causes the exception
> 2) a patch file
> 3) patching instructions.
>
> My goal is text extraction, even on broken files (if possible).
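
For illustration only, here is a minimal standalone sketch of the defaulting idea described in the issue. It is neither the attached pdfbox.patch nor the actual PDFParser.parseHeader() code: the class HeaderVersionSketch, the method versionOf() and the constant DEFAULT_VERSION are hypothetical names, and the sketch assumes the header line has already been read into a String. Instead of slicing the line at a fixed offset, it matches the version digits after the "%PDF-" marker and falls back to 1.4 when none are present.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of PDFBox.
public class HeaderVersionSketch {
    private static final String PDF_HEADER = "%PDF-";
    // Assumed fallback, following the suggestion quoted in the issue.
    private static final String DEFAULT_VERSION = "1.4";
    // One digit, a dot, one digit, directly after the marker.
    private static final Pattern VERSION = Pattern.compile("^(\\d\\.\\d)");

    /** Returns the version found after "%PDF-", or 1.4 if it is missing. */
    static String versionOf(String headerLine) {
        int start = headerLine.indexOf(PDF_HEADER);
        if (start < 0) {
            // A missing marker is a different problem; this sketch only
            // covers a present marker with an absent version.
            throw new IllegalArgumentException("no %PDF- marker in header line");
        }
        String rest = headerLine.substring(start + PDF_HEADER.length());
        Matcher m = VERSION.matcher(rest);
        // No fixed-offset substring here, so a short header cannot
        // produce a negative index.
        return m.find() ? m.group(1) : DEFAULT_VERSION;
    }

    public static void main(String[] args) {
        System.out.println(versionOf("%PDF-1.6")); // version present
        System.out.println(versionOf("%PDF-"));    // version missing, default used
    }
}
```

Whether a silent default is acceptable in the wider scope is exactly the question the issue raises; a caller that needs to know could log the fallback or return it alongside a flag.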