[jira] [Updated] (PDFBOX-1744) Be resilient to PDFs with missing version info

Chris Bamford (JIRA) Thu, 10 Oct 2013 04:35:25 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Bamford updated PDFBOX-1744:
----------------------------------

    Description: 
Proposed addition to 1.8.2 -> 
pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java -> 
parseHeader() to default the PDF version to 1.4 in situations where it is 
missing (yes, there really are docs out there like this!).
This prevents an exception caused by a negative substring offset calculation: 
 "String index out of range: -3"

I have floated the question on the [email protected] mailing list (10th 
October 2013) and it was suggested I default the PDF version to 1.4 in this 
scenario.  I have tested it locally and it works (apparently PDFBox doesn't 
take the version number into account anyway).

Now over to you guys to decide if this is a good idea or not in the wider scope.

Should you give the green light, I attach:
1) a sample file which causes the exception
2) a patch file
3) patching instructions.

My goal is text extraction, even on broken files (if possible).
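
For illustration only, here is a minimal sketch of the defaulting idea. It is not the attached patch and not PDFBox's actual parseHeader(); the class name HeaderVersionSketch, the method parseVersion() and the regular expression are hypothetical. It assumes the header line contains the "%PDF-" marker and treats a missing version after it as 1.4, instead of computing substring offsets that can go negative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; names are illustrative, not PDFBox API.
public class HeaderVersionSketch {
    // "%PDF-" followed by an optional "digit.digit" version.
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)?");

    // Assumed fallback, as suggested on the mailing list.
    static final float DEFAULT_VERSION = 1.4f;

    static float parseVersion(String headerLine) {
        Matcher m = HEADER.matcher(headerLine);
        if (!m.find()) {
            // No "%PDF-" marker at all: out of scope for this sketch.
            throw new IllegalArgumentException("Not a PDF header: " + headerLine);
        }
        String version = m.group(1);
        // group(1) is null when the marker is present but the version is missing.
        return version == null ? DEFAULT_VERSION : Float.parseFloat(version);
    }

    public static void main(String[] args) {
        System.out.println(parseVersion("%PDF-1.6")); // prints 1.6
        System.out.println(parseVersion("%PDF-"));    // prints 1.4 (default)
    }
}
```

Matching the version with a pattern avoids any fixed-offset substring arithmetic, so a truncated header simply falls through to the default.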

  was:
Proposed addition to 1.8.2 -> 
pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java -> 
parseHeader() to default the PDF version to 1.4 in situations where it is 
missing (yes, there really are docs out there like this!).
This prevents an exception caused from a negative substring offset calculation: 
 "String index out of range: -3"

I have floated the question on the [email protected] mailing list (10th 
October 2013) and it was suggested I default the PDF version to 1.4 in this 
scenario.  I have tested it locally and it works (apparently PDFBox doesn't 
take the version number into account anyway).

Now over to you guys to decide if this is a good idea or not in the wider scope.

Should you give the green light, I attach:
# a sample file which causes the exception
# a patch file + instructions.

My goal is text extraction, even on broken files (if possible).


> Be resilient to PDFs with missing version info
> ----------------------------------------------
>
>                 Key: PDFBOX-1744
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1744
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.8.2
>         Environment: PDFBox 1.8.2, IntelliJ IDEA 12.1.6, Mac OS X 10.7.5, 
> Java 1.7, Maven 2.2.1
>            Reporter: Chris Bamford
>            Priority: Minor
>             Fix For: 1.8.3
>
>         Attachments: no_version.pdf, pdfbox.patch
>
>
> Proposed addition to 1.8.2 -> 
> pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java -> 
> parseHeader() to default the PDF version to 1.4 in situations where it is 
> missing (yes, there really are docs out there like this!).
> This prevents an exception caused by a negative substring offset 
> calculation:  "String index out of range: -3"
> I have floated the question on the [email protected] mailing list (10th 
> October 2013) and it was suggested I default the PDF version to 1.4 in this 
> scenario.  I have tested it locally and it works (apparently PDFBox doesn't 
> take the version number into account anyway).
> Now over to you guys to decide if this is a good idea or not in the wider 
> scope.
> Should you give the green light, I attach:
> 1) a sample file which causes the exception
> 2) a patch file
> 3) patching instructions.
> My goal is text extraction, even on broken files (if possible).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
