[ https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709455#action_12709455 ]
Sean Bridges edited comment on PDFBOX-465 at 5/14/09 9:14 AM:
--------------------------------------------------------------
I'm also getting,
Caused by: java.io.IOException: Error: Invalid date format 'P8''
    at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:157)
    at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
    at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:784)
    at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
    at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:63)
    ... 2 more
The PDF is invalid:
/CreationDate (P8)
It looks like they are trying to UTF-16 encode the metadata properties for
some reason.
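
For context: a PDF text string that starts with the bytes FE FF is UTF-16BE, otherwise it is in PDFDocEncoding. The sketch below shows only that distinction. It is not PDFBox code; the class and method names are hypothetical, and ISO-8859-1 is used as a rough stand-in for PDFDocEncoding, which differs from it for some byte values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox.
final class PdfTextStringSketch {

    /** Decodes raw string bytes, honoring a leading UTF-16BE byte order mark. */
    static String decode(byte[] raw) {
        boolean hasBom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasBom) {
            // Skip the two BOM bytes and decode the rest as UTF-16BE.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding well enough here.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0, 'D', 0, ':', 0, '2', 0, '0', 0, '0', 0, '9'};
        System.out.println(decode(utf16)); // prints "D:2009"
    }
}
```

If such bytes are read as single-byte text instead, the resulting date string picks up stray characters, which could explain a value that fails the length check.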
Returning null from the length check instead of throwing fixes it:
         date = date.substring( 2, date.length() );
     }
     if( date.length() < 4 )
-    {
-        throw new IOException( "Error: Invalid date format '" + date + "'" );
+    {
+        return null;
     }
     year = Integer.parseInt( date.substring( 0, 4 ) );
     if( date.length() >= 6 )
I'm not attaching the diff as a file: my copy of the code has so many
changes by now that you wouldn't be able to simply apply it. Most of my
changes are trivial.
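
To show the idea outside of DateConverter, here is a minimal stand-alone sketch of a parser that treats a too-short or non-numeric value as "no date" rather than as an error. The class and method names are hypothetical, it only looks at the year, month and day, and it does not reproduce the real toCalendar logic.

```java
import java.util.Calendar;

// Hypothetical stand-alone sketch; it does not reproduce DateConverter.
final class LenientDateSketch {

    /** Returns a calendar for "D:YYYY[MM[DD]]..." input, or null if it is unusable. */
    static Calendar toCalendarOrNull(String date) {
        if (date == null) {
            return null;
        }
        date = date.trim();
        if (date.startsWith("D:")) {
            date = date.substring(2);
        }
        if (date.length() < 4) {
            return null; // too short to hold a year: treat as "no date"
        }
        try {
            int year = Integer.parseInt(date.substring(0, 4));
            int month = date.length() >= 6 ? Integer.parseInt(date.substring(4, 6)) : 1;
            int day = date.length() >= 8 ? Integer.parseInt(date.substring(6, 8)) : 1;

            Calendar cal = Calendar.getInstance();
            cal.clear();
            cal.setLenient(false);       // reject values such as month 13
            cal.set(year, month - 1, day);
            cal.getTime();               // forces field validation
            return cal;
        } catch (IllegalArgumentException e) {
            // Covers NumberFormatException (a subclass) and out-of-range fields.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toCalendarOrNull("P8"));               // prints "null"
        System.out.println(toCalendarOrNull("D:20070430193647")
                .get(Calendar.YEAR));                             // prints "2007"
    }
}
```

Callers then have to check for null, which is the trade-off of this approach compared with throwing.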
> invalid date formats
> ---------------------
>
> Key: PDFBOX-465
> URL: https://issues.apache.org/jira/browse/PDFBOX-465
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Reporter: Sean Bridges
>
> This is with the latest from svn, Revision: 773978
> From a sample of 13304 PDF documents generated in a very wide variety of
> ways, I got 94 invalid date formats.
> It seems that all of these have a stack trace like this:
> Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
>     at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
>     at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
>     at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
>     at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
>     at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)
> Some examples of invalid dates are,
> 20070430193647+713'00'
> Tue Aug 21 10:35:22 2007
> Tuesday, November 04, 2008
> 200712172:2:3
> Unknown
> 20090319 200122
> 9:47 5/12/2008
> I don't think there is any hope of parsing all these date formats. It would
> be nice if this was not a fatal error, and the parser could continue without
> a creation date.
> Is the policy of pdfbox to be as forgiving as possible when reading pdf
> documents? Maybe toCalendar should return a new Calendar() if parsing fails,
> rather than throwing.
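
One way to read the suggestion in the quoted issue is a best-effort fallback: try a few known patterns and report "no date" when none match. The sketch below is hypothetical and is not PDFBox behavior. The class name, the method name and the two patterns are assumptions, picked to match two of the example strings above, and it returns null rather than a default calendar.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

// Hypothetical fallback for illustration; not PDFBox's behavior.
final class FallbackDateSketch {

    // Illustrative patterns only; real-world input is far more varied.
    private static final String[] PATTERNS = {
        "EEE MMM dd HH:mm:ss yyyy",   // Tue Aug 21 10:35:22 2007
        "EEEE, MMMM dd, yyyy"         // Tuesday, November 04, 2008
    };

    /** Returns the parsed date, or null if no pattern matches the whole string. */
    static Calendar parseOrNull(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date parsed = format.parse(trimmed, pos); // null on failure, no exception
            if (parsed != null && pos.getIndex() == trimmed.length()) {
                Calendar cal = Calendar.getInstance();
                cal.setTime(parsed);
                return cal;
            }
        }
        return null; // the caller continues without a creation date
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("Unknown"));            // prints "null"
        System.out.println(parseOrNull("Tuesday, November 04, 2008")
                .get(Calendar.YEAR));                          // prints "2008"
    }
}
```

Strings such as "Unknown" or "9:47 5/12/2008" simply yield null here, so extraction can proceed without the date.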
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.