[ 
https://issues.apache.org/jira/browse/PDFBOX-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709455#action_12709455
 ] 

Sean Bridges edited comment on PDFBOX-465 at 5/14/09 9:14 AM:
--------------------------------------------------------------

I'm also getting,

Caused by: java.io.IOException: Error: Invalid date format 'P8‘'
        at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:157)
        at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
        at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:784)
        at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
        at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:63)
        ... 2 more

The PDF is invalid; it contains:

/CreationDate (P8‘)

It looks like they are trying to UTF-16 encode the metadata properties for 
some reason.


Returning null instead of throwing when the length check fails fixes it:

                     date = date.substring( 2, date.length() );
                 }
                 if( date.length() < 4 )
-                {
-                    throw new IOException( "Error: Invalid date format '" + date + "'" );
+                {
+                    return null;
                 }
                 year = Integer.parseInt( date.substring( 0, 4 ) );
                 if( date.length() >= 6 )

I'm not attaching the diff as a file: my copy of the code has diverged so much 
that you wouldn't be able to apply it cleanly.  Most of my changes are 
trivial.
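
For anyone who cannot apply the diff, here is a minimal standalone sketch of the same lenient behaviour. The class and method names (LenientPdfDate, parseYearOrNull) are hypothetical and not part of PDFBox. The sketch assumes the "D:" prefix handling seen in the context lines above, and it only extracts the year rather than the full date.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

public class LenientPdfDate {
    /**
     * Hypothetical helper, not PDFBox API: returns a Calendar set to
     * January 1 of the year found in a PDF-style date string, or null
     * when the string is too short or the year is not numeric.
     */
    public static Calendar parseYearOrNull(String date) {
        if (date == null) {
            return null;
        }
        date = date.trim();
        // PDF dates normally start with "D:"; strip it if present.
        if (date.startsWith("D:")) {
            date = date.substring(2);
        }
        // Too short to hold a four-digit year: return null instead of throwing.
        if (date.length() < 4) {
            return null;
        }
        try {
            int year = Integer.parseInt(date.substring(0, 4));
            return new GregorianCalendar(year, Calendar.JANUARY, 1);
        } catch (NumberFormatException e) {
            // Non-numeric year, e.g. "Unknown": treat as "no date available".
            return null;
        }
    }
}
```

A caller then has to treat a null creation date as "unknown" rather than as an error.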

  
> invalid date formats 
> ---------------------
>
>                 Key: PDFBOX-465
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-465
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>            Reporter: Sean Bridges
>
> This is with the latest from svn, Revision: 773978
> From a sample of 13304 PDF documents generated in a very wide variety of 
> ways, I got 94 invalid date formats.
> All of them seem to have this stack trace:
> Caused by: java.io.IOException: Error converting date:Friday, July 11, 2008
>       at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:240)
>       at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:120)
>       at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:783)
>       at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:218)
>       at message_analyzer.extractor.PDFExtractor.getContent(PDFExtractor.java:50)
> Some examples of invalid dates are,
> 20070430193647+713'00'
> Tue Aug 21 10:35:22 2007
> Tuesday, November 04, 2008
> 200712172:2:3 
> Unknown
> 20090319 200122
> 9:47 5/12/2008
> I don't think there is any hope of parsing all these date formats.  It would 
> be nice if this was not a fatal error, and the parser could continue without 
> a creation date. 
> Is the policy of pdfbox to be as forgiving as possible when reading pdf 
> documents?  Maybe toCalendar should return a new Calendar() if parsing fails, 
> rather than throwing.
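
One way to picture the non-fatal behaviour asked for above is a fallback parser that tries a few known patterns and returns null when none match. This is only a sketch: FallbackDateParser and parseOrNull are hypothetical names, not PDFBox API, and the three patterns are assumptions chosen to match some of the sample strings listed in the report. Ambiguous inputs such as "9:47 5/12/2008" are deliberately left unhandled.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

public class FallbackDateParser {
    // Assumed patterns covering a few of the malformed dates seen in the wild.
    private static final String[] PATTERNS = {
        "EEE MMM dd HH:mm:ss yyyy",   // Tue Aug 21 10:35:22 2007
        "EEEE, MMMM dd, yyyy",        // Tuesday, November 04, 2008
        "yyyyMMdd HHmmss"             // 20090319 200122
    };

    /** Returns the parsed date, or null if no pattern matches. */
    public static Calendar parseOrNull(String text) {
        if (text == null) {
            return null;
        }
        String trimmed = text.trim();
        for (String pattern : PATTERNS) {
            try {
                // SimpleDateFormat is not thread-safe, so build one per attempt.
                SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
                Date parsed = format.parse(trimmed);
                Calendar calendar = Calendar.getInstance();
                calendar.setTime(parsed);
                return calendar;
            } catch (ParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        return null; // Unparseable: let the caller continue without a date.
    }
}
```

Returning null keeps "no usable date" distinguishable from a real date, which a default Calendar value would not.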

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
