[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

Thomas Chojecki (JIRA) Tue, 10 Dec 2013 00:10:43 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844070#comment-13844070
 ]


Thomas Chojecki commented on PDFBOX-1792:
-----------------------------------------

Some files cause parsing exceptions. At first I did not know whether my project 
was misconfigured. After checking the offsets at which the parser stops working, 
I saw that some files are broken. The first one has only garbage after %%EOF. 
I'm at work, so I can't give you much information about the exact files and 
stack traces.
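
For reference, trailing garbage after %%EOF can be spotted with a small 
standalone check like the one below. This is only a sketch and not part of 
PDFBox: the class name EofCheck and the method trailingBytesAfterEof are made 
up for illustration, and it assumes the last "%%EOF" in the file is the real 
end-of-file marker (garbage that itself contains "%%EOF" would fool it).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
class EofCheck {

    /**
     * Counts the non-whitespace bytes that follow the last "%%EOF" marker.
     * Returns -1 if the data contains no "%%EOF" marker at all.
     */
    static int trailingBytesAfterEof(byte[] data) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        // Scan backwards so that the last marker in the file is found first.
        for (int i = data.length - marker.length; i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                int count = 0;
                for (int k = i + marker.length; k < data.length; k++) {
                    byte b = data[k];
                    // A trailing line break or blank after %%EOF is harmless.
                    if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                        count++;
                    }
                }
                return count;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java EofCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": " + trailingBytesAfterEof(data)
                + " non-whitespace byte(s) after the last %%EOF");
    }
}
```

A result greater than zero for a test file would suggest that the file itself 
is damaged rather than the project setup.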

Maybe we are not talking about the same test, or in your environment the test 
can't find any files? Can you check whether the file array contains at least 
one test file?

File dir = new File("src/test/resources/input");
for (File f : dir.listFiles()){
  if (f.getName().toLowerCase().endsWith(".pdf")){
    testSingleFileEquality(f);
  }
}
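
To check whether any test files are picked up at all, something along these 
lines could be run on its own. It is a sketch, not code from the PDFBox test 
suite: the class PdfTestFileListing and the method listPdfs are hypothetical 
names, and the default directory is simply the one used in the loop above. 
Note that File.listFiles() returns null when the directory does not exist, 
which the sketch treats the same as "no files".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the PDFBox test suite.
class PdfTestFileListing {

    /** Returns all regular files in dir whose name ends with ".pdf" (case-insensitive). */
    static List<File> listPdfs(File dir) {
        List<File> result = new ArrayList<File>();
        // listFiles() returns null if dir is missing or is not a directory.
        File[] files = dir.listFiles();
        if (files == null) {
            return result;
        }
        for (File f : files) {
            if (f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "src/test/resources/input");
        List<File> pdfs = listPdfs(dir);
        if (pdfs.isEmpty()) {
            System.out.println("No PDF test files found in " + dir.getAbsolutePath());
        } else {
            for (File f : pdfs) {
                System.out.println(f.getName());
            }
        }
    }
}
```

If this prints "No PDF test files found", the test passes trivially because 
testSingleFileEquality is never called.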

Additionally, I can't commit the three test files from the archive. See my mail 
on the dev mailing list.

> Different metadata extracted with NonSequentialPDFParser vs classic parser on 
> some documents
> --------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1792
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1792
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.3
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: PDFBOX-1792.tar.gz, testPDF_acroForm2.pdf
>
>
> The traditional parser is able to extract metadata from a test document from 
> TIKA-738.  The NonSequentialPDFParser is not able to extract metadata from 
> that file.  Another file from the Tika test suite has metadata that can be 
> extracted by the NonSequentialPDFParser but not by classic. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
