[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-10 Thread Thomas Chojecki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844070#comment-13844070
 ] 

Thomas Chojecki commented on PDFBOX-1792:

Some files cause parsing exceptions. At first I did not know whether my project was 
misconfigured. After checking the offsets at which the parser stopped working, I 
saw that some files are broken. The first one has only garbage after %%EOF. I'm 
at work, so I can't give you much information about the exact files and 
stack traces.
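
One way to check a file for this kind of breakage without a PDF library is to count the bytes that follow the last %%EOF marker. The following is only a minimal sketch using the JDK alone; the class and method names (EofCheck, trailingBytesAfterEof) are hypothetical and not part of PDFBox, and it assumes that only PDF whitespace after the marker is harmless.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox.
class EofCheck {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the number of non-whitespace bytes after the last %%EOF marker,
     * or -1 if the data contains no marker at all.
     */
    static int trailingBytesAfterEof(byte[] data) {
        // Scan backwards so that the last marker in the file is found first.
        for (int start = data.length - EOF_MARKER.length; start >= 0; start--) {
            if (!markerAt(data, start)) {
                continue;
            }
            int count = 0;
            for (int i = start + EOF_MARKER.length; i < data.length; i++) {
                if (!isPdfWhitespace(data[i])) {
                    count++;
                }
            }
            return count;
        }
        return -1;
    }

    private static boolean markerAt(byte[] data, int start) {
        for (int i = 0; i < EOF_MARKER.length; i++) {
            if (data[start + i] != EOF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    // The six whitespace characters defined by the PDF format.
    private static boolean isPdfWhitespace(byte b) {
        return b == 0 || b == '\t' || b == '\n' || b == '\f' || b == '\r' || b == ' ';
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + trailingBytesAfterEof(data));
        }
    }
}
```

A result greater than zero only says that something follows the marker; whether a given parser tolerates it is a separate question.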

Maybe we are not talking about the same test, or in your environment the test 
can't find any files? Can you check whether the file array contains at least one 
test file?

// Run the equality check on every PDF in the test input directory.
File dir = new File("src/test/resources/input");
for (File f : dir.listFiles()) {
  if (f.getName().toLowerCase().endsWith(".pdf")) {
    testSingleFileEquality(f);
  }
}
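
To check whether the loop above sees any files at all, the listing can be separated from the test call. File.listFiles() returns null when the path is not a directory, and an empty directory lets the loop finish without testing anything. This is only a sketch; the class and method names (PdfInputs, listPdfs) are hypothetical and do not exist in the PDFBox test suite.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the PDFBox tests.
class PdfInputs {
    /**
     * Returns the .pdf files directly inside dir, sorted by path.
     * The list is empty if dir is missing, unreadable, or holds no PDFs.
     */
    static List<File> listPdfs(File dir) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles(); // null if dir is not a directory
        if (children == null) {
            return result;
        }
        Arrays.sort(children);
        for (File f : children) {
            if (f.isFile() && f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "src/test/resources/input");
        List<File> pdfs = listPdfs(dir);
        if (pdfs.isEmpty()) {
            System.out.println("No PDF files found in " + dir.getAbsolutePath());
        }
        for (File f : pdfs) {
            System.out.println(f.getName());
        }
    }
}
```

In a JUnit test, asserting that the returned list is not empty before the loop would make a missing input directory fail visibly instead of passing without testing anything.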

Additionally, I can't commit the three test files from the archive. See my mail 
on the dev mailing list.

 Different metadata extracted with NonSequentialPDFParser vs classic parser on 
 some documents
 

 Key: PDFBOX-1792
 URL: https://issues.apache.org/jira/browse/PDFBOX-1792
 Project: PDFBox
  Issue Type: Bug
  Components: PDModel
Affects Versions: 1.8.3
Reporter: Tim Allison
Priority: Minor
 Attachments: PDFBOX-1792.tar.gz, testPDF_acroForm2.pdf


 The traditional parser is able to extract metadata from a test document from 
 TIKA-738.  The NonSequentialPDFParser is not able to extract metadata from 
 that file.  Another file from the Tika test suite has metadata that can be 
 extracted by the NonSequentialPDFParser but not by classic. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844104#comment-13844104
 ] 

Andreas Lehmkühler commented on PDFBOX-1792:


The test case you are talking about wasn't there in the first place. You added 
it when disabling it. Have a look at revision 1458423, before your check-in:

http://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/test/java/org/apache/pdfbox/pdmodel/TestPDDocumentInformation.java?revision=1458423&view=markup

The issue only exists in your local environment. Otherwise the Jenkins build 
would have failed, but it didn't.

IMO you should revert your changes, and once the issue with the other PDF and 
its parsing is solved, we should (re)add the test case and the sample PDF as 
well. But let's do that in the trunk first.



[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844041#comment-13844041
 ] 

Andreas Lehmkühler commented on PDFBOX-1792:


Hmmm, why did you disable the test? Everything works fine for me.
