[
https://issues.apache.org/jira/browse/TIKA-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith R. Bennett updated TIKA-16:
---------------------------------
Attachment: testPDF.PDF
Here is a sample PDF file I found in a nutch directory.
> Issues with data files used for testing by TestParsers.
> -------------------------------------------------------
>
> Key: TIKA-16
> URL: https://issues.apache.org/jira/browse/TIKA-16
> Project: Tika
> Issue Type: Bug
> Reporter: Keith R. Bennett
> Attachments: testPDF.PDF
>
>
> The TestParsers class requires that the following test files be available:
> testPDF.PDF
> testTXT.TXT
> testRTF.RTF
> textXML.XML
> testPPT.PPT
> testWORD.doc
> testEXCEL.xls
> testOO2.odt
> testHTML.html
> Only the following are provided:
> testHTML.html
> testRTF.rtf
> testTXT.txt
> testXML.xml
> Issue #1: When specifying the file names in the source code and for the files
> themselves, we should make sure that they are equal case-sensitively. My
> personal preference is for lower case extensions, but in any case they should
> be consistent. Therefore, where necessary, we need to rename the file or
> change the source code so that they match. (This is the case for *.rtf,
> *.txt, *.xml above.)
> Issue #2: The missing files need to be added to the repository so that the
> test does not fail. A minimal file would suffice for the short term, but
> ultimately it would be nice to have files that exercise the parsers to the
> fullest.
> Issue #3: We need to agree on a directory. The source code currently
> specifies that they should be in src/test/resources/testFiles, but in the
> respository they are in src/test/resources.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.