PDF Content Type seen as application/rdf+xml not application/pdf
----------------------------------------------------------------
Key: TIKA-363
URL: https://issues.apache.org/jira/browse/TIKA-363
Project: Tika
Issue Type: Bug
Affects Versions: 0.5
Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9,
tika-app-0.5.jar, Eclipse 4.2, TikaIndexer.java from the Lucene in Action,
Second Edition source code
Reporter: Tim Reynolds
Priority: Minor

I am using TikaIndexer.java from the source code of Lucene in Action, Second
Edition, to index PDF files. Most PDF files work fine, as verified with Luke
(0.9.9), but some files show a content type of application/rdf+xml instead of
application/pdf, and thus show no metadata in Luke.

The PDF files that show application/rdf+xml had been opened in Adobe Acrobat
Pro 8. Highlights, bookmarks, and notes were added to the files; this was done
several times, with many saves. Acrobat can read these files without problems.
The original PDFs show application/pdf; the modified files show
application/rdf+xml.

If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness.
Both the good and the "bad" files have
0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a %PDF-1.6.%......
for the first line, but the "bad" file doesn't have another 0d0a (CR+LF) until
0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e et end="w"?>..en
Up until that point I do see some 0d (CR) bytes but no CR+LF. It may be that
something is getting confused because it sees this very long line. Why the
file stops using CR+LF I don't know. I assume this confusion then leads Tika
to guess that this is an rdf+xml file.
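
To make the line-ending observation above easier to reproduce, here is a
minimal Java sketch. The class and method names are hypothetical and are not
part of Tika or the Lucene in Action sources. It uses java.nio.file, so it
assumes a newer JDK than the 1.5 listed in the environment. It only checks for
the %PDF- signature and lists the offset of every CR+LF pair, printed in hex so
the values can be compared with the dump offsets quoted above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tika or TikaIndexer.java.
final class PdfLineEndings {

    // The bytes 25 50 44 46 2d, i.e. "%PDF-", as seen at offset 0 in both files.
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the "%PDF-" signature. */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the offset of the CR byte of every CR+LF (0d 0a) pair. */
    static List<Integer> crlfOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == 0x0d && data[i + 1] == 0x0a) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfLineEndings <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("PDF header present: " + startsWithPdfHeader(data));
        // Print offsets as 7-digit hex to match the dump format above.
        for (int offset : crlfOffsets(data)) {
            System.out.println(String.format("%07x", offset));
        }
    }
}
```

Running this on one good and one "bad" file and comparing the gap between the
first two offsets would confirm whether the modified files really contain the
long stretch without CR+LF. It says nothing about how Tika itself decides on
the content type.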

I see the following bug in Tika: "Mime type application/rdf+xml not correctly
detected" [#TIKA-309], but it says it is fixed in 0.5, which I am using.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.