[
https://issues.apache.org/jira/browse/TIKA-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jukka Zitting resolved TIKA-363.
--------------------------------
Resolution: Duplicate
Fix Version/s: 0.6
Assignee: Jukka Zitting
I believe the RDF pattern is being triggered by the embedded XMP metadata
included in the PDF file. I tested this with Tika 0.6 where the file is
correctly detected as application/pdf, so I believe the problem has already
been solved as a part of another issue.
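
For illustration, here is a minimal sketch of how an embedded XMP packet can
cause this kind of misdetection, and how checking the PDF header first avoids
it. This is not Tika's actual detection code. The class name, the helper
methods, the "<rdf:RDF" pattern and the 8192-byte search window are all
assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika's implementation.
class MagicPrioritySketch {

    // True if `pattern` occurs in `data` starting exactly at `offset`.
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // True if `pattern` starts at any offset from 0 up to `limit` inclusive.
    static boolean containsWithin(byte[] data, byte[] pattern, int limit) {
        int last = Math.min(limit, data.length - pattern.length);
        for (int i = 0; i <= last; i++) {
            if (matchesAt(data, i, pattern)) {
                return true;
            }
        }
        return false;
    }

    // The anchored "%PDF-" header is checked before the floating RDF pattern,
    // so RDF markup embedded further into a PDF cannot win.
    static String detect(byte[] data) {
        byte[] pdf = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] rdf = "<rdf:RDF".getBytes(StandardCharsets.US_ASCII);
        if (matchesAt(data, 0, pdf)) {
            return "application/pdf";
        }
        if (containsWithin(data, rdf, 8192)) { // assumed search window
            return "application/rdf+xml";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // A PDF header followed by XMP-style RDF markup.
        byte[] sample = "%PDF-1.6\r<x:xmpmeta><rdf:RDF></rdf:RDF></x:xmpmeta>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints application/pdf
    }
}
```

If the two checks in detect() were swapped, the same sample would be reported
as application/rdf+xml, which matches the symptom described in the report.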
> PDF Content Type seen as application/rdf+xml not application/pdf
> ----------------------------------------------------------------
>
> Key: TIKA-363
> URL: https://issues.apache.org/jira/browse/TIKA-363
> Project: Tika
> Issue Type: Bug
> Affects Versions: 0.5
> Environment: JDK 1.5, Windows XP, Adobe Acrobat Pro 8, Luke 0.9.9,
> tika-app-0.5.jar, Eclipse 4.2, TikaIndexer.java from the Lucene In Action,
> Second Edition source code
> Reporter: Tim Reynolds
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.6
>
> Attachments: TikaData.zip
>
>
> I am using TikaIndexer.java from the source code of Lucene In Action, Second
> Edition, to index PDF files. Most PDF files work fine, as verified by Luke
> (0.9.9), but some files show a content type of application/rdf+xml instead of
> application/pdf, and thus show no metadata in Luke.
>
> The PDF files that show application/rdf+xml had been opened in Adobe Acrobat
> Pro 8. Highlights, bookmarks and notes were added to the files; this was done
> several times, with many saves. Acrobat can read these files without problems.
> The original PDFs show application/pdf; the modified files show
> application/rdf+xml.
>
> If I open the PDF files in my editor (Vim), I do see some CR+LF strangeness.
> Both the good and the "bad" files have
> 0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a %PDF-1.6.%......
> for the first line, but the "bad" file doesn't have another 0d0a until
> 0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e et end="w"?>..en
> Up until that point I do see some 0d (CR) bytes but no CR+LF. It may be that
> something is getting confused because it sees this very long line. Why the
> file stops using CR+LF I don't know. I assume this confusion then leads Tika
> to guess that this is an rdf+xml file.
>
> I see the following bug in Tika: "Mime type application/rdf+xml not correctly
> detected" [#TIKA-309], but it says it is fixed in 0.5, which I am using.
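
The CR+LF observation in the report can be checked without a hex editor. The
following is a minimal sketch that lists the offsets of every CR+LF pair in a
byte array. The class and method names are hypothetical, and the sample bytes
are the first 16 bytes taken from the hex dump quoted in the report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting line endings in a binary file.
class CrLfScanSketch {

    // Returns the offset of each 0x0d byte that is directly followed by 0x0a.
    static List<Integer> crlfOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == 0x0d && data[i + 1] == 0x0a) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    // Converts int values in the range 0..255 to bytes, to keep the sample readable.
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a, as shown in the report.
        byte[] header = bytes(0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x36,
                              0x0d, 0x25, 0xe2, 0xe3, 0xcf, 0xd3, 0x0d, 0x0a);
        // The lone CR at offset 8 is skipped; only the pair at offset 14 counts.
        System.out.println(crlfOffsets(header)); // prints [14]
    }
}
```

To run this against a real file, the array could be loaded with
java.nio.file.Files.readAllBytes instead of being hard-coded.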