[ https://issues.apache.org/jira/browse/TIKA-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783431#comment-17783431 ]
Sam Stephens commented on TIKA-4167:
------------------------------------

Thanks Tim. I don't have a use case where I need this behavior. But I had a test for the invariant that when CONTENT_TYPE_USER_OVERRIDE is provided, the "Content-Type" in the returned metadata reflects it, and I was surprised to see that invariant violated.

I think this is probably a documentation issue. Reading https://tika.apache.org/2.9.1/detection.html#The_Detector_Interface, I didn't understand that CONTENT_TYPE_USER_OVERRIDE only applies to detection, and that the parser the override selects is then free to choose the content type it thinks is most appropriate. You could consider making this a little clearer in that documentation.

For my specific use case, regardless of whether my incoming file has an explicit content type to use for CONTENT_TYPE_USER_OVERRIDE or I'm using auto-detection, I always want to detect PDFs as application/pdf; I have no interest in or use for the subtypes. But I think it's reasonable for me to implement that behavior by post-processing the returned Tika metadata (basically, if Content-Type is application/illustrator and dc:format starts with application/pdf, I use application/pdf as the content type).
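For illustration, the post-processing rule described above could look roughly like the sketch below. It is only a sketch under stated assumptions: the metadata is modelled as a plain Map<String, String> rather than Tika's Metadata class so that it runs without Tika on the classpath, the class and method names (ContentTypeNormalizer, normalizedContentType) are hypothetical, and the sample dc:format value is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: collapses the Illustrator subtype
// back to application/pdf after parsing.
public class ContentTypeNormalizer {
    static final String CONTENT_TYPE = "Content-Type";
    static final String DC_FORMAT = "dc:format";

    // Returns application/pdf when the parser reported application/illustrator
    // but dc:format says the file is a PDF; otherwise returns the reported
    // Content-Type unchanged (possibly null if none was set).
    static String normalizedContentType(Map<String, String> metadata) {
        String contentType = metadata.get(CONTENT_TYPE);
        String format = metadata.get(DC_FORMAT);
        if ("application/illustrator".equals(contentType)
                && format != null
                && format.startsWith("application/pdf")) {
            return "application/pdf";
        }
        return contentType;
    }

    public static void main(String[] args) {
        // Assumed sample values standing in for the metadata Tika returns.
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "application/illustrator");
        metadata.put(DC_FORMAT, "application/pdf; version=1.5");
        System.out.println(normalizedContentType(metadata)); // prints application/pdf
    }
}
```

With a real Tika Metadata object, the same check would presumably read the two values with metadata.get("Content-Type") and metadata.get("dc:format") after the parse completes.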

> CONTENT_TYPE_USER_OVERRIDE doesn't force content type for application/illustrator files
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-4167
>                 URL: https://issues.apache.org/jira/browse/TIKA-4167
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1
>            Reporter: Sam Stephens
>            Priority: Minor
>
> When I parse a file using AutoDetectParser, with Metadata set to
> {TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE: "application/pdf"}
> and parse [a PDF-like Illustrator file|https://github.com/apache/tika/blob/78be82565df4cc3bbc88308be8d686019a10b899/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_AdobeIllustrator.pdf],
> the "Content-Type" in the returned metadata is "application/illustrator",
> not "application/pdf".
> I think this is happening because "application/illustrator" is a subtype of
> "application/pdf".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)