[ 
https://issues.apache.org/jira/browse/TIKA-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783431#comment-17783431
 ] 

Sam Stephens commented on TIKA-4167:
------------------------------------

Thanks Tim. I don't have a use case where I need this behavior. But I had a 
test for the invariant that when CONTENT_TYPE_USER_OVERRIDE is provided, the 
"Content-Type" in the returned metadata reflects that, and I was surprised to 
see that invariant violated.

I think this is probably a documentation issue. Reading 
https://tika.apache.org/2.9.1/detection.html#The_Detector_Interface, I didn't 
understand that CONTENT_TYPE_USER_OVERRIDE only applies to detection, and that 
the parser the override selects is then free to choose the content type it 
thinks is most appropriate. You could consider making this a little clearer in 
that documentation.

For my specific use case, regardless of whether my incoming file has an 
explicit content type to use for CONTENT_TYPE_USER_OVERRIDE or whether I'm 
using auto-detection, I always want to detect PDFs as application/pdf. I have 
no interest in or use for the subtypes. But I think it's reasonable for that 
behavior to be something I implement by post-processing the returned Tika 
metadata (basically, if Content-Type is application/illustrator and dc:format 
starts with application/pdf, I use application/pdf as the content type).
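
A minimal sketch of that post-processing rule, assuming the returned metadata 
has already been copied into a plain Map of names to values. The class and 
method names are hypothetical and not part of Tika; only the "Content-Type" 
and "dc:format" keys and the two media types come from the description above.

```java
import java.util.Map;

// Hypothetical helper, not part of Tika: collapses the Illustrator subtype
// back to application/pdf when dc:format says the file is a PDF.
final class PdfContentTypeNormalizer {
    static final String PDF = "application/pdf";
    static final String ILLUSTRATOR = "application/illustrator";

    // Assumes the metadata was copied into a simple name -> value map.
    static String effectiveContentType(Map<String, String> metadata) {
        String contentType = metadata.get("Content-Type");
        String format = metadata.get("dc:format");
        if (ILLUSTRATOR.equals(contentType)
                && format != null
                && format.startsWith(PDF)) {
            return PDF;
        }
        // Anything else is passed through unchanged (may be null if absent).
        return contentType;
    }
}
```

Any other content type, or an Illustrator file whose dc:format does not start 
with application/pdf, is returned as-is, so the rule only affects the one case 
described above.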

> CONTENT_TYPE_USER_OVERRIDE doesn't force content type for 
> application/illustrator files
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-4167
>                 URL: https://issues.apache.org/jira/browse/TIKA-4167
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.1
>            Reporter: Sam Stephens
>            Priority: Minor
>
> When I parse a file using AutoDetectParser, with Metadata set to
> {TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE: "application/pdf"}
> and parse [a PDF-like Illustrator 
> file|https://github.com/apache/tika/blob/78be82565df4cc3bbc88308be8d686019a10b899/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_AdobeIllustrator.pdf],
>  the "Content-Type" in the returned metadata is "application/illustrator", 
> not "application/pdf".
> I think this is happening because "application/illustrator" is a subtype of 
> "application/pdf".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
