[ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ]
Tim Allison commented on TIKA-1162:
-----------------------------------

Dear Colleague,
I'm on paternity leave. Will be back part time on October 14.
Best,
Tim

> content-type/charset problem with RFC822Parser
> ----------------------------------------------
>
>                 Key: TIKA-1162
>                 URL: https://issues.apache.org/jira/browse/TIKA-1162
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Maciej Lizewski
>
> RFC822Parser (MIME mail) uses MailContentHandler, which internally uses
> AutoDetectParser to handle each MIME part. The problem is that
> MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and
> CONTENT_ENCODING metadata properly, and passes this metadata to the
> AutoDetectParser::parse method. But that method ignores those headers and
> overwrites them:
>         MediaType type = this.getDetector().detect(tis, metadata);
>         metadata.set(Metadata.CONTENT_TYPE, type.toString());
> This leads to some additional recursion loops (the Detector returns the
> message/rfc822 MIME type instead of the proper MIME type for the current
> part), and eventually it somehow exits the loop, but without the proper
> content-type and content-encoding headers.
> My proposal is to add a check in AutoDetectParser::parse for whether the
> metadata already contains CONTENT_TYPE and, in that case, not override it.
> If this is not valid behavior in general, then RFC822Parser should use a
> custom parser in MailContentHandler that respects the passed content-type.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
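
The check proposed in the issue description could look roughly like the sketch below. It is not Tika code: a plain `Map` stands in for Tika's `Metadata`, a `Function` stands in for the `Detector`, and `ContentTypeGuardSketch` and `resolveContentType` are hypothetical names used only for illustration. The stand-in detector always answers `message/rfc822` to mimic the behavior reported above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: a Map replaces Tika's Metadata, a Function replaces the Detector.
class ContentTypeGuardSketch {

    // Same header name that Tika's Metadata.CONTENT_TYPE refers to.
    static final String CONTENT_TYPE = "Content-Type";

    // Hypothetical helper: keep a content type already supplied by the caller,
    // and fall back to detection only when none was supplied.
    static String resolveContentType(Map<String, String> metadata,
                                     Function<byte[], String> detector,
                                     byte[] body) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared; // respect the MIME part header, do not overwrite it
        }
        String detected = detector.apply(body);
        metadata.put(CONTENT_TYPE, detected);
        return detected;
    }

    public static void main(String[] args) {
        // Stand-in detector that misreports every part as a whole mail message.
        Function<byte[], String> detector = bytes -> "message/rfc822";

        // A part whose header was already read by the mail handler.
        Map<String, String> withHeader = new HashMap<>();
        withHeader.put(CONTENT_TYPE, "text/html; charset=ISO-8859-2");
        System.out.println(resolveContentType(withHeader, detector, new byte[0]));

        // A part with no declared type: detection still runs.
        Map<String, String> withoutHeader = new HashMap<>();
        System.out.println(resolveContentType(withoutHeader, detector, new byte[0]));
    }
}
```

Whether a declared type should always win over detection is exactly the open question raised in the issue; a declared header can itself be wrong, so this guard is one possible policy rather than a settled fix.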