[ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939478#comment-16939478 ]

Tim Allison commented on NUTCH-2457:
------------------------------------

The issue is that the AutoDetectParser automatically/silently adds itself as a parser to the ParseContext. When an embedded document is parsed, there's a lookup for the embedded parser in the ParseContext. Because you weren't using the AutoDetectParser, there is no parser in ParseContext, and the embedded documents are not being parsed.

So, you have two options (maybe more...):

1) use the AutoDetectParser; set https://tika.apache.org/1.17/api/org/apache/tika/metadata/TikaCoreProperties.html#CONTENT_TYPE_OVERRIDE to the mime, and you'll avoid a second detection for the container file
2) Use your current method, but add a cached AutoDetectParser to the ParseContext

> Embedded documents likely not correctly parsed by Tika
> ------------------------------------------------------
>
>                 Key: NUTCH-2457
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2457
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of
> requesting a mime-specific parser for each file will fail to parse embedded
> files, e.g.
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch
> up and running in my dev environment.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
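
The context lookup described in the comment can be illustrated with a toy model in plain Java. This is only a sketch of the mechanism, not Tika code: every class here (ToyParser, ToyDoc, ToyContext, ToyContainerParser, ToyAutoDetectParser, ToyDemo) is a hypothetical stand-in, and the real parsers do far more. It assumes, as the comment says, that embedded documents are handed to whatever parser is registered in the context, and that they are skipped when none is registered.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins, NOT Tika classes: every name below is hypothetical.
interface ToyParser {
    // Returns the text extracted from doc, consulting ctx for embedded documents.
    String parse(ToyDoc doc, ToyContext ctx);
}

// A document with its own text plus zero or more embedded documents.
final class ToyDoc {
    final String text;
    final List<ToyDoc> embedded;

    ToyDoc(String text, ToyDoc... embedded) {
        this.text = text;
        this.embedded = List.of(embedded);
    }
}

// Type-keyed lookup table, loosely modeled on the idea of a parse context.
final class ToyContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    // Returns null when nothing was registered under key.
    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

// Mime-specific parser: extracts the container's own text and hands embedded
// documents to whatever parser it finds in the context, if any.
final class ToyContainerParser implements ToyParser {
    public String parse(ToyDoc doc, ToyContext ctx) {
        StringBuilder out = new StringBuilder(doc.text);
        ToyParser embeddedParser = ctx.get(ToyParser.class);
        if (embeddedParser != null) {
            for (ToyDoc child : doc.embedded) {
                out.append(' ').append(embeddedParser.parse(child, ctx));
            }
        }
        // With no parser in the context, embedded documents are silently skipped.
        return out.toString();
    }
}

// Auto-detecting parser: registers itself in the context before delegating.
final class ToyAutoDetectParser implements ToyParser {
    private final ToyParser delegate = new ToyContainerParser();

    public String parse(ToyDoc doc, ToyContext ctx) {
        if (ctx.get(ToyParser.class) == null) {
            ctx.set(ToyParser.class, this);
        }
        return delegate.parse(doc, ctx);
    }
}

final class ToyDemo {
    static ToyDoc sample() { return new ToyDoc("outer", new ToyDoc("inner")); }

    // Current approach: mime-specific parser with an empty context.
    static String direct() {
        return new ToyContainerParser().parse(sample(), new ToyContext());
    }

    // Option 1: go through the auto-detecting parser.
    static String autoDetect() {
        return new ToyAutoDetectParser().parse(sample(), new ToyContext());
    }

    // Option 2: keep the mime-specific parser, but register a parser in the context.
    static String directWithRegisteredParser() {
        ToyContext ctx = new ToyContext();
        ctx.set(ToyParser.class, new ToyAutoDetectParser());
        return new ToyContainerParser().parse(sample(), ctx);
    }
}
```

In this model ToyDemo.direct() returns only the container's text, while both options also return the embedded text. In Tika itself, the registration step of option 2 would presumably go through ParseContext's set(Class, T) method with Parser.class as the key; the rest of the wiring in Nutch is not shown here.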