[jira] [Commented] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

Tim Allison (Jira) Fri, 27 Sep 2019 06:59:17 -0700


    [ https://issues.apache.org/jira/browse/NUTCH-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939478#comment-16939478 ]


Tim Allison commented on NUTCH-2457:
------------------------------------

The issue is that the AutoDetectParser silently adds itself as the parser in 
the ParseContext.  When an embedded document is encountered, the parser to use 
for it is looked up in the ParseContext.  Because you aren't using the 
AutoDetectParser, no parser is registered in the ParseContext, and the embedded 
documents are not parsed.

So, you have two options (maybe more...):

1) Use the AutoDetectParser, and set TikaCoreProperties.CONTENT_TYPE_OVERRIDE 
https://tika.apache.org/1.17/api/org/apache/tika/metadata/TikaCoreProperties.html#CONTENT_TYPE_OVERRIDE
 in the metadata to the mime type; that way you avoid a second detection pass 
on the container file.

2) Keep your current method, but add a cached AutoDetectParser to the 
ParseContext.
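
To make the lookup concrete, here is a minimal, dependency-free sketch of the 
mechanism.  It is a toy model, not Tika code: ToyParser, ToyDocument and 
ToyContext are hypothetical stand-ins, and the empty fallback is my assumption 
about why the failure is silent.  In real Tika, option 2 should come down to 
something like context.set(Parser.class, yourCachedAutoDetectParser) before you 
call parse().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EmbeddedParserLookupSketch {

    /** Hypothetical stand-in for a parser; NOT Tika's Parser interface. */
    interface ToyParser {
        List<String> parse(ToyDocument doc, ToyContext context);
    }

    /** Hypothetical document: its own text plus any embedded documents. */
    static final class ToyDocument {
        final String text;
        final List<ToyDocument> embedded;

        ToyDocument(String text, ToyDocument... embedded) {
            this.text = text;
            this.embedded = Arrays.asList(embedded);
        }
    }

    /** Toy type-keyed context, loosely modelled on the idea of ParseContext. */
    static final class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Fallback that extracts nothing (assumption: this is why nothing is reported). */
    static final ToyParser EMPTY = (doc, context) -> new ArrayList<>();

    /** Emits its own text, then hands each embedded document to the parser found in the context. */
    static final ToyParser CONTAINER = (doc, context) -> {
        List<String> out = new ArrayList<>();
        out.add(doc.text);
        ToyParser embeddedParser = context.get(ToyParser.class, EMPTY);
        for (ToyDocument child : doc.embedded) {
            out.addAll(embeddedParser.parse(child, context));
        }
        return out;
    };

    /** Current situation: a parser is called directly and the context is left empty. */
    static List<String> parseWithoutRegisteredParser(ToyDocument doc) {
        return CONTAINER.parse(doc, new ToyContext());
    }

    /** Option 2: register a parser in the context before parsing. */
    static List<String> parseWithRegisteredParser(ToyDocument doc) {
        ToyContext context = new ToyContext();
        context.set(ToyParser.class, CONTAINER);
        return CONTAINER.parse(doc, context);
    }

    public static void main(String[] args) {
        ToyDocument docx = new ToyDocument("outer",
                new ToyDocument("inner", new ToyDocument("innermost")));
        System.out.println(parseWithoutRegisteredParser(docx)); // [outer]
        System.out.println(parseWithRegisteredParser(docx));    // [outer, inner, innermost]
    }
}
```

With nothing registered, only the container's own text comes back and no error 
is raised.  Once a parser is registered, the same lookup recurses through every 
level of embedding.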

> Embedded documents likely not correctly parsed by Tika
> ------------------------------------------------------
>
>                 Key: NUTCH-2457
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2457
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 1.16
>
>
> While working on TIKA-2490, I think I found that Nutch's current method of 
> requesting a mime-specific parser for each file will fail to parse embedded 
> files, e.g. 
> https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx
> The fix should be straightforward, and I'll submit a PR once I can get Nutch 
> up and running in my dev environment. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
