Josh Burchard created TIKA-3422:
-----------------------------------

             Summary: Excluding both WMFParser and EMFParser causes wmf 
instances NOT to appear at all
                 Key: TIKA-3422
                 URL: https://issues.apache.org/jira/browse/TIKA-3422
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 1.26
            Reporter: Josh Burchard
         Attachments: tika-config_no_emf_or_wmf.xml, tika-config_no_wmf.xml

I was trying to exclude embedded WMF and EMF files from being parsed, but I 
noticed that when I do so, only the EMF files are reported by Tika in the 
returned /rmeta/text output.

As an experiment I created two tika-config.xml files. The first excludes only 
the WMFParser, and when my MSWord source doc is processed I see lines like 
this, as expected:

{{"Content-Type":"image/wmf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.EmptyParser"]}}

The EMF files, meanwhile, are found and parsed by the EMFParser:

{{"Content-Type":"image/emf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.EMFParser"]}}

The problem arises with the second config, which excludes both WMFParser and 
EMFParser. The WMF instances disappear from the output entirely, and only EMF 
instances are shown as being handled by the EmptyParser:

{{"Content-Type":"image/emf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.EmptyParser"]}}

I think that in the second case both types should be shown as being handled by 
the EmptyParser. I still want to know that the WMF files are in the container, 
even though I'm not parsing them.

P.S. For whatever reason, I can't upload the original Word doc that I'm testing 
with; Jira won't allow it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)