CompositeParser should indicate which parser was actually selected for parsing
------------------------------------------------------------------------------

                 Key: TIKA-674
                 URL: https://issues.apache.org/jira/browse/TIKA-674
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Andrzej Bialecki 


If several parsers support the same mime type and AutoDetectParser (or another 
CompositeParser) is used, the parse output does not indicate which of the 
alternative parsers was actually used. I think the name of the parser (its 
fully qualified class name?) should be added to the metadata.

Something like this trivial patch:

{code}
Index: tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
===================================================================
--- tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java (revision 1135167)
+++ tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java (working copy)
@@ -238,6 +238,7 @@
         try {
             TikaInputStream taggedStream = TikaInputStream.get(stream, tmp);
             TaggedContentHandler taggedHandler = new TaggedContentHandler(handler);
+            metadata.add("X-Parsed-By", parser.getClass().getName());
             try {
                 parser.parse(taggedStream, taggedHandler, metadata, context);
             } catch (RuntimeException e) {
{code}
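
To illustrate the effect of the patch without depending on Tika itself, here is a minimal self-contained sketch. Everything in it is a simplified stand-in: SimpleParser, SimpleCompositeParser, the two HTML parser classes and the plain Map used as metadata are hypothetical names invented for this example, not Tika classes. Only the "X-Parsed-By" key and the use of parser.getClass().getName() come from the patch. The rule that the later registration wins for a given mime type is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; none of these are Tika classes.
final class ParsedByDemo {

    // Metadata key proposed in the patch.
    static final String PARSED_BY = "X-Parsed-By";

    /** Hypothetical minimal parser contract (not org.apache.tika.parser.Parser). */
    interface SimpleParser {
        void parse(String input, Map<String, List<String>> metadata);
    }

    /** First of two hypothetical parsers that both claim text/html. */
    static final class FirstHtmlParser implements SimpleParser {
        @Override
        public void parse(String input, Map<String, List<String>> metadata) {
            // Real parsing omitted; only parser selection matters here.
        }
    }

    /** Second hypothetical parser for the same mime type. */
    static final class SecondHtmlParser implements SimpleParser {
        @Override
        public void parse(String input, Map<String, List<String>> metadata) {
            // Real parsing omitted; only parser selection matters here.
        }
    }

    /** Hypothetical composite: picks one parser per mime type and records which. */
    static final class SimpleCompositeParser {
        private final Map<String, SimpleParser> parsersByType = new HashMap<>();

        // Assumption of this sketch: a later registration replaces an earlier one.
        void register(String mimeType, SimpleParser parser) {
            parsersByType.put(mimeType, parser);
        }

        void parse(String mimeType, String input, Map<String, List<String>> metadata) {
            SimpleParser parser = parsersByType.get(mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + mimeType);
            }
            // Mirrors the patched line: append the selected parser's class name
            // to the metadata before delegating to it.
            metadata.computeIfAbsent(PARSED_BY, k -> new ArrayList<>())
                    .add(parser.getClass().getName());
            parser.parse(input, metadata);
        }
    }

    /** Registers two parsers for text/html and returns the recorded class names. */
    static List<String> parsedByForHtml() {
        SimpleCompositeParser composite = new SimpleCompositeParser();
        composite.register("text/html", new FirstHtmlParser());
        composite.register("text/html", new SecondHtmlParser());

        Map<String, List<String>> metadata = new HashMap<>();
        composite.parse("text/html", "<html></html>", metadata);
        return metadata.get(PARSED_BY);
    }

    public static void main(String[] args) {
        // Prints the class name of whichever parser was actually selected.
        System.out.println(parsedByForHtml());
    }
}
```

With the value recorded, a caller can tell from the metadata alone which of the competing parsers handled the document. Appending instead of overwriting matches the metadata.add(...) call in the patch, so nested composite parsers could each leave their own entry.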

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
