[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176652#comment-15176652
 ] 

Nick Burch commented on TIKA-1663:
----------------------------------

The other parser decorators are specified with options inside the parent 
parser, e.g. mime includes or excludes are decorators given as options to the 
main parser. In some ways, this is quite nice, as you do the main definition on 
the thing that'll do the work, then list the decorators after it.

One option, for the general case, would be to allow additional decorators to 
be listed there too, e.g. 
http://tika.apache.org/1.12/configuring.html#Configuring_Parsers becomes
{code}
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
      <decorator class="org.foo.bar.DecoratorWithEmojis"/>
      <decorator class="org.foo.bar.DecoratorWithHashing"/>
    </parser>
{code}
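
To make the ordering question concrete, here is a rough sketch of how a list of decorators might be applied around the main parser in the order they appear in the config. Everything in it is hypothetical: SketchParser is a stand-in interface, not org.apache.tika.parser.Parser, and named() is an illustrative decorator rather than anything in Tika. It assumes that decorators listed later wrap those listed earlier.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a parser; NOT org.apache.tika.parser.Parser.
interface SketchParser {
    void parse(String content, Map<String, String> metadata);
}

class DecoratorChainSketch {

    // Wraps the base parser with each decorator in config order,
    // so the last <decorator> listed ends up outermost (an assumption).
    static SketchParser decorate(SketchParser base,
                                 List<UnaryOperator<SketchParser>> decorators) {
        SketchParser current = base;
        for (UnaryOperator<SketchParser> decorator : decorators) {
            current = decorator.apply(current);
        }
        return current;
    }

    // Illustrative decorator: delegates first, then appends its own name
    // to the "order" metadata entry so the call order is visible.
    static UnaryOperator<SketchParser> named(String name) {
        return inner -> (content, metadata) -> {
            inner.parse(content, metadata);
            metadata.merge("order", name, (old, added) -> old + "," + added);
        };
    }

    public static void main(String[] args) {
        SketchParser base = (content, metadata) -> metadata.put("order", "base");
        SketchParser chained = decorate(base,
                Arrays.asList(named("emojis"), named("hashing")));

        Map<String, String> metadata = new HashMap<>();
        chained.parse("some document text", metadata);
        System.out.println(metadata.get("order"));
    }
}
```

With this wrapping order the base parser does its work first, then each decorator adds to the metadata in the order it was listed.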

For the specific case of the digester, it's a well-known thing, so we could 
give it custom tags. That would make things clearer, and would get round the 
parameter issue. One option is:
{code}
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <digest>MD5,SHA256</digest>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
{code}

The other option, which keeps it more in line with the mime includes/excludes, is:
{code}
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <digest>MD5</digest>
      <digest>SHA256</digest>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
{code}
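
For what it's worth, a config reader could probably accept both forms without much extra effort. The sketch below is only an illustration using the JDK's DOM parser, not Tika's actual config loading code; the class and method names (DigestConfigSketch, readDigests) are made up for this example. It collects every direct `<digest>` child of the parser element and splits each one on commas.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
class DigestConfigSketch {

    // Returns the digest algorithm names found in the direct <digest>
    // children of the root element, in document order, without duplicates.
    static List<String> readDigests(String parserXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            parserXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config", e);
        }

        List<String> algorithms = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE
                    || !"digest".equals(child.getNodeName())) {
                continue;
            }
            // Splitting on commas handles <digest>MD5,SHA256</digest>;
            // repeated <digest> tags are handled by the outer loop.
            for (String name : child.getTextContent().split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty() && !algorithms.contains(trimmed)) {
                    algorithms.add(trimmed);
                }
            }
        }
        return algorithms;
    }

    public static void main(String[] args) {
        String commaForm = "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<digest>MD5,SHA256</digest></parser>";
        String repeatedForm = "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<digest>MD5</digest><digest>SHA256</digest></parser>";
        System.out.println(readDigests(commaForm));
        System.out.println(readDigests(repeatedForm));
    }
}
```

Both forms yield the same list of algorithm names here, so the choice between them is mostly about which reads better in the config file.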

What do people think?

> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> -------------------------------------------------------------------
>
>                 Key: TIKA-1663
>                 URL: https://issues.apache.org/jira/browse/TIKA-1663
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to 
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
