[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172527#comment-15172527
 ] 

Thamme Gowda N commented on TIKA-1663:
--------------------------------------

[~chrismattmann] [~talli...@mitre.org] We need a SHA digest of the raw content 
for the MEMEX project.
I tried to enable the digesting parser by editing our config file:
{code}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DigestingParser">
            <parser class="org.apache.tika.parser.DefaultParser">
            </parser>
        </parser>
        .....
{code}

This doesn't work, for the obvious reason that we haven't specified which 
digest algorithm to use.
After checking 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java,
 I found that DigestingParser is a flexible framework that takes constructor 
arguments.

So, I propose two options:
1. We offer a few popular implementations, such as SHA and MD5 parsers, that 
don't need constructor args. This would let us activate them by editing the 
config XML file instead of source code.
2. We enhance the Tika configuration framework and these flexible parsers to 
accept runtime arguments, so that both flexibility and ease of use are 
preserved. For instance, if we could supply the digest algorithm name from the 
config file and let the DigestingParser use it when it is instantiated, we 
wouldn't need to edit the source code of applications.
{code}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DigestingParser">
            <args>
                <digest>MD5</digest>
            </args>
            <parser class="org.apache.tika.parser.DefaultParser">
            </parser>
        </parser>
        .....
{code}
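
To make option 2 concrete, here is a minimal sketch of the idea using only the 
JDK. It is not Tika's configuration API: the class ConfiguredDigestSketch and 
its methods are hypothetical, and the <args>/<digest> elements are the ones 
proposed above, which Tika may not understand today. It only shows the 
algorithm name coming from config and being handed to 
java.security.MessageDigest.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Tika.
class ConfiguredDigestSketch {

    /** Returns the text of the first <digest> element, or the fallback if none is present. */
    public static String algorithmFromConfig(String xml, String fallback) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("digest");
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : fallback;
    }

    /** Streams the raw bytes through the named algorithm and returns a lowercase hex digest. */
    public static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Config fragment in the shape proposed in this comment (an assumption).
        String config = "<parser class=\"org.apache.tika.parser.DigestingParser\">"
                + "<args><digest>MD5</digest></args></parser>";
        String algorithm = algorithmFromConfig(config, "SHA-256");
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(algorithm + ": " + hexDigest(new ByteArrayInputStream(raw), algorithm));
    }
}
```

Switching from MD5 to another algorithm would then be a one-line change in the 
config file, with no application code touched.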

I vote for option 2: it is slightly more work, but I feel it is the way to go.
I do not know whether Tika already supports option 2 by accepting runtime 
arguments from the config file.
I faced a similar issue with NamedEntityParser, but found a workaround by 
using system properties.

> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> -------------------------------------------------------------------
>
>                 Key: TIKA-1663
>                 URL: https://issues.apache.org/jira/browse/TIKA-1663
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to 
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
