[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-03-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187290#comment-15187290
 ] 

Tim Allison commented on TIKA-1663:
---

[~gagravarr], am I right that we cannot do this now:
{code}
 


 
{code}



> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> ---
>
> Key: TIKA-1663
> URL: https://issues.apache.org/jira/browse/TIKA-1663
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to 
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-03-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177851#comment-15177851
 ] 

Tim Allison commented on TIKA-1663:
---

Thank you, Nick.  I somewhat prefer the first option (once we add the parameter 
setting).  I'm hesitant to promote the DigestingParser (wrapper) to a special 
place, but I'm game if the community is.




[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-03-02 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176652#comment-15176652
 ] 

Nick Burch commented on TIKA-1663:
--

The other parser decorators are specified with options inside the parent 
parser, eg mime includes or excludes are decorators given as options to the 
main parser. In some ways, this is quite nice, as you do the main definition on 
the thing that'll do the work, then the decorators after.

One option, for the general case, would be to add additional decorators too, eg 
http://tika.apache.org/1.12/configuring.html#Configuring_Parsers becomes
{code}

  image/jpeg
  application/pdf
  
  
  

{code}

For the specific case of the digester, it's a well known thing, so we could 
give it custom tags. That would make things clearer, and would get round the 
parameter issue. One option is:
{code}

  image/jpeg
  application/pdf
  MD5,SHA256
  

{code}

The other, to keep it more in line with the mime includes/excludes, is:
{code}

  image/jpeg
  application/pdf
  MD5
  SHA256
  

{code}

What do people think?
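
To make the comma-separated variant concrete, here is a minimal sketch of how a value such as `MD5,SHA256` could be turned into digest names the JDK accepts. It uses only `java.security.MessageDigest`; the class `DigestListSketch` and the method `parseAlgorithms` are hypothetical names for illustration and are not part of Tika or its config loader.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class DigestListSketch {

    /**
     * Splits a comma-separated config value into JDK digest names.
     * Hypothetical helper: Tika's config loader is not involved here.
     */
    static List<String> parseAlgorithms(String configValue) {
        List<String> names = new ArrayList<>();
        if (configValue == null) {
            return names;
        }
        for (String raw : configValue.split(",")) {
            String name = raw.trim().toUpperCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue; // tolerate stray commas and blanks
            }
            // Accept the short spelling used in the thread: SHA256 -> SHA-256
            if (name.matches("SHA\\d+")) {
                name = "SHA-" + name.substring(3);
            }
            try {
                // Fail early if the JDK has no provider for this name
                MessageDigest.getInstance(name);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException(
                        "Unknown digest algorithm: " + raw.trim(), e);
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parseAlgorithms("MD5,SHA256"));
    }
}
```

With one element per algorithm (the second variant), the same validation would apply to each element's text, and the splitting step would go away.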



[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173179#comment-15173179
 ] 

Chris A. Mattmann commented on TIKA-1663:
-

thanks [~thammegowda] and [~talli...@apache.org] :)



[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172679#comment-15172679
 ] 

Thamme Gowda N commented on TIKA-1663:
--

Yes, I would like to work on TIKA-1508, given a timeline of 6 to 8 days from now.




[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172626#comment-15172626
 ] 

Tim Allison commented on TIKA-1663:
---

In tika-batch/tika-app, I did a not-so-great-workaround with an interface for a 
ParserFactory, and then I hardcoded a parser factory that wrapped a 
DigestingParser around the AutoDetectParser, and then wrapped all of that in a 
RecursiveParserWrapper...not happy with that and look forward to being able to 
configure this via the config file.
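
For readers unfamiliar with the wrapping pattern described here, the following is a minimal stdlib-only sketch of a digesting decorator around a delegate parser. Everything in it is an assumption made for illustration: `SimpleParser`, `ByteCountingParser`, `DigestingDecorator`, and the `digest:` metadata key are hypothetical stand-ins, not Tika's `Parser`, `AutoDetectParser`, or `DigestingParser`, and it does not reproduce how Tika computes its digests.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class DigestingWrapperSketch {

    /** Hypothetical stand-in for a parser; not Tika's Parser interface. */
    interface SimpleParser {
        void parse(InputStream in, Map<String, String> metadata) throws IOException;
    }

    /** Stand-in for the wrapped parser: it only counts the bytes it reads. */
    static class ByteCountingParser implements SimpleParser {
        public void parse(InputStream in, Map<String, String> metadata) throws IOException {
            long count = 0;
            byte[] buf = new byte[8192];
            int read;
            while ((read = in.read(buf)) != -1) {
                count += read;
            }
            metadata.put("length", Long.toString(count));
        }
    }

    /** Decorator: digests the bytes the delegate reads, then records the hash. */
    static class DigestingDecorator implements SimpleParser {
        private final SimpleParser delegate;
        private final String algorithm;

        DigestingDecorator(SimpleParser delegate, String algorithm) {
            this.delegate = delegate;
            this.algorithm = algorithm;
        }

        public void parse(InputStream in, Map<String, String> metadata) throws IOException {
            MessageDigest md;
            try {
                md = MessageDigest.getInstance(algorithm);
            } catch (NoSuchAlgorithmException e) {
                throw new IOException("Unsupported digest: " + algorithm, e);
            }
            DigestInputStream din = new DigestInputStream(in, md);
            delegate.parse(din, metadata);
            // Drain whatever the delegate left unread so the hash covers the whole stream
            byte[] buf = new byte[8192];
            while (din.read(buf) != -1) {
                // reading updates the digest as a side effect
            }
            metadata.put("digest:" + algorithm, toHex(md.digest()));
        }
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Wraps the stand-in parser in the decorator and parses the given text. */
    static Map<String, String> run(String text, String algorithm) {
        Map<String, String> metadata = new LinkedHashMap<>();
        SimpleParser parser = new DigestingDecorator(new ByteCountingParser(), algorithm);
        try {
            parser.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(run("hello", "MD5"));
    }
}
```

A further wrapper, such as the recursive one mentioned above, would nest the same way: each layer takes the inner parser as a constructor argument, which is exactly the part that is awkward to express in a config file.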



[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172605#comment-15172605
 ] 

Tim Allison commented on TIKA-1663:
---

Y, I much prefer #2.  The parameter part will be solved by TIKA-1508.  I 
haven't looked at that one in a while, but it is important.  Any 
interest in contributing?

Specifying which parser the digesting parser wraps...hmmm...y, it could be 
handled as you suggest. 

[~gagravarr], for parser wrappers (where one parser takes a delegate), any 
recommendation on how to specify this via config currently?



As a side note, beware of TIKA-1701 and truncated files!



[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2016-02-29 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172527#comment-15172527
 ] 

Thamme Gowda N commented on TIKA-1663:
--

[~chrismattmann] [~talli...@mitre.org] We need a SHA digest of the raw content 
for the MEMEX project.
I tried to enable the digesting parser by editing our config file:
{code}






.
{code}

This doesn't work, for the obvious reason that we haven't specified which 
digest algorithm to use.
After checking 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java,
 I found that DigestingParser is a flexible framework and takes constructor 
args. 

So, I propose two options:
1. We offer a few popular implementations, such as SHA and MD5 parsers, which 
don't need constructor args. This will let us activate them by editing the 
config XML file instead of source code.
2. We enhance the Tika configuration framework and these flexible parsers to 
accept runtime arguments, so that flexibility and ease of use are preserved. 
For instance, if we can supply the digest algorithm name from the config file 
and let the DigestingParser use it when it is instantiated, then we don't need 
to edit the source code of applications.
{code}




  MD5
   



.
{code}

I vote for option 2; even though it is slightly more work, I feel it is the 
way to go.
I do not know whether Tika already supports option 2 by accepting runtime 
arguments from the config file.
I faced a similar issue with NamedEntityParser, but found a workaround by 
using System properties.
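
As an illustration of the System-properties style of workaround, here is a minimal sketch that picks the digest algorithm from a JVM property and falls back to MD5. The property name `sketch.digest.algorithm` and the class `DigestPropertySketch` are hypothetical, chosen only for this example; they are not names used by Tika or by NamedEntityParser.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestPropertySketch {

    // Hypothetical property name, used only in this sketch
    static final String PROPERTY = "sketch.digest.algorithm";
    static final String DEFAULT_ALGORITHM = "MD5";

    /** Creates a digest for the algorithm named by the system property, or MD5 if unset. */
    static MessageDigest newDigestFromProperty() {
        String name = System.getProperty(PROPERTY, DEFAULT_ALGORITHM);
        try {
            return MessageDigest.getInstance(name);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(
                    "Property " + PROPERTY + " names an unknown digest: " + name, e);
        }
    }

    public static void main(String[] args) {
        // e.g. java -Dsketch.digest.algorithm=SHA-256 DigestPropertySketch
        MessageDigest md = newDigestFromProperty();
        System.out.println(md.getAlgorithm() + " produces " + md.getDigestLength() + " bytes");
    }
}
```

This keeps application source untouched, but the setting is global to the JVM, which is one reason passing parameters through the config file (option 2) seems preferable.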



[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata

2015-06-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604476#comment-14604476
 ] 

Hudson commented on TIKA-1663:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #769 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/769/])
TIKA-1663 add a DigestingParser (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1687981)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/DigestingAutoDetectParserFactory.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/builders
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/builders/AppParserFactoryBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
* /tika/trunk/tika-app/src/main/resources/tika-app-batch-config.xml
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/IParserFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java
* /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml
* /tika/trunk/tika-batch/src/test/java/org/apache/tika/parser/mock/MockParserFactory.java
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-MockConsumersBuilder.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-broken.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-test.xml
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/utils
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/DetectorResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/LanguageResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaDetectors.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaMimeTypes.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaParsers.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaUtils.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaVersion.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaWelcome.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/LanguageResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaDetectorsTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaParsersTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java