[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187290#comment-15187290 ]
Tim Allison commented on TIKA-1663:
---
[~gagravarr], am I right in that we cannot do this now: {code} {code}
> Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
> ---
>
> Key: TIKA-1663
> URL: https://issues.apache.org/jira/browse/TIKA-1663
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: digesting_parser_v1.patch
>
>
> It might be useful to integrate commons' DigestUtils and allow users to
> easily add the MD5 or other supported hashes to the Metadata object.
> Anyone else find this of use?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177851#comment-15177851 ]
Tim Allison commented on TIKA-1663:
---
Thank you, Nick. I somewhat prefer the first option (once we add the parameter setting). I'm hesitant to promote the DigestingParser (wrapper) to a special place, but I'm game if the community is.
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176652#comment-15176652 ]
Nick Burch commented on TIKA-1663:
--
The other parser decorators are specified with options inside the parent parser, e.g. mime includes or excludes are decorators given as options to the main parser. In some ways, this is quite nice, as you do the main definition on the thing that'll do the work, then the decorators after.
One option, for the general case, would be to add additional decorators too, e.g. http://tika.apache.org/1.12/configuring.html#Configuring_Parsers becomes
{code} image/jpeg application/pdf {code}
For the specific case of the digester, it's a well known thing, so we could give it custom tags. That would make things clearer, and would get round the parameter issue. One option is:
{code} image/jpeg application/pdf MD5,SHA256 {code}
The other, to keep it more in line with the mime includes/excludes, is:
{code} image/jpeg application/pdf MD5 SHA256 {code}
What do people think?
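To make the comma-separated variant (`MD5,SHA256`) concrete, here is a minimal sketch of how such a value could be split into algorithm names and applied in a single pass over the stream. It uses only the JDK's `MessageDigest`; the class `AlgorithmListSketch`, its method names, and the `SHA256` to `SHA-256` name mapping are assumptions for illustration, not Tika's config handling or the `DigestingParser` API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
final class AlgorithmListSketch {

    // Splits a value such as "MD5,SHA256" and maps "SHA256" to the JCA name "SHA-256".
    static List<String> parseAlgorithms(String csv) {
        List<String> names = new ArrayList<>();
        for (String raw : csv.split(",")) {
            String name = raw.trim().toUpperCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue;
            }
            if (name.matches("SHA\\d+")) {
                name = "SHA-" + name.substring(3);
            }
            names.add(name);
        }
        return names;
    }

    // Reads the stream once and feeds every chunk to all requested digests.
    static Map<String, String> digestAll(List<String> algorithms, InputStream in) {
        List<MessageDigest> digests = new ArrayList<>();
        for (String algorithm : algorithms) {
            try {
                digests.add(MessageDigest.getInstance(algorithm));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException("Unsupported digest: " + algorithm, e);
            }
        }
        byte[] buffer = new byte[8192];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (MessageDigest digest : digests) {
                    digest.update(buffer, 0, read);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < digests.size(); i++) {
            result.put(algorithms.get(i), toHex(digests.get(i).digest()));
        }
        return result;
    }

    // Lowercase hex encoding of a byte array.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = parseAlgorithms("MD5,SHA256");
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(digestAll(names, in));
    }
}
```

The separate-element variant would only change how the names are collected; the digesting loop would stay the same.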
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173179#comment-15173179 ]
Chris A. Mattmann commented on TIKA-1663:
-
thanks [~thammegowda] and [~talli...@apache.org] :)
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172679#comment-15172679 ]
Thamme Gowda N commented on TIKA-1663:
--
Yes, I would like to work on TIKA-1508, given a timeline of 6 to 8 days from now.
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172626#comment-15172626 ]
Tim Allison commented on TIKA-1663:
---
In tika-batch/tika-app, I did a not-so-great-workaround with an interface for a ParserFactory, and then I hardcoded a parser factory that wrapped a DigestingParser around the AutoDetectParser, and then wrapped all of that in a RecursiveParserWrapper...not happy with that and look forward to being able to configure this via the config file.
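The shape of that workaround (a factory interface plus a hardcoded factory that wraps a digesting decorator around an inner parser) can be sketched without Tika on the classpath. Everything below is a stand-in: `SimpleParser`, `SimpleParserFactory`, `LengthParser`, `DigestingWrapper`, and the `digest:` metadata key are hypothetical names, and the wrapper buffers the whole stream in memory, which is a simplification and not necessarily how the real DigestingParser handles its input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these types are Tika classes.
final class WrapperChainSketch {

    interface SimpleParser {
        void parse(InputStream in, Map<String, String> metadata);
    }

    interface SimpleParserFactory {
        SimpleParser create();
    }

    // Plays the role of the inner (delegate) parser: records the content length.
    static final class LengthParser implements SimpleParser {
        @Override
        public void parse(InputStream in, Map<String, String> metadata) {
            metadata.put("length", Integer.toString(readAll(in).length));
        }
    }

    // Decorator: digests the raw bytes, then hands a fresh stream to the delegate.
    static final class DigestingWrapper implements SimpleParser {
        private final SimpleParser delegate;
        private final String algorithm;

        DigestingWrapper(SimpleParser delegate, String algorithm) {
            this.delegate = delegate;
            this.algorithm = algorithm;
        }

        @Override
        public void parse(InputStream in, Map<String, String> metadata) {
            byte[] content = readAll(in);
            try {
                byte[] hash = MessageDigest.getInstance(algorithm).digest(content);
                metadata.put("digest:" + algorithm, toHex(hash));
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException("Unsupported digest: " + algorithm, e);
            }
            delegate.parse(new ByteArrayInputStream(content), metadata);
        }
    }

    // The "hardcoded factory": the wrapping order is fixed in code, not in config.
    static final class DigestingFactory implements SimpleParserFactory {
        private final String algorithm;

        DigestingFactory(String algorithm) {
            this.algorithm = algorithm;
        }

        @Override
        public SimpleParser create() {
            return new DigestingWrapper(new LengthParser(), algorithm);
        }
    }

    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        SimpleParser parser = new DigestingFactory("MD5").create();
        parser.parse(new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8)), metadata);
        System.out.println(metadata);
    }
}
```

A further wrapper (the role RecursiveParserWrapper plays in the comment) would compose the same way, by taking the `DigestingWrapper` as its delegate.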
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172605#comment-15172605 ]
Tim Allison commented on TIKA-1663:
---
Y, I much prefer #2. The parameter part will be solved by TIKA-1508. I haven't looked at that one in a while, though, but it is important. Any interest in contributing?
Specifying which parser the digesting parser wraps...hmmm...y, it could be handled as you suggest. [~gagravarr], for parser wrappers (where one parser takes a delegate), any recommendation on how to specify this via config currently?
As a side note, beware of TIKA-1701 and truncated files!
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172527#comment-15172527 ]
Thamme Gowda N commented on TIKA-1663:
--
[~chrismattmann] [~talli...@mitre.org] We need a SHA digest of raw content for the MEMEX project. I tried to enable the digesting parser by editing our config file:
{code} . {code}
This doesn't work, for the obvious reason that we haven't said which digest algorithm to use. After checking https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java, I found that DigestingParser is a flexible framework and takes constructor args. So, I propose two options:
1. We offer a few popular implementations, such as SHA and MD5 parsers, which don't need constructor args. This will enable us to activate them by editing the config xml file instead of source code.
2. We enhance the tika configuration framework and these flexible parsers to accept runtime arguments, so that the flexibility and ease of use are preserved. For instance, if we can supply the digest algorithm name from the config file and let the DigestingParser use it to instantiate, then we don't need to edit the source code of applications.
{code} MD5 . {code}
I vote for option 2 even though it is slightly more work; I feel it is the way to go. I do not know if Tika already has support for option 2 by accepting runtime arguments from the config file. I faced a similar issue with NamedEntityParser, but found a workaround by using System properties.
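Option 2 amounts to resolving the algorithm name at runtime instead of hardcoding it. The sketch below shows one possible shape of that idea, assuming a plain parameter map stands in for values read from the config file, with a System property as fallback (similar in spirit to the NamedEntityParser workaround mentioned above). The parameter key `digestAlgorithm`, the property name `sketch.digestAlgorithm`, the `digest:` metadata key, and the class itself are hypothetical, not Tika configuration names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for illustration only; not Tika's configuration mechanism.
final class ConfiguredDigestSketch {

    static final String PARAM = "digestAlgorithm";           // assumed parameter name
    static final String SYS_PROP = "sketch.digestAlgorithm"; // assumed fallback property

    // Looks up the algorithm in the supplied params, then in System properties.
    // Fails loudly when nothing is configured, instead of guessing a default.
    static String resolveAlgorithm(Map<String, String> params) {
        String algorithm = params.get(PARAM);
        if (algorithm == null) {
            algorithm = System.getProperty(SYS_PROP);
        }
        if (algorithm == null || algorithm.trim().isEmpty()) {
            throw new IllegalArgumentException("No digest algorithm configured");
        }
        return algorithm.trim();
    }

    // Digests the raw content and stores the hex value under "digest:<algorithm>".
    static void addDigest(Map<String, String> params, byte[] content, Map<String, String> metadata) {
        String algorithm = resolveAlgorithm(params);
        try {
            byte[] hash = MessageDigest.getInstance(algorithm).digest(content);
            metadata.put("digest:" + algorithm, toHex(hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported digest: " + algorithm, e);
        }
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(PARAM, "SHA-256");
        Map<String, String> metadata = new HashMap<>();
        addDigest(params, "abc".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(metadata);
    }
}
```

With an empty parameter map and no fallback property set, `resolveAlgorithm` throws, which mirrors the failure described above when the config names the parser but not the algorithm.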
[jira] [Commented] (TIKA-1663) Add a DigestingParser to add MD5/SHA-X hashes as fields in Metadata
[ https://issues.apache.org/jira/browse/TIKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604476#comment-14604476 ]
Hudson commented on TIKA-1663:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #769 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/769/])
TIKA-1663 add a DigestingParser (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1687981)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/DigestingAutoDetectParserFactory.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/builders
* /tika/trunk/tika-app/src/main/java/org/apache/tika/batch/builders/AppParserFactoryBuilder.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
* /tika/trunk/tika-app/src/main/resources/tika-app-batch-config.xml
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/log4j_batch_process_test.properties
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/IParserFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java
* /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java
* /tika/trunk/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml
* /tika/trunk/tika-batch/src/test/java/org/apache/tika/parser/mock/MockParserFactory.java
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-MockConsumersBuilder.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-broken.xml
* /tika/trunk/tika-batch/src/test/resources/tika-batch-config-test.xml
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/utils
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/DigestingParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/DetectorResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/LanguageResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/MetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/RecursiveMetadataResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaDetectors.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaMimeTypes.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaParsers.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaUtils.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaVersion.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TikaWelcome.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/TranslateResource.java
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/LanguageResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaDetectorsTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaParsersTest.java
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java