+1 as always to Erick’s advice. DIH is only a PoC. We do have a DigestingParser in Tika, and when you combine that with the RecursiveParserWrapper, you can get digests not only of the main file but also of all embedded files/attachments, which can be pretty neat for some use cases.
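
If you go the separate-program route, the hashing step itself needs nothing beyond the JDK. Here is a minimal sketch, assuming you hash the raw file bytes yourself before sending the doc to Solr. It does not use Tika's DigestingParser, and the class and method names (FileHasher, sha256Hex) are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tika or Solr.
class FileHasher {

    // Streams the raw bytes through the digest in 8 KB chunks,
    // so large files are never held in memory all at once.
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        // Render the 32 digest bytes as lowercase hex.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Usage: java FileHasher /path/to/file
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(sha256Hex(in));
        }
    }
}
```

The resulting string can then go into whatever field you like on the document you assemble (the field name is up to your schema) before you send it to Solr.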

Operators are standing by on the user list for Tika when you have questions. :)

Cheers,

Tim

On Fri, May 25, 2018 at 11:10 AM Erick Erickson <erickerick...@gmail.com> wrote:

> I'd consider using a separate Java program that uses Tika directly, or
> one of various services. Then you can assemble whatever you please
> before sending the doc to Solr. There are multiple reasons to
> recommend this, see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> There are other reasons why using extractingRequestHandler is
> problematic in production, the biggest one being that it can blow up
> your server. Tika has to try to cope with every variant of every
> document format it processes, and I personally guarantee that the
> implementations from company X (which is no longer in business) for a
> PDF file (from a spec current 10 years ago) may "interpret" that
> spec...er...freely ;) And Tika has to then try to cope. It does a
> brilliant job, but there's going to be case N+1
>
> The inference, of course, is that extractingRequestHandler is largely
> a PoC tool IMO, it gets people going without having to write an external
> program but not something I'd recommend for production.
>
> Best,
> Erick
>
> On Thu, May 24, 2018 at 10:06 PM, Thomas Lustig <tm.lus...@gmail.com> wrote:
> > dear community,
> >
> > I would like to automatically add a sha256 filehash to a Document field
> > after a binary file is posted to a ExtractingRequestHandler.
> > First i thought, that the ExtractingRequestHandler has such a feature, but
> > so far i did not find a configuration.
> > It was mentioned that I should implement my own Update Request Processor
> > to calculate the hash and add it to a field.
> > The SignatureUpdateProcessor seemed to be an out-of-the-box option, but it
> > only supports md5 and also does not access the raw binary stream.
> >
> > The important thing is that i do need the binary stream of the uploaded
> > file to calculate a correct hashvalue (e.g. md5, sha256,..)
> > Is it possible to also arrange this with a ScriptUpdateProcessor and
> > javascript?.
> >
> > thanks in advance for any help
> >
> > Tom