+1 as always to Erick’s advice. DIH is only a PoC.

We do have a DigestingParser in Tika, and when you combine that with the
RecursiveParserWrapper, you can get digests not only of the main file but
also of all embedded files/attachments...which can be pretty neat for some
use cases.
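
If you go the separate-program route Erick describes below, the hash of
the raw bytes needs nothing beyond the JDK. A minimal sketch, assuming you
already have the file as an InputStream on the client side; the class name
StreamHasher and the method sha256Hex are made up for illustration, and
this is plain java.security.MessageDigest, not the Tika DigestingParser API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes the raw bytes of a stream with SHA-256.
final class StreamHasher {

    // Reads the stream to the end and returns the digest as lowercase hex.
    static String sha256Hex(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n); // feed only the bytes actually read
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Usage: java StreamHasher /path/to/file
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(sha256Hex(in));
        }
    }
}
```

The resulting string could then be added as an ordinary field on the
document you build before sending it to Solr; the field name is up to
your schema.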

Operators are standing by on the user list for Tika when you have
questions. :)

Cheers,
    Tim

On Fri, May 25, 2018 at 11:10 AM Erick Erickson <erickerick...@gmail.com>
wrote:

> I'd consider using a separate Java program that uses Tika directly, or
> one of various services. Then you can assemble whatever you please
> before sending the doc to Solr. There are multiple reasons to
> recommend this, see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> There are other reasons why using extractingRequestHandler is
> problematic in production, the biggest one being that it can blow up
> your server. Tika has to try to cope with every variant of every
> document format it processes, and I personally guarantee that the
> implementations from company X (which is no longer in business) for a
> PDF file (from a spec current 10 years ago) may "interpret" that
> spec...er...freely ;) And Tika has to then try to cope. It does a
> brilliant job, but there's going to be case N+1.
>
> The inference, of course, is that extractingRequestHandler is largely
> a PoC tool IMO: it gets people going without having to write an external
> program, but it's not something I'd recommend for production.
>
> Best,
> Erick
>
> On Thu, May 24, 2018 at 10:06 PM, Thomas Lustig <tm.lus...@gmail.com>
> wrote:
> > Dear community,
> >
> > I would like to automatically add a sha256 filehash to a Document field
> > after a binary file is posted to an ExtractingRequestHandler.
> > At first I thought that the ExtractingRequestHandler had such a feature,
> > but so far I have not found a configuration for it.
> > It was mentioned that I should implement my own Update Request Processor
> > to calculate the hash and add it to a field.
> > The SignatureUpdateProcessor seemed to be an out-of-the-box option, but
> > it only supports md5 and also does not access the raw binary stream.
> >
> > The important thing is that I need the binary stream of the uploaded
> > file to calculate a correct hash value (e.g. md5, sha256, ...).
> > Is it possible to arrange this with a ScriptUpdateProcessor and
> > JavaScript?
> >
> > thanks in advance for any help
> >
> > Tom
>
