On Wed, 14 Sep 2016, Allison, Timothy B. wrote:
Would it be as much of a disaster to require the user to allow the fileUrl capability on the commandline at server startup? We could add some menacing "all bets are off, we hope you know what you're doing" warning.

With a special switch, and a warning, enabling file:/// again wouldn't be too bad in my view.

I'm not sure about arbitrary URLs though - there's the security + DoS stuff, plus the fact that we won't be doing robots checking / niceness / etc. For anyone doing remote URLs, I think they do need to be using a proper + safe + server-friendly crawler, then passing the result of a successful fetch to the Tika server.
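
As a rough sketch of that split (the crawler fetches, the Tika server only parses), something like the following Python could hand already-fetched bytes to the server's /tika endpoint with a PUT. The URL assumes a local server on the default port 9998, and build_tika_request / extract_text are hypothetical helper names for illustration, not part of Tika.

```python
import urllib.request
from typing import Optional

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(body: bytes,
                       content_type: Optional[str] = None,
                       tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request carrying bytes that a crawler already fetched."""
    headers = {"Accept": "text/plain"}  # ask for plain-text extraction
    if content_type:
        # Pass along the type reported by the crawler's fetch as a hint.
        headers["Content-Type"] = content_type
    return urllib.request.Request(tika_url, data=body, method="PUT",
                                  headers=headers)


def extract_text(body: bytes,
                 content_type: Optional[str] = None,
                 timeout: float = 60.0) -> str:
    """Send the bytes to the Tika server and return the extracted text."""
    req = build_tika_request(body, content_type)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The fetch itself (robots checking, rate limiting, retries) stays in the crawler; only the bytes of a successful fetch are passed to extract_text, so the server never has to resolve a URL on the caller's behalf.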

My main concern in accessing the Tika libraries via TikaJAXRS is the performance overhead associated with going through sockets (and possibly the additional memory/file copying of file data if fileUrl is not available).

In my experience, depending on the file types, there's definitely some overhead, but the bottleneck is in the parsers (esp for complex document formats -- msoffice, pdf, etc), not data sloshing.

I agree - for almost all formats, the slow bit isn't byte shuffling, it's parsing.

Nick
