On Wed, 14 Sep 2016, Allison, Timothy B. wrote:
Would it be as much of a disaster to require the user to allow the
fileUrl capability on the commandline at server startup? We could add
some menacing "all bets are off, we hope you know what you're doing"
warning.
With a special switch, and a warning, enabling file:/// again wouldn't be
too bad in my view.
I'm not sure about arbitrary URLs though - there's the security + DoS
stuff, plus the fact that we won't be doing robots checking / niceness /
etc. For anyone doing remote URLs, I think they do need to be using a
proper + safe + server-friendly crawler, then passing the result of a
successful fetch to the Tika server.
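
As a rough illustration of that last step, here is a minimal sketch of
handing already-fetched bytes to a running Tika server. It assumes the
server is listening on localhost:9998 and that a PUT to /tika with an
"Accept: text/plain" header returns the extracted text; the function name
extract_text and the timeout value are made up for the example, and the
fetching itself is left to whatever crawler you use.

```python
import urllib.request

# Assumption: a Tika server running locally on port 9998.
TIKA_URL = "http://localhost:9998/tika"


def extract_text(data: bytes, tika_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send document bytes to the Tika server and return the extracted text.

    `data` is the body of a fetch that the crawler has already completed,
    so the Tika server never has to resolve a URL itself.
    """
    request = urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The crawler keeps responsibility for robots.txt, rate limiting and error
handling, and only calls extract_text() once it has a successful response
body in hand.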
My main concern in accessing the Tika libraries via TikaJAXRS is the
performance overheads associated with going through sockets (and
possibly the additional memory/file copying of file data if fileUrl is
not available).
In my experience, depending on the file types, there's definitely
some overhead, but the bottleneck is in the parsers (esp for complex
document formats -- msoffice, pdf, etc), not data sloshing.
I agree - for almost all formats, the slow bit isn't byte shuffling, it's
parsing.
Nick