Thank you, Nick.  For the reasons you listed, I'm averse to adding fileUrl 
back, but I'm not entirely against it.

Would it be as much of a disaster to require the user to allow the fileUrl
capability on the command line at server startup?  We could add some menacing
"all bets are off, we hope you know what you're doing" warning.
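
As a rough sketch of that opt-in (the flag name `--enableFileUrl` and the class
below are hypothetical, not existing tika-server options), the startup check
might look something like this:

```java
import java.util.Arrays;

/** Hypothetical sketch: fileUrl stays off unless explicitly enabled at startup. */
class FileUrlGate {
    // Assumed flag name, for illustration only.
    static final String FLAG = "--enableFileUrl";

    static final String WARNING =
        "WARNING: fileUrl is enabled. The server will fetch any URL a client "
        + "sends, including file:// paths. All bets are off; we hope you know "
        + "what you're doing.";

    /** Returns true only if the operator passed the opt-in flag. */
    public static boolean isFileUrlEnabled(String[] args) {
        return Arrays.asList(args).contains(FLAG);
    }

    public static void main(String[] args) {
        boolean enabled = isFileUrlEnabled(args);
        if (enabled) {
            // Print the warning once, at startup, where the operator will see it.
            System.err.println(WARNING);
        }
        System.out.println("fileUrl enabled: " + enabled);
    }
}
```

The point of the sketch is only that the default is "off" and that turning it
on is a deliberate act by whoever starts the server.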

If we went with something like this, we could allow all URLs, and users
wouldn't have to ship the bytes over the network; tika-server could read local
files directly from the file share.

This might still be a remarkably bad idea...


Cheers,

             Tim

P.S. 
> My main concern in accessing the Tika libraries via TikaJAXRS is the
> performance overheads associated with going through sockets (and possibly
> the additional memory/file copying of file data if fileUrl is not available).

In my experience, depending on the file types, there's definitely some
overhead, but the bottleneck is in the parsers (esp. for complex document
formats -- msoffice, pdf, etc.), not data sloshing.

-----Original Message-----
From: John Dougrez-Lewis [mailto:jle...@lightblue.com] 
Sent: Wednesday, September 14, 2016 2:35 AM
To: dev@tika.apache.org
Cc: 'Nick Burch' <apa...@gagravarr.org>
Subject: RE: Query on correct use of 'fileUrl' in TikaJAXRS Server to extract 
document at remote url - my request is not working

Thanks for the insight.

My interest (as a developer) in TikaJAXRS is that it provides a nice 
encapsulation of Tika functionality which is accessible across language 
boundaries. The fact that it can then also cross network boundaries is of 
secondary importance to me.

I'm developing code in C++ and I'd like to be able to access Tika's 
capabilities.

TikaJAXRS offers an easy way in. If the fileUrl functionality were in place,
and TikaJAXRS were running on the same box as the client and restricted to
listening on 127.0.0.1, with the file:// check as well, this would limit some
of the dangers listed below - an attacker would then need access to your host
box itself, in which case you would have already lost.
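
A minimal sketch of those two checks, reading "the file:// check" as accepting
only file:// URLs (that reading is an assumption). The class and method names
are illustrative and not part of TikaJAXRS:

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/** Hypothetical policy: honour fileUrl only for file:// URLs on a loopback-bound server. */
class LocalFileUrlPolicy {

    /** True if the bind address is a loopback address such as 127.0.0.1. */
    public static boolean isLoopbackBind(String bindHost) {
        try {
            // Literal IPs are parsed directly; host names would trigger a lookup.
            return InetAddress.getByName(bindHost).isLoopbackAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** True if the URL parses and its scheme is "file". */
    public static boolean isFileScheme(String url) {
        try {
            return "file".equalsIgnoreCase(URI.create(url).getScheme());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: refuse rather than guess
        }
    }

    /** Both conditions must hold before the server touches the URL. */
    public static boolean allowFetch(String bindHost, String url) {
        return isLoopbackBind(bindHost) && isFileScheme(url);
    }

    public static void main(String[] args) {
        System.out.println(allowFetch("127.0.0.1", "file:///tmp/doc.pdf"));
        System.out.println(allowFetch("0.0.0.0", "file:///tmp/doc.pdf"));
    }
}
```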

My main concern in accessing the Tika libraries via TikaJAXRS is the 
performance overheads associated with going through sockets (and possibly the
additional memory/file copying of file data if fileUrl is not available).

Short of the Herculean task of porting the entirety of Tika from Java to
C++, are there any better, well-established, more performant ways of
interfacing with the Java Tika code from C++?


Regards,

John
