I have a massive number of documents that I need to parse with Apache Tika server.

Before switching to Tika server, I used a project I wrote myself that created forked Tika VMs and sent work to them directly over sockets. This worked OK but was super complicated, so I switched to the Tika Jetty server for simplicity's sake. It works great for the most part.

But one feature I had before was that I could say "if I don't get a result within MAX_PARSE_TIMEOUT_MS, stop parsing right there and return the bytes we managed to get up to that point." With the massive number of documents I need to parse, I cannot afford to have any parse hang longer than a certain amount of time.

With the rmeta/text method, we recently added the ability to send a writeLimit, where we stop parsing after we reach that number of bytes. Can we similarly add something that can "stop parsing after X ms have elapsed"?

Currently I'm having to do this through HTTP socket timeouts, but the problem is that it is all or nothing: if the socket times out, I get no content at all for that document. This leads to huge gaps in my results, because many of the docs hit socket timeouts, and when pounding the living crap out of Tika these timeouts become more and more likely.

-Nicholas
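To make the requested behavior concrete, here is a minimal Java sketch of "stop after X ms and return whatever we have so far". It is not Tika code and does not use any Tika API: `TimeBoxedParse`, `ChunkSource`, and `parseWithDeadline` are hypothetical names made up for illustration, and `ChunkSource` simply stands in for any parser that emits text incrementally.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBoxedParse {

    /** Hypothetical stand-in for a parser that produces text piece by piece. */
    interface ChunkSource {
        /** Returns the next piece of extracted text, or null when parsing is done. */
        String nextChunk() throws Exception;
    }

    /**
     * Runs the parse on a worker thread and waits at most maxParseTimeoutMs.
     * On timeout the worker is interrupted and the text collected so far is
     * returned, instead of failing the whole document.
     */
    static String parseWithDeadline(ChunkSource source, long maxParseTimeoutMs) {
        // StringBuffer is synchronized, so the caller can safely read it
        // while the worker may still be appending.
        StringBuffer collected = new StringBuffer();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<?> task = worker.submit(() -> {
            String chunk;
            while ((chunk = source.nextChunk()) != null) {
                collected.append(chunk);
            }
            return null; // returning a value makes this lambda a Callable
        });
        try {
            task.get(maxParseTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the parse, keep the partial text
        } catch (ExecutionException e) {
            // The parse itself failed; still keep whatever was extracted.
        } catch (InterruptedException e) {
            task.cancel(true);
            Thread.currentThread().interrupt();
        } finally {
            worker.shutdownNow();
        }
        return collected.toString();
    }
}
```

Calling `parseWithDeadline(source, MAX_PARSE_TIMEOUT_MS)` returns the full text when the parse finishes in time and a prefix of it otherwise. Note that interrupting a thread only works if the parse cooperates with interruption; a parser stuck in a tight loop would need a forked process that can be killed, which is the part that made my old setup complicated.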
