[
https://issues.apache.org/jira/browse/TIKA-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605925#comment-16605925
]
Tim Allison commented on TIKA-2725:
-----------------------------------
bq. What is tika-server typical env? stand-alone, distributed ... like replicas
in cluster?
It varies, I'm sure. I don't know what the most common use case is. I would
hope distributed -- swarm or similar.
bq. Are there some time limitation for recovery?
I think whoever starts the server should be able to set the per-file timeout
threshold...although I may be misunderstanding your question.
bq. How do we know what point to start processing from?
That wouldn't be tika-server's problem. Clients calling tika-server would get
an error message, or potentially no response within a socket/http timeout
range. They should not reprocess those docs.
bq. Do we mark documents which were processed?
Same as above, that's a client concern.
bq. For example, if tika-server had run on Docker swarm/K8S then orchestrator
would have restarted a failed replica itself
To confirm that I understand this correctly: currently, if the tika-server
process dies, swarm/k8s will automatically restart it? That's good to hear.
However, we don't currently have the watcher thread within tika-server to kill
its own process on OOM/timeout...so, as it stands, only something catastrophic
(the operating system, perhaps?) would take the tika-server process down.
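
For illustration only, here is a minimal sketch of the kind of watcher thread mentioned above. Nothing in it is existing tika-server code: the class name `TaskWatchdog`, the `begin()`/`end()` bookkeeping and the injectable `onTimeout` action are all hypothetical. It covers only the timeout case, not OOM detection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watchdog sketch; not part of tika-server. */
public class TaskWatchdog implements Runnable {
    // Start time (System.nanoTime) of every task that is still running.
    private final Map<Long, Long> startNanos = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();
    private final long timeoutMillis;
    private final long pollMillis;
    private final Runnable onTimeout;

    public TaskWatchdog(long timeoutMillis, long pollMillis, Runnable onTimeout) {
        this.timeoutMillis = timeoutMillis;
        this.pollMillis = pollMillis;
        this.onTimeout = onTimeout;
    }

    /** Call before handling a file; returns an id to pass to end(). */
    public long begin() {
        long id = nextId.incrementAndGet();
        startNanos.put(id, System.nanoTime());
        return id;
    }

    /** Call (ideally in a finally block) once the file has been handled. */
    public void end(long id) {
        startNanos.remove(id);
    }

    /** True if any task still running has exceeded the timeout. */
    public boolean anyTaskTimedOut() {
        long now = System.nanoTime();
        for (long start : startNanos.values()) {
            if (TimeUnit.NANOSECONDS.toMillis(now - start) > timeoutMillis) {
                return true;
            }
        }
        return false;
    }

    /** Polls until a task times out, then runs the timeout action once. */
    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                if (anyTaskTimedOut()) {
                    onTimeout.run();
                    return;
                }
                Thread.sleep(pollMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real server process the timeout action could be something like `() -> Runtime.getRuntime().halt(1)`, so that the process dies and whatever supervises it (a parent process, swarm, k8s) can restart it. The action is injected here so the sketch can be exercised without killing the JVM.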
> Make tika-server robust against ooms/infinite loops/memory leaks
> ----------------------------------------------------------------
>
> Key: TIKA-2725
> URL: https://issues.apache.org/jira/browse/TIKA-2725
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
>
> Currently, tika-server is vulnerable to ooms, infinite loops and memory
> leaks. I see two ways of making it robust:
> 1) use the ForkParser
> 2) have tika-server spawn a child process that actually runs the server, put
> a watcher thread in the child that will kill the child on oom/timeout/after x
> files. The parent process can then restart the child if it dies.
> I somewhat prefer 2) so that we don't have to doubly pass the inputstream. I
> propose 2), and I propose making it optional in Tika 1.x, but then the
> default in Tika 2.x. We could also add a status ping from parent to child in
> case the child gets caught up in stop-the-world GC (h/t [~bleskes]).
> Other options/recommendations?
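
As a rough illustration of option 2) in the description above, the parent side could look something like the following. This is a sketch under assumptions, not the eventual tika-server implementation: `ChildSupervisor`, its `maxRestarts` cap and the `superviseUntilLimit()` method are hypothetical names, and the status ping from parent to child is left out.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical parent-side supervisor sketch; not part of tika-server. */
public class ChildSupervisor {
    private final List<String> command;
    private final int maxRestarts;

    /**
     * @param command     command line that starts the child (e.g. a java invocation)
     * @param maxRestarts how many times to restart the child after its first launch
     */
    public ChildSupervisor(List<String> command, int maxRestarts) {
        this.command = command;
        this.maxRestarts = maxRestarts;
    }

    /**
     * Launches the child, waits for it to exit, and relaunches it until the
     * restart cap is reached. Returns the total number of launches.
     */
    public int superviseUntilLimit() throws IOException, InterruptedException {
        int launches = 0;
        while (launches <= maxRestarts) {
            // inheritIO() lets the child share the parent's stdout/stderr.
            Process child = new ProcessBuilder(command).inheritIO().start();
            launches++;
            int exitCode = child.waitFor();
            System.err.println("child exited with code " + exitCode
                    + " after launch " + launches);
        }
        return launches;
    }
}
```

The restart cap is only there to keep the sketch finite. A long-running parent would more likely loop indefinitely, possibly with a back-off between restarts.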