[ https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227095#comment-17227095 ]
Luís Filipe Nassif commented on TIKA-3221:
------------------------------------------

My 2 cents: in the past I got ConcurrentModificationExceptions when metadata was being read by the client thread while a timed-out parsing thread was still writing to it. I "solved" it with a try-catch block. IMHO, ConcurrentHashMap reads are lock-free, so Tim's conclusion about small overhead makes sense. I think using it could be good. But the best way to handle timeouts is to use ForkParser or tika server.

> /rmeta/text endpoint - allow a "max parse time" parameter where after
> exceeded, return bytes/metadata managed to get up to that point
> -------------------------------------------------------------------------
>
> Key: TIKA-3221
> URL: https://issues.apache.org/jira/browse/TIKA-3221
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> Can we make a change to the
> {code}
> /rmeta/text
> {code}
> endpoint to allow a "max parse time" parameter where after exceeded, return
> bytes/metadata managed to get up to that point.
> Motivation:
> I have a massive number of documents that I need to fetch through Apache Tika
> server.
> Prior to making a switch to tika server, I used a project I created myself
> https://github.com/nddipiazza/tika-fork that created tika forked VMs and
> would send work to the VMs through sockets directly.
> This was OK but super complicated so I chose to switch to the Tika jetty
> server for simplicity's sake.
> Tika Server works great for the most part for this use case... But one
> feature I had before was that I could say "If I don't get a result within
> MAX_PARSE_TIMEOUT_MS, then stop parsing at that moment and return the bytes
> we managed to get up to that point".
> This is because with the massive number of documents I need to parse, I
> cannot afford to have any parse hang longer than a certain amount of time.
> But conversely, if I make the timeout 20 seconds, then I suffer massive gaps
> with *no* content at all.
> With the rmeta/text method, we recently added the ability to send a
> writeLimit where we will stop parsing after we reach that number of bytes.
> I'm hoping we can do the same for the time parsed. Perhaps when checking byte
> size, periodically check time and quit the parser in the same way.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
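
For illustration only, the idea at the end of the issue description (check elapsed time at the same place the write limit is checked, and keep whatever was collected so far) could be sketched in plain Java roughly as below. This is not Tika code: `BoundedTextCollector`, its methods, and the injected clock are hypothetical names made up for this sketch, and a real change would have to hook into Tika's own content handler.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Tika): collects text until either a
 * character limit or a maximum parse time is exceeded, keeping partial output.
 */
final class BoundedTextCollector {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;       // maximum number of characters to keep
    private final long maxParseNanos;   // maximum elapsed time in nanoseconds
    private final LongSupplier clock;   // e.g. System::nanoTime; injectable for tests
    private final long startNanos;
    private boolean truncated;

    BoundedTextCollector(int writeLimit, long maxParseNanos, LongSupplier clock) {
        this.writeLimit = writeLimit;
        this.maxParseNanos = maxParseNanos;
        this.clock = clock;
        this.startNanos = clock.getAsLong();
    }

    /** Returns false once a limit has been hit; the caller should stop parsing. */
    boolean append(CharSequence chunk) {
        if (truncated) {
            return false;
        }
        // Time check, done in the same place as the size check.
        if (clock.getAsLong() - startNanos >= maxParseNanos) {
            truncated = true;
            return false;
        }
        int room = writeLimit - text.length();
        if (chunk.length() > room) {
            text.append(chunk, 0, room); // keep only the part that still fits
            truncated = true;
            return false;
        }
        text.append(chunk);
        return true;
    }

    /** Text collected so far, including partial output after a limit was hit. */
    String text() {
        return text.toString();
    }

    boolean isTruncated() {
        return truncated;
    }
}
```

A caller would pass `System::nanoTime` as the clock and stop feeding chunks as soon as `append` returns false, then return `text()` as the partial result. Note that a check like this only runs when the parser emits output, so it would not rescue a parse that hangs without producing any text; that case still needs ForkParser or tika server, as the comment above says.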