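[jira] [Commented] (TIKA-3221) /rmeta/text endpoint - allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point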

Jira Thu, 05 Nov 2020 16:59:05 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227095#comment-17227095
 ]


Luís Filipe Nassif commented on TIKA-3221:
------------------------------------------

My 2 cents: in the past I got ConcurrentModificationExceptions when metadata 
was being read by the client thread while timed-out parsing threads were still 
writing to it. I "solved" that with a try-catch block. IMHO, ConcurrentHashMap 
is mostly lock free (reads do not block), so Tim's conclusion about small 
overhead makes sense. I think using it could be good.

But the best solution for handling timeouts is to use ForkParser or Tika server.
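
The race described above can be illustrated with a minimal sketch in plain Java. It does not use Tika: the map stands in for Tika's Metadata, the loop stands in for a slow parser, and the class name SharedMetadataSketch and the method parseWithTimeout are hypothetical. The client gives up after a timeout and copies whatever the still-running parser thread has written so far; with a ConcurrentHashMap that copy is safe, whereas a plain HashMap could throw ConcurrentModificationException here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SharedMetadataSketch {

    // Runs a fake "parse" in another thread and returns the metadata
    // collected before the timeout expired.
    static Map<String, String> parseWithTimeout(long timeoutMillis) throws Exception {
        // Stand-in for Tika's Metadata; safe to read while another thread writes.
        Map<String, String> metadata = new ConcurrentHashMap<>();
        CountDownLatch started = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Hypothetical slow parser: keeps adding entries for about a second.
        Future<?> parse = pool.submit(() -> {
            for (int i = 0; i < 1000; i++) {
                metadata.put("key" + i, "value" + i);
                started.countDown(); // signal that at least one entry exists
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    return; // stop when the client cancels the parse
                }
            }
        });

        try {
            started.await();
            parse.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The parser thread is still running and still writing.
        }

        // Copying iterates the map while the parser may still be writing.
        // ConcurrentHashMap iterators are weakly consistent, so this never
        // throws ConcurrentModificationException.
        Map<String, String> snapshot = new HashMap<>(metadata);

        parse.cancel(true);
        pool.shutdownNow();
        return snapshot;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> partial = parseWithTimeout(50);
        System.out.println("entries collected before timeout: " + partial.size());
    }
}
```

Note that the sketch only works because the fake parser responds to interruption. A real parser stuck in a loop may ignore the interrupt, which is one reason a separate process (ForkParser or Tika server) is the more robust way to enforce timeouts.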

> /rmeta/text endpoint - allow a "max parse time" parameter where after 
> exceeded, return bytes/metadata managed to get up to that point
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3221
>                 URL: https://issues.apache.org/jira/browse/TIKA-3221
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Can we make a change to the 
> {code}
> /rmeta/text
> {code}
> endpoint to allow a "max parse time" parameter so that, once it is exceeded, 
> the server returns the bytes/metadata it managed to get up to that point?
> Motivation:
> I have a massive number of documents that I need to fetch through Apache Tika 
> server.
> Prior to making a switch to tika server, I used a project I created myself 
> https://github.com/nddipiazza/tika-fork that created tika forked VMs and 
> would send work to the VMs through sockets directly.
> This was OK but super complicated so I chose to switch to the Tika jetty 
> server for simplicity's sake.
> Tika Server works great for the most part for this use case... But one 
> feature I had before was that I could say "If I don't get a result within 
> MAX_PARSE_TIMEOUT_MS, then stop parsing at that moment and return the bytes 
> we managed to get up to that point."
> This is because with the massive number of documents I need to parse, I 
> cannot afford to have any parse hang longer than a certain amount of time. 
> But conversely, if I set the timeout to 20 seconds, then I suffer massive 
> gaps with *no* content at all.
> With the rmeta/text method, we recently added the ability to send a 
> writeLimit where we will stop parsing after we reach that number of bytes.
> I'm hoping we can do the same for the time parsed. Perhaps when checking byte 
> size, periodically check time and quit parser in the same way. 
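
The idea in the quoted request, checking the clock in the same place where the byte count is checked, could be sketched as follows. This is a plain-Java illustration under stated assumptions, not Tika's actual writeLimit implementation: the class DeadlineLimitedWriter, its nested LimitReachedException, and the collect helper are all hypothetical names, and the limit is counted in characters rather than bytes for simplicity.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical writer that stops accepting text once either a character
// limit or a time budget is exhausted.
class DeadlineLimitedWriter extends Writer {

    // Signals that one of the two limits was hit.
    static class LimitReachedException extends IOException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    private final Writer delegate;
    private final int writeLimit;     // maximum number of characters to keep
    private final long deadlineNanos; // System.nanoTime() value after which we stop
    private int written;

    DeadlineLimitedWriter(Writer delegate, int writeLimit, long maxParseMillis) {
        this.delegate = delegate;
        this.writeLimit = writeLimit;
        this.deadlineNanos = System.nanoTime() + maxParseMillis * 1_000_000L;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Time check: performed on every write, next to the size check.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new LimitReachedException("max parse time exceeded");
        }
        // Size check: keep only as much as still fits under the limit.
        int n = Math.min(len, writeLimit - written);
        delegate.write(cbuf, off, n);
        written += n;
        if (n < len) {
            throw new LimitReachedException("write limit reached");
        }
    }

    @Override
    public void flush() throws IOException {
        delegate.flush();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    // Feeds chunks to the writer and returns whatever was collected
    // before a limit was hit.
    static String collect(String[] chunks, int writeLimit, long maxParseMillis) {
        StringWriter out = new StringWriter();
        try (Writer limited = new DeadlineLimitedWriter(out, writeLimit, maxParseMillis)) {
            for (String chunk : chunks) {
                limited.write(chunk);
            }
        } catch (IOException e) {
            // Limit reached: fall through and return the partial content.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect(new String[] {"hello ", "world"}, 8, 10_000));
    }
}
```

One limitation of this design: the deadline is only examined when the parser emits text, so a parser that hangs without writing anything is never stopped by it. That case still needs an external timeout such as a forked process.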



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
