[ 
https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226940#comment-17226940
 ] 

Nicholas DiPiazza edited comment on TIKA-3221 at 11/5/20, 7:33 PM:
-------------------------------------------------------------------

Yep, this was a great idea "in theory", but in practice it didn't do a damn 
thing, so we should close this out along with my PR.

The problem I was really trying to solve is "prevent long hanging parses" but 
still return the bytes we managed to get before it hung.

That way even though the parse failed after say 20 seconds, we still get 
_something_ and we don't just have missing body contents all over in the index. 

But the problem is that, in practice, those parses usually hang before any 
bytes get written. 
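
To make that concrete, here is a rough sketch of the "time out but keep the partial output" idea. It uses only the plain JDK, not Tika's API; `TimedParse`, `parseWithTimeout`, and the `parseTask` callback are made-up names standing in for a real parse. If the task hangs before it appends anything, the returned string is simply empty, which is the case described above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical helper, not part of Tika.
class TimedParse {
    // Runs parseTask on a worker thread, waits at most timeoutMillis, and
    // returns whatever text the task had written to the buffer by then.
    static String parseWithTimeout(Consumer<StringBuffer> parseTask, long timeoutMillis) {
        StringBuffer out = new StringBuffer(); // synchronized, safe to read while the worker writes
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> f = pool.submit(() -> parseTask.accept(out));
        try {
            f.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // best effort: a truly hung parser may ignore the interrupt
        } catch (ExecutionException e) {
            // The parse itself failed; fall through and return what we have.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            f.cancel(true);
        } finally {
            pool.shutdownNow();
        }
        return out.toString();
    }
}
```

A task that writes some text and then stalls gives back that text; a task that stalls first gives back nothing, no matter what the timeout is.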

So yeah, this was invalid. 

I actually moved away from using ForkParser; it was hanging really badly 
compared to using a Tika server. It seems like Jetty manages the connections 
very well, which ends up getting much more throughput. 

But yeah, if I could get ForkParser working as fast as or faster than a pool of 
several locally running Tika servers, I'd be for doing whatever we can to avoid 
the missing body contents.


> /rmeta/text endpoint - allow a "max parse time" parameter where after 
> exceeded, return bytes/metadata managed to get up to that point
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3221
>                 URL: https://issues.apache.org/jira/browse/TIKA-3221
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> Can we make a change to the 
> {code}
> /rmeta/text
> {code}
> endpoint to allow a "max parse time" parameter where after exceeded, return 
> bytes/metadata managed to get up to that point.
> Motivation:
> I have a massive number of documents that I need to fetch through Apache Tika 
> Server.
> Prior to making a switch to tika server, I used a project I created myself 
> https://github.com/nddipiazza/tika-fork that created tika forked VMs and 
> would send work to the VMs through sockets directly.
> This was OK but super complicated so I chose to switch to the Tika jetty 
> server for simplicity's sake.
> Tika Server works great for the most part for this use case... But one 
> feature I had before was that I could say "If I don't get a result within 
> MAX_PARSE_TIMEOUT_MS, then stop parsing at that moment and return the bytes 
> we managed to get up to that point."
> This is because, with the massive number of documents I need to parse, I 
> cannot afford to have any parse hang longer than a certain amount of time. 
> But conversely, if I make the timeout 20 seconds, then I suffer massive gaps 
> with *no* content at all.
> With the rmeta/text method, we recently added the ability to send a 
> writeLimit where we will stop parsing after we reach that number of bytes.
> I'm hoping we can do the same for the time parsed. Perhaps when checking byte 
> size, periodically check time and quit parser in the same way. 
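
The time check suggested at the end of the description could be sketched roughly as below. This is only an illustration under assumptions: `DeadlineTextHandler` is a hypothetical class built on the standard SAX `DefaultHandler`, not Tika's actual write-limit handler, and the limit is counted in characters. It shows where such a check would sit, and also why it cannot fire if the parser never emits any text.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of Tika: collects text until either a
// character limit or a wall-clock deadline is hit, whichever comes first.
class DeadlineTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;      // maximum number of characters to keep
    private final long deadlineNanos;  // absolute deadline on the System.nanoTime() clock

    DeadlineTextHandler(int writeLimit, long maxParseMillis) {
        this.writeLimit = writeLimit;
        this.deadlineNanos = System.nanoTime() + maxParseMillis * 1_000_000L;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // The time check only runs when the parser hands over text, so a
        // parse that hangs before its first write never reaches this line.
        if (System.nanoTime() - deadlineNanos >= 0) {
            throw new SAXException("max parse time exceeded");
        }
        int room = writeLimit - text.length();
        if (length > room) {
            text.append(ch, start, room); // keep what still fits
            throw new SAXException("write limit reached");
        }
        text.append(ch, start, length);
    }

    // Text collected before the limit or the deadline stopped the parse.
    String textSoFar() {
        return text.toString();
    }
}
```

The caller would catch the `SAXException` and return `textSoFar()` along with whatever metadata had been gathered.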



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
