Thanks to all for your advice.
In my specific case, this was a Ruta problem. Peter, I filed a JIRA issue with
a minimal example, which makes the case for the « TooManyMatchesException »
feature you propose. I vote for it.

Of course, I already limit the size of input texts, but this is not enough.
One of the main strengths of UIMA is the ability to integrate annotators
produced by third parties. And each annotator is built on assumptions: at a
minimum, that the input is text, made of words, etc. Thus, pipelines get more
and more complex without the need to code all the processing. But in a
production environment anything can happen, and assumptions may not be
respected (e.g. non-textual data can be sent to the engine(s)). Sh** always
happens in production.

My case is quite specific, but I'm sure it can be generalized.

Thus, any feature that helps limit the damage caused by unexpected input
would be welcome. A size-limited FsIndexRepository seems to me a simple yet
powerful enough solution to many of these problems.
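
As a stopgap, I approximate this today with a small guard annotator between
pipeline steps. This is only a sketch (the class and the threshold are mine,
not an existing UIMA feature):

  import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
  import org.apache.uima.jcas.JCas;

  // Sketch of a guard annotator: fail fast on one document instead of
  // letting the whole service drown in garbage collection.
  public class AnnotationCountGuard extends JCasAnnotator_ImplBase {

    // Assumed bound; tune it to your pipeline.
    private static final int MAX_ANNOTATIONS = 500_000;

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
      int count = jcas.getAnnotationIndex().size();
      if (count > MAX_ANNOTATIONS) {
        // The caller can catch this, log the document id, and move on.
        throw new AnalysisEngineProcessException(new IllegalStateException(
            "Annotation count " + count + " exceeds " + MAX_ANNOTATIONS));
      }
    }
  }

Placed between two expensive engines, it turns a 20-minute GC stall into a
clean per-document failure.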

Best,

— Hugues


PS: apart from occasional problems, Ruta is a great platform for information
extraction. I love it!

> On 30 Apr 2017, at 12:57, Peter Klügl <peter.klu...@averbis.com> wrote:
> 
> Hi,
> 
> 
> here are some Ruta-specific comments, in addition to Thilo's and Marshall's
> answers.
> 
> - if you do not want to split the CAS into smaller ones, you can sometimes
> apply the rules to only some parts of the document (-> fewer
> annotations/rule matches created); see the sketch below
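>
> A minimal sketch of this idea, using Ruta's Java API (the Section type and
> the rule are invented; Section is assumed to already be in the type system):
>
>   import org.apache.uima.cas.CAS;
>   import org.apache.uima.ruta.engine.Ruta;
>
>   public class ScopedRulesExample {
>     public static void run(CAS cas) throws Exception {
>       // Run the expensive rules only inside Section annotations,
>       // instead of letting them match across the whole document.
>       Ruta.apply(cas, "DECLARE Keyword;\n"
>           + "BLOCK(perSection) Section{} {\n"
>           + "  W{REGEXP(\"deadline\") -> MARK(Keyword)};\n"
>           + "}\n");
>     }
>   }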
> 
> - there is a discussion related to this topic (about memory usage in Ruta):
> https://issues.apache.org/jira/browse/UIMA-5306
> 
> - I can add configuration parameters which limit the allowed number of rule
> matches and rule element matches for one rule/rule element. If a rule or
> rule element exceeds the limit, a runtime exception is thrown. I'll open a
> JIRA ticket for that. This is not a solution to the problem in my opinion,
> but it can help to identify and fix the problematic rules; a hypothetical
> configuration sketch follows.
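>
> Purely hypothetically, configuring such limits with uimaFIT could look like
> this (the two limit parameter names are invented and do not exist in
> RutaEngine yet; only PARAM_MAIN_SCRIPT is real):
>
>   import org.apache.uima.analysis_engine.AnalysisEngineDescription;
>   import org.apache.uima.fit.factory.AnalysisEngineFactory;
>   import org.apache.uima.ruta.engine.RutaEngine;
>
>   public class MatchLimitSketch {
>     public static AnalysisEngineDescription create() throws Exception {
>       return AnalysisEngineFactory.createEngineDescription(
>           RutaEngine.class,
>           RutaEngine.PARAM_MAIN_SCRIPT, "my.package.Script",
>           // Hypothetical parameter names mirroring the proposal above.
>           "maxRuleMatches", 100_000,
>           "maxRuleElementMatches", 100_000);
>     }
>   }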
> 
> - I do not want to include code that directly restricts the max memory in
> Ruta. That should rather happen in the framework, or in the code that
> calls/applies the Ruta analysis engine.
> 
> - I think there is a problem in Ruta, and there are several aspects that
> need to be considered here: the actual rules, the partitioning with
> RutaBasic, flaws in the implementation, and the configuration parameters of
> the analysis engine
> 
> - Are the rules inefficient (combinatory explosion)? I see Ruta more and
> more as a programming language for creating maintainable analysis engines
> faster. You can write efficient and inefficient code. If the rules are too
> slow, you should refactor them or replace them with a more efficient
> approach. Something like ANY+ is a good indicator that the rules are not
> optimal; you should only match on things if you have to (see the toy
> example below). There is also profiling functionality in the Ruta Workbench
> which shows you how long each rule took and how long specific
> conditions/actions took. This is information about speed rather than
> memory, but rules with many matches both take longer and require more
> memory, so it can serve as an indicator.
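>
> A toy example of the difference (both rules are invented):
>
>   import org.apache.uima.cas.CAS;
>   import org.apache.uima.ruta.engine.Ruta;
>
>   public class AnchoredRulesExample {
>     public static void run(CAS cas) throws Exception {
>       // Combinatory: even a reluctant ANY+? between two anchors can
>       // try a huge number of intermediate spans on unexpected input:
>       //   "DECLARE Range; NUM ANY+? NUM {-> MARK(Range, 1, 3)};"
>
>       // Anchored variant: at most one word in between -> far fewer
>       // rule matches to create and keep in memory.
>       Ruta.apply(cas, "DECLARE Range;\n"
>           + "NUM W? NUM {-> MARK(Range, 1, 3)};\n");
>     }
>   }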
> 
> - There are two main ways Ruta spends its memory: RutaBasic and
> RuleMatches. RutaBasic stores additional information which speeds up the
> rule inference and enables specific functionality. The rule matches are
> needed to remember where something matched, for the conditions and actions.
> You can reduce the memory usage by reducing the number of RutaBasic
> annotations, the number of annotations indexed in the RutaBasic
> annotations, or the number of RuleMatches -> by refactoring the rules.
> 
> - There are plans to make the implementation of RutaBasic more efficient by
> using more efficient data structures (there are some prototypes mentioned
> in the issue linked above). And I added some new configuration parameters
> (in Ruta 2.6.0, I think) which control which information is stored in
> RutaBasic, e.g., you do not need information about annotations if they or
> their types are not used in the rules.
> 
> - I think there is a flaw in the implementation which causes your problem, 
> and which can be fixed. I'll investigate it when I find the time. If you can 
> provide some minimal (synthetic) example for reproducing it, that would be 
> great.
> 
> - There is also the configuration parameter lowMemoryProfile, which reduces
> the information stored in RutaBasic; it lowers the memory usage but makes
> the rules run slower. A configuration sketch follows.
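>
> For illustration, enabling it with uimaFIT might look like this (the
> parameter constant should be PARAM_LOW_MEMORY_PROFILE; the script name is a
> placeholder):
>
>   import org.apache.uima.analysis_engine.AnalysisEngineDescription;
>   import org.apache.uima.fit.factory.AnalysisEngineFactory;
>   import org.apache.uima.ruta.engine.RutaEngine;
>
>   public class LowMemorySketch {
>     public static AnalysisEngineDescription create() throws Exception {
>       // Trades speed for memory: RutaBasic keeps less information.
>       return AnalysisEngineFactory.createEngineDescription(
>           RutaEngine.class,
>           RutaEngine.PARAM_MAIN_SCRIPT, "my.package.Script",
>           RutaEngine.PARAM_LOW_MEMORY_PROFILE, true);
>     }
>   }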
> 
> 
> Best,
> 
> 
> Peter
> 
> 
> 
> On 29 Apr 2017, at 12:53, Hugues de Mazancourt wrote:
>> Hello UIMA users,
>> 
>> I’m currently putting a Ruta-based system into production and I sometimes
>> run out of memory.
>> This is usually caused by combinatory explosion in Ruta rules. These rules
>> are not necessarily faulty: they are adapted to the documents I expect to
>> parse. But as this is an open system, people can upload whatever they
>> want, and the parser crashes after multiplying annotations (or at least
>> spends 20 minutes garbage-collecting millions of annotations).
>> 
>> Thus, my question is: is there a way to limit the memory used by an
>> annotator, to limit the number of annotations made by an annotator, or to
>> limit the number of matches made by Ruta?
>> I would rather cancel the parse of a single document than take 20 minutes
>> of downtime for the whole system.
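>>
>> The best workaround I can think of so far is a watchdog around the
>> process() call. A rough sketch (the 60-second budget is arbitrary, and
>> cancellation is best-effort, since the engine may not check for thread
>> interruption):
>>
>>   import java.util.concurrent.ExecutorService;
>>   import java.util.concurrent.Executors;
>>   import java.util.concurrent.Future;
>>   import java.util.concurrent.TimeUnit;
>>   import java.util.concurrent.TimeoutException;
>>   import org.apache.uima.analysis_engine.AnalysisEngine;
>>   import org.apache.uima.cas.CAS;
>>
>>   public class TimeoutRunner {
>>     private final ExecutorService pool = Executors.newSingleThreadExecutor();
>>
>>     // Cancel the parse of one document instead of stalling the service.
>>     public void processWithTimeout(AnalysisEngine ae, CAS cas)
>>         throws Exception {
>>       Future<?> job = pool.submit(() -> {
>>         ae.process(cas);
>>         return null;
>>       });
>>       try {
>>         job.get(60, TimeUnit.SECONDS); // assumed per-document budget
>>       } catch (TimeoutException e) {
>>         job.cancel(true); // best effort only
>>         throw e;
>>       }
>>     }
>>   }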
>> 
>> Several UIMA-based services run in production, so I guess that others have
>> certainly hit the same problem.
>> 
>> Any hint on that topic would be very helpful.
>> 
>> Thanks,
>> 
>> Hugues de Mazancourt
>> http://about.me/mazancourt
