Thanks to all for your advice. In my specific case, this was a Ruta problem. Peter, I filed a JIRA issue with a minimal example, which would argue for the "TooManyMatchesException" feature you propose. I vote for it.
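Until such a guard exists inside Ruta, the best stop-gap I have found is to bound each call from the outside, so that one runaway document can be cancelled instead of taking the whole service down. A rough sketch (the 30-second budget is arbitrary, and cancellation is only best-effort, since it relies on thread interruption):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;

    public class GuardedRunner {

        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        /** Process one CAS, giving up after a hypothetical 30-second budget. */
        public void processWithTimeout(AnalysisEngine engine, CAS cas)
                throws Exception {
            Future<?> job = worker.submit(() -> {
                engine.process(cas);
                return null;
            });
            try {
                job.get(30, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Best effort only: a rule stuck in pure computation may ignore
                // the interrupt, so the CAS must be discarded and, in the worst
                // case, the worker JVM recycled.
                job.cancel(true);
                throw new RuntimeException("Parse cancelled for this document", e);
            }
        }
    }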
Of course, I already limit the size of input texts, but this is not enough. One of the main strengths of UIMA is the ability to integrate annotators produced by third parties. And each annotator is based on assumptions, at the very least that the input is a text, formed of words, etc. Thus, pipelines get more and more complex, without the need to code all the processing. But in a production environment anything can happen, and assumptions may not be respected (e.g. non-textual data can be sent to the engine(s)). Sh** always happens in production. My case is a rather specific one, but I'm sure it can be generalized. Thus, any feature that can help limit the damage of unexpected input would be welcome. And a limited-size FsIndexRepository seems to me a simple yet powerful enough solution to many problems (below the quoted mail, I sketch the kind of caller-side check I mean).

Best,

-- Hugues

PS: apart from occasional problems, Ruta is a great platform for information extraction. I love it!

> On 30 Apr 2017, at 12:57, Peter Klügl <peter.klu...@averbis.com> wrote:
>
> Hi,
>
> here are some Ruta-specific comments in addition to Thilo's and Marshall's answers.
>
> - If you do not want to split the CAS into smaller ones, you can also sometimes apply the rules to just some parts of the document (-> fewer annotations/rule matches created).
>
> - There is a discussion related to this topic (about memory usage in Ruta): https://issues.apache.org/jira/browse/UIMA-5306
>
> - I can include configuration parameters which limit the allowed number of rule matches and rule element matches of one rule/rule element. If a rule or rule element exceeds it, a runtime exception is thrown. I'll open a JIRA ticket for that. This is not a solution to the problem in my opinion, but it can help to identify and fix the problematic rules.
>
> - I do not want to include code that directly restricts the max memory in Ruta. That should rather happen in the framework or in the code that calls/applies the Ruta analysis engine.
>
> - I think there is a problem in Ruta, and there are several aspects that need to be considered here: the actual rules, the partitioning with RutaBasic, flaws in the implementation, and the configuration parameters of the analysis engine.
>
> - Are the rules inefficient (combinatory explosion)? I see Ruta more and more as a programming language for creating maintainable analysis engines faster. You can write efficient and inefficient code. If the code/rules are too slow or take too long, you should refactor them and replace them with a more efficient approach. Something like ANY+ is a good indicator that the rules are not optimal; you should only match on things if you have to. There is also profiling functionality in the Ruta Workbench which shows you how long each rule took and how long specific conditions/actions took. This is information about speed rather than memory, but many rule matches take longer and require more memory, so it can be an indicator.
>
> - There are two specific aspects of how Ruta spends its memory: RutaBasic and RuleMatches. RutaBasic stores additional information which speeds up the rule inference and enables specific functionality. The rule matches are needed to remember where something matched, for the conditions and actions. You can reduce the memory usage by reducing the number of RutaBasic annotations, the number of annotations indexed in the RutaBasic annotations, or the number of RuleMatches -> refactoring the rules.
> - There are plans to make the implementation of RutaBasic more efficient by using more efficient data structures (there are some prototypes mentioned in the issue linked above). And I added some new configuration parameters (in Ruta 2.6.0, I think) which control which information is stored in RutaBasic; e.g., you do not need information about annotations if they or their types are not used in the rules.
>
> - I think there is a flaw in the implementation which causes your problem, and which can be fixed. I'll investigate it when I find the time. If you can provide some minimal (synthetic) example for reproducing it, that would be great.
>
> - There is the configuration parameter lowMemoryProfile for reducing the information stored in RutaBasic, which reduces the memory usage but makes the rules run slower.
>
>
> Best,
>
>
> Peter
>
>
> On 29.04.2017, at 12:53, Hugues de Mazancourt wrote:
>> Hello UIMA users,
>>
>> I'm currently putting a Ruta-based system into production and I sometimes run out of memory.
>> This is usually caused by combinatory explosion in Ruta rules. These rules are not necessarily faulty: they are adapted to the documents I expect to parse. But as this is an open system, people can upload whatever they want, and the parser crashes by multiplying annotations (or at least spends 20 minutes garbage-collecting millions of annotations).
>>
>> Thus, my question is: is there a way to limit the memory used by an annotator, to limit the number of annotations made by an annotator, or to limit the number of matches made by Ruta?
>> I would rather cancel the parse of a given document than suffer a 20-minute downtime of the whole system.
>>
>> Several UIMA-based services run in production; I guess others have certainly hit the same problem.
>>
>> Any hint on this topic would be very helpful.
>>
>> Thanks,
>>
>> Hugues de Mazancourt
>> http://about.me/mazancourt
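PPS, for the archives: the limited-size FsIndexRepository I vote for above does not exist yet, so the closest I can get today is a caller-side check after the engine has run. A minimal sketch in plain UIMA (the 500,000 budget is a made-up number to tune per application):

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;

    public final class AnnotationBudget {

        // Hypothetical budget: reject any document that ends up with
        // more annotations than this after the engine has run.
        private static final int MAX_ANNOTATIONS = 500_000;

        public static void processBounded(AnalysisEngine engine, CAS cas)
                throws Exception {
            engine.process(cas);
            int size = cas.getAnnotationIndex().size();
            if (size > MAX_ANNOTATIONS) {
                cas.reset(); // drop the suspect result before anything reads it
                throw new IllegalStateException("Document rejected: "
                        + size + " annotations exceed the budget");
            }
        }
    }

Of course this only catches the explosion after the fact; a limit enforced inside the index repository itself would stop the match while it grows, which is exactly why I vote for the feature.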
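And one note on Peter's lowMemoryProfile hint: it is an ordinary configuration parameter, so it can be switched on where the engine is instantiated. A sketch with uimaFIT, assuming Ruta 2.x (the inline rule is a placeholder, and the PARAM_* constant names should be double-checked against your Ruta version):

    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;
    import org.apache.uima.ruta.engine.RutaEngine;

    public class LowMemorySetup {
        public static AnalysisEngine createEngine() throws Exception {
            return AnalysisEngineFactory.createEngine(
                    RutaEngine.class,
                    // placeholder rule, just to make the sketch self-contained
                    RutaEngine.PARAM_RULES,
                    "DECLARE Headline; CW+{-> MARK(Headline)};",
                    // trade some rule speed for a smaller RutaBasic footprint
                    RutaEngine.PARAM_LOW_MEMORY_PROFILE, true);
        }
    }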