I'll check that (I'm writing a unit test right now).

On 18.08.2016 at 14:36, armin.weg...@bka.bund.de wrote:
> Hi Peter,
>
> it doesn't work like that for me. I've removed DefaultSeeder and added my
> own seeder implementing RutaAnnotationSeeder. Now I have all of Ruta's
> standard tokens plus my own tokenization at the same time.
>
> Cheers,
> Armin
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.klu...@averbis.com]
> Sent: Thursday, 18 August 2016 14:23
> To: user@uima.apache.org
> Subject: Re: Ruta 2.4.0 - High memory needs
>
> Hi,
>
> On 18.08.2016 at 14:17, armin.weg...@bka.bund.de wrote:
>> Hello Peter!
>>
>> Please correct me if I'm wrong. My understanding of how Ruta works is as
>> follows.
>>
>> 1. The RutaBasic annotations are always created. RETAINTYPE and
>> FILTERTYPE have no influence on annotation creation. They only influence
>> the use of those types in rules.
>>
>
> yes
>
>> 2. The configuration parameter seeders only adds additional seeders. It
>> cannot be used to remove the default seeder.
>
> No, the parameter specifies all seeders. The default value is set to
> the default seeder. If you set it to an empty list, no seeders should be
> applied. If you want to use your own seeder, you simply set the
> parameter to your implementation.
>
> (I am really sure of that, but I will check it again...)
>
> Best,
>
> Peter
>
>> So how do I tell Ruta not to use the default seeder? How do I tell Ruta
>> to use my own seeder? Do I have to replace
>> org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this break Ruta?
>>
>> Best,
>> Armin
>>
>> -----Original Message-----
>> From: Peter Klügl [mailto:peter.klu...@averbis.com]
>> Sent: Wednesday, 10 August 2016 14:50
>> To: user@uima.apache.org
>> Subject: Re: Ruta 2.4.0 - High memory needs
>>
>> Hi,
>>
>> 18 MB of text in a CAS, well, that's quite a big sofa.
>>
>> Yes, there are some tricks and best practices.
>>
>> First of all, there is the configuration parameter "lowMemoryProfile",
>> which reduces the information stored in RutaBasic. It should reduce the
>> memory usage considerably, but the processing will take longer,
>> especially if the type hierarchy is rather deep. The unit tests for it
>> do not cover all functionality of Ruta. I only run all unit tests with
>> this option once in a while, and I haven't done this for some time.
>>
>> The second thing to do in order to reduce the memory usage is to
>> minimize the annotations and especially the RutaBasic annotations. These
>> are automatically created and build up a minimal, atomic partitioning of
>> the document. This means that you should create only annotations as
>> small as you need them, and only annotations where you need them. The
>> first option here is to remove the seeder if you do not rely on
>> these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
>> tokenizer if you did not include one anyway. This will rid you of
>> the annotations for whitespace and so on and the corresponding
>> RutaBasic annotations. Maybe you also do not need any kind of annotation
>> for each section (e.g., restrict the matching window). Optimization
>> strongly depends on the use case and the actual rules.
>>
>> Please mind that text spans without any annotations will be considered
>> invisible concerning sequential matching.
>>
>> btw, the speed of your rules can be improved, especially with the
>> upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
>> conditions in Ruta. I'd rather recommend something like:
>>
>> Full->{ANY @Full{-> UNMARK(Full)};Full{-> UNMARK(Full)} ANY;};
>>
>> Best,
>>
>> Peter
>>
>> On 09.08.2016 at 12:37, armin.weg...@bka.bund.de wrote:
>>> Hello again!
>>>
>>> One down, one to go. Are there best practices or tricks to reduce Ruta's
>>> memory needs? I tried to use the following script to merge names.
>>>
>>> Document{->GREEDYANCHORING(true)};
>>> First+ Full {->MARK(Full)};
>>> Full Last+ {->MARK(Full)};
>>> First+ Last+ {->MARK(Full)};
>>> Document{->GREEDYANCHORING(false)};
>>> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
>>> First{PARTOF(Full) -> UNMARK(First)};
>>> Last{PARTOF(Full) -> UNMARK(Last)};
>>>
>>> The engine description is created by ruta-maven-plugin:2.4.0 and used
>>> with uimaFIT's
>>> AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
>>> For an 18 Mbyte text, it needs Gbytes of RAM.
>>>
>>> Cheers,
>>> Armin
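
To illustrate why fewer and coarser annotations mean fewer RutaBasic annotations, here is a rough sketch of a "minimal, atomic partitioning": every annotation begin or end is a boundary, and each covered stretch between two neighbouring boundaries becomes one atomic segment. This is not Ruta's implementation. The class name, the method names and the offsets (for the made-up text "Dr. John Smith") are assumptions chosen for the example. Stretches covered by no annotation produce no segment here, in line with the remark that such text is invisible to sequential matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not part of Ruta: counts the atomic segments
// induced by a set of annotation spans given as {begin, end} offsets.
class AtomicPartitionSketch {

    // Returns the atomic segments {begin, end} between neighbouring
    // annotation boundaries that are covered by at least one span.
    public static List<int[]> partition(int[][] spans) {
        TreeSet<Integer> boundaries = new TreeSet<>();
        for (int[] span : spans) {
            boundaries.add(span[0]);
            boundaries.add(span[1]);
        }
        List<int[]> segments = new ArrayList<>();
        Integer previous = null;
        for (int boundary : boundaries) {
            if (previous != null && covered(spans, previous, boundary)) {
                segments.add(new int[] {previous, boundary});
            }
            previous = boundary;
        }
        return segments;
    }

    // True if some span fully contains the stretch from begin to end.
    public static boolean covered(int[][] spans, int begin, int end) {
        for (int[] span : spans) {
            if (span[0] <= begin && end <= span[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "Dr. John Smith" with fine-grained tokens (including whitespace)
        // plus one Full annotation over "John Smith".
        int[][] withSeeder = {
            {0, 2}, {2, 3}, {3, 4}, {4, 8}, {8, 9}, {9, 14}, {4, 14}
        };
        // Only First, Last and Full, with no token-level annotations.
        int[][] withoutSeeder = {{4, 8}, {9, 14}, {4, 14}};

        System.out.println(partition(withSeeder).size());    // prints 6
        System.out.println(partition(withoutSeeder).size()); // prints 3
    }
}
```

In the sketch, dropping the token-level annotations halves the number of atomic segments for this tiny text. For an 18 MB document the same effect applies to every whitespace and punctuation token.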