Re: Ruta 2.4.0 - High memory needs

Armin.Wegner Thu, 18 Aug 2016 05:37:55 -0700

Hi Peter,

That doesn't work for me. I removed DefaultSeeder and added my own 
seeder implementing RutaAnnotationSeeder. Now I get all of Ruta's standard 
tokens plus my own tokenization at the same time.


Cheers,
Armin

-----Original Message-----
From: Peter Klügl [mailto:peter.klu...@averbis.com] 
Sent: Thursday, 18 August 2016 14:23
To: user@uima.apache.org
Subject: Re: Ruta 2.4.0 - High memory needs

Hi,


On 18.08.2016 at 14:17, armin.weg...@bka.bund.de wrote:
> Hello Peter!
>
> Please correct me if I'm wrong. My understanding of how Ruta works is as 
> follows. 
>
> 1. The RutaBasic annotations are always created. RETAINTYPE and FILTERTYPE 
> have no influence on annotation creation. They only influence the use of 
> those types in rules.
>


Yes.


> 2. The configuration parameter seeders only adds additional seeders. It 
> cannot be used to remove the default seeder.

No, the parameter specifies all seeders. The default value is set to
the default seeder. If you set it to an empty list, no seeders should be
applied. If you want to use your own seeder, you simply set the
parameter to your implementation.

(I am really sure of that, but I will check it again...)
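
A minimal sketch of that configuration, assuming the engine is created with uimaFIT as in the original question. The parameter name "seeders" is taken from this thread, and "com.example.MySeeder" is a hypothetical class standing in for your own RutaAnnotationSeeder implementation. The sketch only builds the name/value pairs and is not verified against a running Ruta 2.4.0 engine.

```java
import java.util.Arrays;

public class SeederConfigSketch {

    // Parameter name as used in this thread (assumed to match the descriptor).
    public static final String PARAM_SEEDERS = "seeders";

    /**
     * Builds name/value pairs in the layout that uimaFIT factory methods
     * accept as trailing configuration data. The given class names replace
     * the whole seeder list; passing none yields an empty list.
     */
    public static Object[] configurationData(String... seederClassNames) {
        return new Object[] { PARAM_SEEDERS, seederClassNames };
    }

    public static void main(String[] args) {
        // Hypothetical own seeder instead of the default one.
        Object[] ownSeeder = configurationData("com.example.MySeeder");
        // Empty list: no seeder should be applied at all.
        Object[] noSeeder = configurationData();

        System.out.println(Arrays.deepToString(ownSeeder));
        System.out.println(Arrays.deepToString(noSeeder));

        // With uimaFIT and Ruta on the classpath, the pairs would be passed as:
        // AnalysisEngineFactory.createEngineDescription(descriptorName, ownSeeder);
    }
}
```

The point of the sketch is that the value is the complete list of seeders, not an addition to the default one.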


Best,

Peter

> So how do I tell Ruta not to use the default seeder? How do I tell Ruta to 
> use my own seeder? Do I have to replace 
> org.apache.uima.ruta.seed.DefaultSeeder.java? Won't this break Ruta?
>
> Best,
> Armin
>
>
> -----Original Message-----
> From: Peter Klügl [mailto:peter.klu...@averbis.com] 
> Sent: Wednesday, 10 August 2016 14:50
> To: user@uima.apache.org
> Subject: Re: Ruta 2.4.0 - High memory needs
>
> Hi,
>
>
> 18 MB of text in a CAS, well, that's quite a big sofa.
>
>
> Yes, there are some tricks and best practices.
>
>
> First of all, there is the configuration parameter "lowMemoryProfile",
> which reduces the information stored in RutaBasic. It should reduce the
> memory usage considerably, but the processing will take longer,
> especially if the type hierarchy is rather deep. The unit tests for it
> do not cover all functionality of Ruta. I only run all unit tests with
> this option once in a while, and I haven't done this for some time.
>
>  
>
> The second thing to do in order to reduce the memory usage is to
> minimize the annotations and especially the RutaBasic annotations. These
> are automatically created and build up a minimal, atomic partitioning of
> the document. This means that you should create only annotations as
> small as you need them, and only annotations where you need them. The
> first option here is to remove/replace the seeder if you do not rely on
> these annotations (ANY, CW, NUM, PERIOD, ...), or replace it with a
> tokenizer if you did not include one anyway. This will rid you of
> the annotations for whitespaces and so on and the corresponding
> RutaBasic annotations. Maybe you also do not need any kind of annotation
> for each section (e.g., restrict the matching window). Optimization
> strongly depends on the use case and the actual rules.
>
> Please keep in mind that text spans without any annotations are
> considered invisible for sequential matching.
>
>
> Btw, the speed of your rules can be improved, especially with the
> upcoming 2.5.0 release. Besides that, PARTOFNEQ is one of the slowest
> conditions in Ruta. I'd rather recommend something like:
>
> Full->{ANY @Full{-> UNMARK(Full)}; Full{-> UNMARK(Full)} ANY;};
>
>
> Best,
>
>
> Peter
>
>
> On 09.08.2016 at 12:37, armin.weg...@bka.bund.de wrote:
>> Hello again!
>>
>> One down, one to go. Are there best practices or tricks to reduce Ruta's 
>> memory needs? I tried to use the following script to merge names. 
>>
>> Document{->GREEDYANCHORING(true)};
>> First+ Full {->MARK(Full)};
>> Full Last+ {->MARK(Full)};
>> First+ Last+ {->MARK(Full)};
>> Document{->GREEDYANCHORING(false)};
>> Full{PARTOFNEQ(Full) -> UNMARK(Full)};
>> First{PARTOF(Full) -> UNMARK(First)};
>> Last{PARTOF(Full) -> UNMARK(Last)};
>>
>> The engine description is created by ruta-maven-plugin:2.4.0 and used with 
>> uimaFIT's 
>> AnalysisEngineFactory.createEngineDescription("fullyQualifiedDescriptorNameWithoutXmlExtension").
>>  For an 18 MB text, it needs gigabytes of RAM.
>>
>> Cheers,
>> Armin
