Re: Using extensions

Peter Klügl Thu, 29 Aug 2019 05:23:00 -0700

Hi,


we are using a separate component for dictionary lookup, which can
combine multiple dictionaries and can also assign arbitrary feature
values. Most language-dependent information is extracted to
language-specific dictionaries and some language-independent
dictionaries. There is a ticket to contribute parts of this
implementation to Ruta in order to replace WORDLIST, WORDTABLE and TRIE.
It's faster, more powerful and more stable. I have not had the time yet
to migrate it.
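
As a rough illustration of combining several dictionaries that each
assign feature values, here is a minimal Python sketch. It is not the
component described above: the names parse_wordtable and
CombinedDictionary, the feature names and the sample entries are
assumptions made up for this example, and the "word;value" layout is
borrowed from the wordtable entries quoted further down in this thread.

```python
from typing import Dict, Optional


def parse_wordtable(text: str, feature: str) -> Dict[str, Dict[str, float]]:
    """Parse 'word;value' lines into {word: {feature: value}}."""
    entries: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        word, value = line.split(";", 1)
        entries[word.lower()] = {feature: float(value)}
    return entries


class CombinedDictionary:
    """Looks a token up in several dictionaries; earlier ones take precedence."""

    def __init__(self, *dictionaries: Dict[str, Dict[str, float]]) -> None:
        self._dictionaries = dictionaries

    def lookup(self, token: str) -> Optional[Dict[str, float]]:
        key = token.lower()
        for dictionary in self._dictionaries:
            if key in dictionary:
                return dictionary[key]
        return None


# Hypothetical sample data: two German dictionaries and a language-independent one.
german_numbers = parse_wordtable("ein;1\neins;1\nzwei;2", feature="value")
german_exponents = parse_wordtable("tausend;3", feature="exponent")
independent = parse_wordtable("½;0.5", feature="value")

combined = CombinedDictionary(german_numbers, german_exponents, independent)
```

Here combined.lookup("Zwei") yields the features of the German entry,
while a token that is in none of the dictionaries yields None.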

Most Ruta rules are language-independent. Some rules focus on different
constraints for separators, e.g., the space separator for thousands in
some languages instead of commas or periods.
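
To make the separator constraint concrete, here is a small Python
sketch (not Ruta, and not taken from our rules). The function name
parse_grouped_integer is hypothetical, and the caller is assumed to
know which thousands separator applies to the document's language.

```python
import re
from typing import Optional


def parse_grouped_integer(text: str, separator: str) -> Optional[int]:
    """Return the integer written with the given thousands separator.

    Returns None if the digits are not grouped as 1-3 leading digits
    followed by groups of exactly three digits.
    """
    sep = re.escape(separator)
    pattern = rf"\d{{1,3}}(?:{sep}\d{{3}})*"
    if re.fullmatch(pattern, text) is None:
        return None
    return int(text.replace(separator, ""))
```

With this, "1 000 000" parses with a space separator and "1.000.000"
with a period, while a malformed grouping such as "1.00.000" is
rejected.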


I think this combination, as much fast dictionary lookup as possible
followed by sequential rules for creating more complex expressions where
necessary, is a good choice with respect to speed vs. maintainability.
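
The two stages can be sketched in Python as follows. This is only an
illustration of the arithmetic, not the Ruta implementation: the tiny
dictionaries and the function names lookup_number and combine are
assumptions for this example, and only simple cases such as
"fünfundzwanzig" or "3 Millionen" are handled.

```python
from typing import List, Optional

# Hypothetical miniature dictionaries; a real setup would hold many more entries.
NUMBERS = {"ein": 1, "eins": 1, "zwei": 2, "drei": 3, "fünf": 5, "zwanzig": 20}
EXPONENTS = {"tausend": 3, "million": 6, "millionen": 6}
CONJUNCTION = "und"


def lookup_number(token: str) -> Optional[int]:
    """Stage 1: dictionary lookup, splitting compounds at the conjunction."""
    token = token.lower()
    if token.isdecimal():
        return int(token)
    if token in NUMBERS:
        return NUMBERS[token]
    left, found, right = token.partition(CONJUNCTION)
    if found and left in NUMBERS and right in NUMBERS:
        return NUMBERS[left] + NUMBERS[right]  # e.g. fünf + zwanzig
    return None


def combine(tokens: List[str]) -> Optional[int]:
    """Stage 2: sequential combination of numbers and multipliers."""
    total = 0
    current: Optional[int] = None
    for token in tokens:
        exponent = EXPONENTS.get(token.lower())
        if exponent is not None:
            # A multiplier scales the preceding number (or 1 if there is none).
            factor = 1 if current is None else current
            total += factor * 10 ** exponent
            current = None
            continue
        value = lookup_number(token)
        if value is None or current is not None:
            return None  # unknown token or two numbers in a row
        current = value
    if current is not None:
        total += current
    return total if tokens else None
```

For example, combine(["3", "Millionen"]) multiplies 3 by 10**6, and
combine(["fünfundzwanzig"]) adds 5 and 20.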

I have no recent numbers concerning throughput, but there has always
been another component that needed to be optimized first.


I see no need for additional Ruta language extensions since they
increase the complexity of the pipeline without providing considerable
advantages for the use case.


Best,


Peter




Am 29.08.2019 um 11:39 schrieb Nikolai Krot:
> Hi Peter,
>
> From *your* perspective, for this particular task of turning written-out
> numbers into their numerical representation, which would be better:
> implementing it as a language extension (= one additional function) or as
> a set of Ruta rules?
> What speaks against a language extension is that such a conversion may be
> language-dependent, that is, it does not generalize well. On the other hand,
> the language extension may be faster than plain Ruta rules. Is the
> implementation of this functionality that you have at your company good in
> terms of speed?
>
> Best regards,
> Nikolai KROT
>
> On Wed, Aug 28, 2019 at 1:48 PM Peter Klügl <peter.klu...@averbis.com>
> wrote:
>
>> Hi,
>>
>>
>> we (Averbis) have an annotator which does exactly what you describe, but
>> unfortunately I cannot share it. However, I can tell you that the
>> annotator is almost completely implemented in Ruta and uses no Ruta
>> language extensions.
>>
>>
>> If you want to learn more about language extensions, then there are
>> example projects in the Ruta trunk: ruta-core-ext and
>> example-projects/ruta-ep-example-extensions
>>
>>
>> If you want to build the annotator with Ruta rules, I can help you
>> create it.
>>
>>
>> As a starting point you need some dictionaries (wordtables) for numbers
>> (ein;1\neins;1\nzwei;2....) , exponents/multiplicators (tausend;3) and
>> special characters (½). For German that's not too much, maybe one
>> hundred entries overall is a good start.
>>
>> Before you can apply the dictionaries, you need to split the RutaBasics
>> using some conjunction words in order to map the subword segments. You
>> can do that with a simple regex rule:
>>
>> "und" -> ConjunctionFragment;
>>
>> Then, you can write some rules that combine numbers using additions,
>> multiplications and exponents, e.g., something like:
>>
>>
>> FOREACH(num, false) NumericValue{}{
>>
>>         // combination with multipliers like 3 million
>>         (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)} SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
>>             (
>>                 Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))}
>>                 add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(add2)}
>>             )*);
>>
>>
>>         // fünfundzwanzig
>>         (num{PARTOF(W)-> SHIFT(NumericValue,1,3)} ConjunctionFragment
>>             add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(add)})
>>             {-> num.value = (num.value + add.value)};
>>
>> }
>>
>>
>> At the end you get about 200 lines of Ruta ...
>>
>>
>>
>>
>> Best,
>>
>>
>> Peter
>>
>> Am 27.08.2019 um 16:30 schrieb Dominik Terweh:
>>> Dear All,
>>>
>>>
>>>
>>> When working with German written-out numbers I figured that, in order
>>> to get what I want (the numeric value of a written number), I need to
>>> either hard-code every single number name and use Wordtable or
>>> work with the string. However, this made me think that this
>>> would probably be better done in a Language Extension. Unfortunately I
>>> am not sure how these work and how I can include them in my project.
>>> Also, the manual did not really help me there
>>> (https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.extensions).
>>>
>>>
>>>
>>> Further I was wondering if there are any readily available extensions
>>> that can be used, e.g. to convert a string of number words into actual
>>> numbers (or replacing words on a dictionary basis, such as “one”:”1”,
>>> “two”:”2”,…), or an extension, that can evaluate a calculation in the
>>> form of a string (like “100*5+55”).  If something exists for number
>>> conversion it would be interesting to see if it does both, annotation
>>> and calculation, and how it handles different languages such as:
>>>
>>> 1) input is one token (like numbers in German, e.g. einundzwanzig)
>>>
>>> 2) input is several tokens jointly representing one number (like in
>>> English: twenty two)
>>>
>>> And mixed cases such as:
>>>
>>> 3) input is combination of number and string (like: 10 Millionen)
>>>
>>>
>>>
>>> Thank you in advance for your help,
>>>
>>> Best
>>>
>>> Dominik
>>>
>>> Dominik Terweh
>>> Praktikant
>>>
>>> *Drooms GmbH*
>>> Eschersheimer Landstraße 6
>>> 60322 Frankfurt, Germany
>>> www.drooms.com <http://www.drooms.com>
>>>
>>> Phone:
>>> Mail:         d.ter...@drooms.com <mailto:d.ter...@drooms.com>
>>>
>>> *Drooms GmbH*; Sitz der Gesellschaft / Registered Office:
>>> Eschersheimer Landstr. 6, D-60322 Frankfurt am Main; Geschäftsführung
>>> / Management Board: Alexandre Grellier;
>>> Registergericht / Court of Registration: Amtsgericht Frankfurt am
>>> Main, HRB 76454; Finanzamt / Tax Office: Finanzamt Frankfurt am Main,
>>> USt-IdNr.: DE 224007190
>>>
>> --
>> Dr. Peter Klügl
>> R&D Text Mining/Machine Learning
>>
>> Averbis GmbH
>> Salzstr. 15
>> 79098 Freiburg
>> Germany
>>
>> Fon: +49 761 708 394 0
>> Fax: +49 761 708 394 10
>> Email: peter.klu...@averbis.com
>> Web: https://averbis.com
>>
>> Headquarters: Freiburg im Breisgau
>> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>>
>>
