Hi,
we are using a separate component for dictionary lookup, which can combine
multiple dictionaries and can also assign arbitrary feature values. Most
language-dependent information is extracted into language-specific
dictionaries, plus some language-independent dictionaries. There is a ticket
to contribute parts of this implementation to Ruta in order to replace
WORDLIST, WORDTABLE and TRIE. It's faster, more powerful and more stable,
but I have not had the time to migrate it yet.

Most Ruta rules are language-independent. Some rules focus on different
constraints for separators, e.g., the space used as the thousands separator
in some languages instead of commas or periods.

I think this combination, as much fast dictionary lookup as possible
followed by sequential rules for creating more complex expressions where
necessary, is a good choice with respect to speed vs. maintainability. I
have no recent numbers concerning throughput, but there has always been
another component that needed to be optimized first.

I see no need for additional Ruta language extensions since they
increase the complexity of the pipeline without providing considerable
advantages for this use case.

Best,

Peter

On 29.08.2019 at 11:39, Nikolai Krot wrote:
> Hi Peter,
>
> From *your* perspective, for this particular task of turning written-out
> numbers into their numerical representation, which would be better: to
> implement it as a language extension (= one additional function) or as a
> set of Ruta rules?
> Against a language extension speaks the fact that such a conversion may be
> language-dependent, that is, it does not generalize well. On the other
> hand, a language extension may be faster than plain Ruta rules. Is the
> implementation of this functionality that you have at your company good in
> terms of speed?
>
> Best regards,
> Nikolai KROT
>
> On Wed, Aug 28, 2019 at 1:48 PM Peter Klügl <peter.klu...@averbis.com>
> wrote:
>
>> Hi,
>>
>> we (Averbis) have an annotator which does exactly what you describe, but
>> unfortunately I cannot share it. However, I can tell you that the
>> annotator is almost completely implemented in Ruta and uses no Ruta
>> language extensions.
>>
>> If you want to learn more about language extensions, there are
>> example projects in the Ruta trunk: ruta-core-ext and
>> example-projects/ruta-ep-example-extensions
>>
>> If you want to build the annotator with Ruta rules, I can help you
>> create it.
>>
>> As a starting point you need some dictionaries (wordtables) for numbers
>> (ein;1\neins;1\nzwei;2....), exponents/multiplicators (tausend;3) and
>> special characters (½). For German that's not too much; maybe one
>> hundred entries overall is a good start.
>>
>> Before you can apply the dictionaries, you need to split the RutaBasics
>> using some conjunction words in order to map the subword segments. You
>> can do that with a simple regex rule:
>>
>> "und" -> ConjunctionFragment;
>>
>> Then, you can write some rules that combine numbers using additions,
>> multiplications and exponents, e.g., something like:
>>
>> FOREACH(num, false) NumericValue{}{
>>
>>     // combination with multipliers like 3 million
>>     (num{IS(NumericValue) -> SHIFT(NumericValue,1,4)}
>>         SPECIAL?{REGEXP("-"), NEAR(W,0,1,true)}
>>         (
>>             Multiplicator{-> num.value = (num.value * (POW(10, Multiplicator.value)))}
>>             add2:NumericValue?{-> num.value = (num.value + add2.value), UNMARK(add2)}
>>         )*);
>>
>>     // fünfundzwanzig
>>     (num{PARTOF(W) -> SHIFT(NumericValue,1,3)} ConjunctionFragment
>>         add:NumericValue.value!=0{PARTOF(W), IF((NumericValue.value%1) == 0) -> UNMARK(add)})
>>         {-> num.value = (num.value + add.value)};
>>
>> }
>>
>> At the end you get about 200 lines of Ruta ...
>>
>> Best,
>>
>> Peter
>>
>> On 27.08.2019 at 16:30, Dominik Terweh wrote:
>>> Dear All,
>>>
>>> When working with German written-out numbers I figured that, in order
>>> to get what I want (the numeric value of a written number), I need to
>>> either hard-code every single number name and use Wordtable, or I need
>>> to work with the string. However, this made me think that this
>>> would probably be better done in a Language Extension. Unfortunately, I
>>> am not sure how these work and how I can include them in my project.
>>> Also, the manual did not really help me there
>>> (https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.extensions).
>>>
>>> Further, I was wondering if there are any readily available extensions
>>> that can be used, e.g., to convert a string of number words into actual
>>> numbers (or to replace words on a dictionary basis, such as “one”:”1”,
>>> “two”:”2”,…), or an extension that can evaluate a calculation in the
>>> form of a string (like “100*5+55”).
>>> If something exists for number
>>> conversion, it would be interesting to see if it does both, annotation
>>> and calculation, and how it handles different languages, such as:
>>>
>>> 1) input is one token (like numbers in German, einundzwanzig)
>>>
>>> 2) input is several tokens jointly representing one number (like in
>>> English: twenty two)
>>>
>>> And mixed cases such as:
>>>
>>> 3) input is a combination of number and string (like: 10 Millionen)
>>>
>>> Thank you in advance for your help,
>>>
>>> Best
>>>
>>> Dominik
>>>
>>> Dominik Terweh
>>> Intern
>>>
>>> *Drooms GmbH*
>>> Eschersheimer Landstraße 6
>>> 60322 Frankfurt, Germany
>>> www.drooms.com <http://www.drooms.com>
>>>
>>> Mail: d.ter...@drooms.com
>>>
>>> *Drooms GmbH*; Registered Office: Eschersheimer Landstr. 6,
>>> D-60322 Frankfurt am Main; Management Board: Alexandre Grellier;
>>> Court of Registration: Amtsgericht Frankfurt am Main, HRB 76454;
>>> Tax Office: Finanzamt Frankfurt am Main, USt-IdNr.: DE 224007190
>>>
>> --
>> Dr. Peter Klügl
>> R&D Text Mining/Machine Learning
>>
>> Averbis GmbH
>> Salzstr. 15
>> 79098 Freiburg
>> Germany
>>
>> Fon: +49 761 708 394 0
>> Fax: +49 761 708 394 10
>> Email: peter.klu...@averbis.com
>> Web: https://averbis.com
>>
>> Headquarters: Freiburg im Breisgau
>> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
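
As a rough illustration of the arithmetic discussed in this thread
(wordtable lookup, splitting at "und", addition, and multiplication by
powers of ten), here is a minimal Python sketch. It is not Ruta and it is
not the Averbis annotator. The dictionary contents, the function names
`segments` and `to_number`, the greedy longest-match segmentation and the
special handling of "hundert" are all assumptions made for this example only.

```python
import re

# Hypothetical, deliberately tiny wordtables in the spirit of "ein;1", "zwei;2".
NUMBERS = {
    "ein": 1, "eins": 1, "zwei": 2, "drei": 3, "vier": 4, "fünf": 5,
    "sechs": 6, "sieben": 7, "acht": 8, "neun": 9, "zehn": 10,
    "elf": 11, "zwölf": 12, "zwanzig": 20, "dreißig": 30,
    "vierzig": 40, "fünfzig": 50,
}
# Multiplicators store the exponent of ten, as in "tausend;3".
MULTIPLICATORS = {"hundert": 2, "tausend": 3, "million": 6, "millionen": 6}
CONJUNCTION = "und"

# Longest entries first, so the longest dictionary word matching at a position wins.
_WORDS = sorted([*NUMBERS, *MULTIPLICATORS, CONJUNCTION], key=len, reverse=True)
_SEGMENT = re.compile("|".join(map(re.escape, _WORDS)) + r"|\d+")


def segments(text):
    """Split text into known subword segments; raise on anything unknown."""
    compact = re.sub(r"[\s-]+", "", text.lower())
    found, pos = [], 0
    while pos < len(compact):
        match = _SEGMENT.match(compact, pos)
        if match is None:
            raise ValueError(f"unknown segment at {compact[pos:]!r}")
        found.append(match.group())
        pos = match.end()
    return found


def to_number(text):
    """Combine segments by addition and by multiplication with powers of ten."""
    total, group = 0, 0
    for seg in segments(text):
        if seg == CONJUNCTION:
            continue  # "und" only joins two addends
        if seg in MULTIPLICATORS:
            factor = 10 ** MULTIPLICATORS[seg]
            if factor == 100:
                group = (group or 1) * factor  # stays inside the current group
            else:
                total += (group or 1) * factor  # closes the current group
                group = 0
        elif seg in NUMBERS:
            group += NUMBERS[seg]
        else:
            group += int(seg)  # digits, e.g. "10" in "10 Millionen"
    return total + group


print(to_number("einundzwanzig"))   # 21
print(to_number("fünfundzwanzig"))  # 25
print(to_number("10 Millionen"))    # 10000000
```

Keeping the vocabulary in plain tables and the combination logic separate
follows the split suggested above: per-language dictionaries, with mostly
language-independent rules on top.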