Hi,
On 29.08.2019 at 15:34, Dominik Terweh wrote:
> Hey,
>
> I tried to understand the rules that you suggested and have a few questions
> (see below).
> What we have (successfully) implemented so far is a set of rules that change
> the value of the stored string, in order to produce some kind of expression
> that is evaluated subsequently:
> a) replace numbers: "eins" becomes "(1)", "zwei|zwan" becomes "(2)"...
> b) replace factors: "zig" becomes "*(10)", "hundert" becomes "*(100)"
> and remove "und"
> c) other Ruta rules interpret the expression in chain-like order
>
> "dreimillionenzweitausendvierhunderteinundzwanzig"
> a) "(3)millionen(2)tausend(4)hundert(1)und(2)zig"
> b) "(3)*(1000000)(2)*(1000)(4)*(100)(1)(2)*(10)"
> c) "(3)*(1000000)(2)*(1000)(400)(21)" => "(3)*(1000000)(2)*(1000)(421)" =>
> "(3000000)(2000)(421)" => "(3000000)(2421)" => "(3002421)"
>
> However, we use replaceAll(string, pattern, pattern) in all these
> transformations and fear that it might not be the optimal solution for UIMA
> Ruta.
> Do you have any suggestions?
Why do you want to use a string feature to represent the numeric value?
I would assume that switching to a double/int feature makes it a lot
easier as you can directly perform the calculations.
Btw here's our type system for numeric values:
https://github.com/averbis/core-typesystems/blob/master/numeric-value-typesystem/src/main/resources/de/averbis/textanalysis/typesystems/NumericValueTypeSystem.xml
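To illustrate why a numeric feature is easier than rewriting a string expression, here is a minimal sketch in Python (not Ruta, and not the Averbis annotator). It assumes the word has already been split into fragments; the dictionaries UNITS and MULTIPLIERS and the function evaluate are hypothetical names made up for this example.

```python
# Hypothetical lookup tables; in Ruta these values would come from dictionaries.
UNITS = {"ein": 1, "eins": 1, "zwei": 2, "drei": 3, "vier": 4, "zwanzig": 20}
MULTIPLIERS = {"tausend": 1_000, "millionen": 1_000_000}


def evaluate(tokens):
    """Compute the integer value of a sequence of German number fragments."""
    total = 0    # sum of the groups that are already closed (thousands, millions)
    current = 0  # value of the group that is still being built (below 1000)
    for token in tokens:
        if token == "und":
            continue  # the conjunction carries no numeric value
        if token in UNITS:
            current += UNITS[token]
        elif token == "hundert":
            # "hundert" scales the current group; a bare "hundert" counts as 100
            current = max(current, 1) * 100
        elif token in MULTIPLIERS:
            # a large multiplier closes the current group
            total += max(current, 1) * MULTIPLIERS[token]
            current = 0
        else:
            raise ValueError(f"unknown fragment: {token}")
    return total + current


print(evaluate(["drei", "millionen", "zwei", "tausend",
                "vier", "hundert", "ein", "und", "zwanzig"]))
print(evaluate(["zwei", "tausend", "eins"]))
```

The two calls correspond to "dreimillionenzweitausendvierhunderteinundzwanzig" and "zweitausendeins". No intermediate string is rebuilt; the value is accumulated as an integer in a single pass.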
>
> Here are the questions for your rules:
> 1)
>> Before you can apply the dictionaries, you need to split the RutaBasics
>> using some conjunction words in order to map the subword segments.
> How exactly can I do that? I know there is SPLIT(), but that can only split an
> annotation
> on the basis of another annotation inside it, or am I misunderstanding it?
> If I could split words, then German agglutinated numbers would be no
> problem (since we have a working solution for English).
In Ruta, you can use simple regex rules for splitting up annotations. If
you have a rule like:
"und" -> ConjunctionFragment;
Then the "und" within the word fünfundzwanzig is annotated with the type
ConjunctionFragment since the simple regex rules are not bound to
annotations at all.
As a side effect, the RutaBasics are updated: at first there was
only one for the W, afterwards there are three. The WORDTABLE operates
on RutaBasic annotations and is therefore able to find "fünf"=5 and
"zwanzig"=20.
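The effect of such a splitting rule followed by a dictionary lookup can be sketched outside of Ruta as well. The following Python snippet is only an illustration under stated assumptions: NUMBER_WORDS is a hypothetical stand-in for a WORDTABLE resource, and split_on_conjunction and lookup are made-up helper names, not Ruta functionality.

```python
import re

# Hypothetical mini-dictionary standing in for a WORDTABLE resource.
NUMBER_WORDS = {"fünf": 5, "zwanzig": 20, "drei": 3, "dreißig": 30}


def split_on_conjunction(word, conjunction="und"):
    """Return the fragments around each conjunction match and the match spans."""
    spans = [m.span() for m in re.finditer(re.escape(conjunction), word)]
    fragments = []
    start = 0
    for begin, end in spans:
        if begin > start:
            fragments.append(word[start:begin])
        start = end
    if start < len(word):
        fragments.append(word[start:])
    return fragments, spans


def lookup(fragments):
    """Map each fragment to its dictionary value (None if it is unknown)."""
    return [(fragment, NUMBER_WORDS.get(fragment)) for fragment in fragments]


fragments, spans = split_on_conjunction("fünfundzwanzig")
print(fragments)
print(lookup(fragments))
```

One caveat of matching the plain string "und": it also occurs inside "hundert", so a real rule set presumably has to guard against that case, for example by annotating "hundert" first.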
> 2)
> Is there a special reason, why you use 3 for 'thousand', when you use it with
> POW(10, x)? Intuitively I would just use 1000.
No, I think someone (me?) thought it would be more elegant.
>
> 3)
> In your "combination with multipliers like 3 million"-rule (Rule 1), you
> shift the annotation to span over (1,4), should it not be (1,3)?
ah yes, that's a typo.
> 4)
> In Rule 1, is num{IS(NumericValue) -> SHIFT(NumericValue,1,4)} just a
> different way of writing num:NumericValue{-> SHIFT(NumericValue,1,4)}?
The "num" is the variable of the FOREACH block, which in this case
operates from right to left.
So, all rules of the block are applied to each NumericValue
in turn, one annotation after the other. It is a bit more like an FST.
The reverse order was selected because of some of the calculations.
Your second variant would be applied to all NumericValue annotations
before the next rule is executed.
>
> 5)
> What exactly is the function of the NEAR() in your Rule 1? Is it there to
> match only "3", "3-Million" and "3-Million" but not "3-"?
Yes.
(Actually, I would not use NEAR here)
> 6)
> I tried to play Rule 1 through in my head with "zweitausendeins" and
> "dreimillionenzweitausendeins":
> This works well for the first example
This rule was maybe not a good example after all.
I have to check it in the context of the block, but AFAIR it would not
be applied to these examples in our rule set (but to others).
Best,
Peter
> (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
> //value = 2
>
> (Multiplicator{-> num.value = (2 * (POW(10,3)))}
> //value = 2000
> add2:NumericValue?{-> num.value = (2000 + 1), UNMARK(add2)}));
> //value = 2001
>
>
> But fails for the second:
>
> (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
> //value = 3
>
> (Multiplicator{-> num.value = (3 * (POW(10,6)))}
> //value = 3000000
> add2:NumericValue?{-> num.value = (3000000 + 2), UNMARK(add2)})
> //value = 3000002, after 1st iteration
>
> (Multiplicator{-> num.value = (3000002 * (POW(10,3)))}
> //value = 3000002000
> add2:NumericValue?{-> num.value = (3000002000 + 1), UNMARK(add2)}));
> //value = 3000002001
>
> On 28.08.19, 13:48, "Peter Klügl" wrote:
>
> Hi,
>
>
> we (Averbis) have an annotator which does exactly what you describe, but
> unfortunately I cannot share it. However, I can tell you that the annotator
> is almost completely implemented in Ruta and uses no Ruta language
> extensions.
>
>
> If you want to learn more about language extensions, then there are
> ex