Re: Usage of anchors

2019-08-30 Thread Peter Klügl
Hi,

On 29.08.2019 at 17:31, Nikolai Krot wrote:
> Hi Peter,
>
> Thank you for your answer. Is this the relevant issue:
> https://issues.apache.org/jira/browse/UIMA-3862 ?


Yes. (The description should really be more informative)


>
> Honestly, your answer is a revelation for me :) I originally thought that
> matching on literals should be faster because no extra step of preliminary
> annotation thereof is required. Can I expect a speed up if I implement the
> rules as follows:
>
> 1) all/most of literals that are found in rules are first wrapped into an
> annotation, say, WRD;
>
> MARKTABLE(WRD, VocabularyOfWordsAppearingInRules);
>
> 2) the rules that rely on these literals are rewritten to be something like
> this:
>
> ... @WRD.ct == "hello" ... {-> ACTION1};
> ... @WRD.ct == "world" ... {-> ACTION2};
>
> I'm just curious. We are trying to figure out the best tactics for
> writing the rules so that they run as fast as possible.


Yes, I would assume that it is faster. However, it depends on many
factors, e.g., the distributions of the words, length of the document
and the length of the rule and index of the anchor.

I would recommend the usage of FOREACH in order to avoid redundant index
matches on the same annotation.

In my use cases, the initialization of the stream is often relatively
expensive since there are many Ruta components in a pipeline that each
reindex the RutaBasics anew. Thus, the speed of a rule is sometimes not
as important as the combination with other annotators.
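
To illustrate the point about literal matching quoted further down in this
thread (no index on the covered text, so every RutaBasic has to be compared),
here is a minimal Python sketch. It is not Ruta code and not how Ruta is
implemented; the tuple-based "annotations", the `type_index` dictionary and
both function names are assumptions made only for this illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for a CAS: each annotation is (type, begin, end).
text = "hello big world hello"
annotations = []
pos = 0
for word in text.split(" "):
    annotations.append(("Token", pos, pos + len(word)))
    pos += len(word) + 1


def match_literal(literal):
    """No index on covered text: every token has to be inspected."""
    return [a for a in annotations if text[a[1]:a[2]] == literal]


# Annotate the vocabulary once (the role MARKTABLE plays in the thread)
# and keep the results in a per-type index.
vocabulary = {"hello", "world"}
type_index = defaultdict(list)
for a in annotations:
    if text[a[1]:a[2]] in vocabulary:
        type_index["WRD"].append(("WRD", a[1], a[2]))


def match_indexed(literal):
    """Only WRD annotations are visited, then filtered by covered text."""
    return [a for a in type_index["WRD"] if text[a[1]:a[2]] == literal]
```

Both functions find the same spans, but the second one only visits the
annotations of one type. How much this saves depends on the factors named
above, e.g. how many tokens are covered by the vocabulary.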


Best,


Peter


> Best regards,
> Nikolai
>
>
> On Thu, Aug 29, 2019 at 3:26 PM Peter Klügl 
> wrote:
>
>> Hi,
>>
>> On 29.08.2019 at 15:21, Nikolai Krot wrote:
>>> Hi Peter,
>>>
>>> thank you for your answer. Can you confirm my understanding (I have certain
>>> difficulty understanding stacked negations)
>>>
>>> * it may be a problem if a literal string in a rule is also an anchor
>>> (either explicitly set by user or selected by rule interpreter)
>>
>> yes, it is especially inefficient because there is no index on the
>> covered text. The rule element needs to evaluate every RutaBasic in the
>> current window (document) by comparing the covered text to the string
>> value. It is of course much slower since you could normally restrict the
>> type of annotation somehow and use an annotation index.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>>> Best regards,
>>> Nikolai
>>>
>>> On Thu, Aug 29, 2019 at 2:27 PM Peter Klügl 
>>> wrote:
>>>
 Hi,


 the second option should be preferred at least until UIMA-3862 is
 resolved with some additional indexing.

 It is of course not so problematic if the literal matching condition is
 not the starting anchor. However, it is still annoying that the rule
 elements need to be designed according to the dynamic partitioning of the
 RutaBasics. This easily leads to problems in larger pipelines.


 Best,


 Peter


 On 29.08.2019 at 11:59, Nikolai Krot wrote:
> Hi Peter,
>
> I have a question about this comment of yours:
>
> < ... but the matching using literal string expression is still really
> inefficient.
>
> What do you mean by "inefficient"? Do you mean it is slow? Say, if I want
> to use a literal in one hundred rules, what is a better strategy:
> 1) writing the string literally in each of these 100 rules; or
> 2) annotating the string (using MARKTABLE) and then using the annotation in
> these 100 rules?
>
> Best regards,
> Nikolai
>
> On Mon, Aug 26, 2019 at 2:27 PM Peter Klügl 
> wrote:
>
>> Hi,
>>
>>
>> On 21.08.2019 at 15:47, Dominik Terweh wrote:
>>> Hi Peter,
>>>
>>> Thanks a lot for the clarification. I was wondering about (10) too.
>>>
>>> Following your explanation I was wondering: does it make sense to anchor
>>> sequences, such as in (8), and is it "legal" to use multiple anchors in
>>> hierarchical fashion?
>>> Like A @(B @C D)?
>> Yes, it is "legal", but you have to be careful. (There are not enough
>> unit tests for those rules)
>>
>>
>>> Also, is there a difference between the processing of sequences of
>>> annotations or literals (given "A" is annotated as A and so on)?
>>> A @(B C D)
>>> Vs
>>> "A" @("B" "C" "D")
>>> Vs
>>> A @("B" C "D")
>> It should not make a difference for the result, but the matching using
>> literal string expression is still really inefficient.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>>> Best
>>> Dominik
>>>
>>>
>>>
>>> Dominik Terweh

Re: Using extensions

2019-08-30 Thread Peter Klügl
Hi,

On 29.08.2019 at 15:34, Dominik Terweh wrote:
> Hey,
>
> I tried to understand the rules that you suggested and have a few questions 
> (see below).
> What we have (successfully) implemented so far is a set of rules that change 
> the value of the stored string, in order to produce some kind of expression 
> that is evaluated subsequently:
> a) replace numbers: "eins" becomes "(1)", "zwei|zwan" becomes "(2)"...
> b) replace factors: "zig" becomes "*(10)", "hundert" becomes "*(100)"
> and remove "and"
> c) other ruta rules interpret the expression in chain-like order
>
> "dreimillionenzweitausendvierhunderteinundzwanzig"
> a) "(3)millionen(2)tausend(4)hundert(1)und(2)zig"
> b) "(3)*(100)(2)*(1000)(4)(100)(1)(20)"
> c) "(3)*(100)(2)*(1000)(400)(21)" => "(3)*(100)(2)*(1000)(421)" => 
> "(300)(2000)(421)" => "(300)(2421)" => "(3002421)"
>
> However, we use replaceAll(string, pattern, patter) in all these 
> transformations and fear that it might not be the optimal solution for UIMA 
> Ruta.
> Do you have any suggestion?


Why do you want to use a string feature to represent the numeric value?

I would assume that switching to a double/int feature makes it a lot
easier as you can directly perform the calculations.
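
As a rough illustration of what "directly perform the calculations" could look
like, here is a small Python sketch (not Ruta, and not the Averbis annotator).
The token representation and the `combine` function are hypothetical; the
sketch only assumes that each word fragment has already been mapped to either
a plain number or a multiplier.

```python
def combine(tokens):
    """Combine (kind, value) tokens into one integer.

    kind is "num" for plain numbers (1, 2, 20, ...) and "mult" for
    multipliers (100, 1000, 1000000). Hypothetical representation.
    """
    total = 0    # finished groups (millions, thousands)
    current = 0  # group currently being built (below 1000)
    for kind, value in tokens:
        if kind == "num":
            current += value
        elif value == 100:
            # "hundert" scales the running group; bare "hundert" means 100
            current = max(current, 1) * 100
        else:
            # larger multipliers close the running group
            total += max(current, 1) * value
            current = 0
    return total + current


# dreimillionenzweitausendvierhunderteinundzwanzig
tokens = [("num", 3), ("mult", 1000000), ("num", 2), ("mult", 1000),
          ("num", 4), ("mult", 100), ("num", 1), ("num", 20)]
result = combine(tokens)
```

With numeric values there is no need to rewrite a string step by step; the
intermediate results live in two integers.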

Btw here's our type system for numeric values:

https://github.com/averbis/core-typesystems/blob/master/numeric-value-typesystem/src/main/resources/de/averbis/textanalysis/typesystems/NumericValueTypeSystem.xml


>
> Here are the questions for your rules:
> 1)
>> Before you can apply the dictionaries, you need to split the RutaBasics  
>> using some conjunction words in order to map the subword segments.
> How exactly can I do that? I know there is SPLIT() but that can only split an 
> annotation
> on the basis of another annotation inside it, or do I misunderstand it?
> Because if I could split words then German agglutinated numbers would be no 
> problem (since we have a working solution for English).


In Ruta, you can use simple regex rules for splitting up annotations. If
you have a rule like:
"und" -> ConjunctionFragment;

Then the "und" within the word fünfundzwanzig is annotated with the type
ConjunctionFragment since the simple regex rules are not bound to
annotations at all.
However, as a result, the RutaBasics will be updated. First there was
only one for the W, afterwards there are three. The WORDTABLE operates
on RutaBasic annotations and therefore is able to find "fünf"=5 and
"zwanzig"=20.
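
The effect of such a fragment rule followed by a dictionary lookup can be
mimicked in a few lines of Python. This is only a sketch of the idea, not Ruta:
`NUMBER_WORDS` stands in for the word table, and `split_on_fragment` and
`lookup` are hypothetical helper names.

```python
import re

# Hypothetical dictionary; in the thread this role is played by a WORDTABLE.
NUMBER_WORDS = {"fünf": 5, "zwanzig": 20}


def split_on_fragment(word, fragment="und"):
    """Split a word into segments around every occurrence of `fragment`."""
    segments = []
    last = 0
    for m in re.finditer(re.escape(fragment), word):
        if m.start() > last:
            segments.append(word[last:m.start()])
        segments.append(m.group())
        last = m.end()
    if last < len(word):
        segments.append(word[last:])
    return segments


def lookup(segments):
    """Return the numeric values of all segments found in the dictionary."""
    return [NUMBER_WORDS[s] for s in segments if s in NUMBER_WORDS]


segments = split_on_fragment("fünfundzwanzig")
values = lookup(segments)
```

One caveat of this naive sketch: a bare "und" pattern also matches inside
"hundert" (giving "h", "und", "ert"), so a real rule set would need to guard
against that.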


> 2)
> Is there a special reason, why you use 3 for 'thousand', when you use it with 
> POW(10, x)? Intuitively I would just use 1000.


No, I think someone (me?) thought it would be more elegant.


>
> 3)
> In your "combination with multipliers like 3 million"-rule (Rule 1), you 
> shift the annotation to span over (1,4), should it not be (1,3)?


ah yes, that's a typo.


> 4)
> In Rule 1, is num{IS(NumericValue)-> SHIFT(NumericValue,1,4)} just a 
> different way of writing num:NumericValue{-> SHIFT(NumericValue,1,4)}?


The "num" is the variable of the FOREACH block, which in this case
operates from right to left.

So, all rules of the block are performed on each NumericValue
successively. It is a bit more like an FST. The reverse order was
selected due to some calculations.

Your second rule would be performed on all NumericValue before the next
rule is executed.
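
The difference in execution order can be pictured with two nested loops. The
following Python sketch is only an analogy under that assumption, not Ruta
semantics in full; `run_foreach`, `run_rule_by_rule` and the rule names are
made up for the illustration.

```python
def run_foreach(items, rules):
    """All rules are tried on one item before moving on (right to left)."""
    log = []
    for item in reversed(items):
        for rule in rules:
            log.append((item, rule))
    return log


def run_rule_by_rule(items, rules):
    """Each rule visits every item before the next rule starts."""
    log = []
    for rule in rules:
        for item in items:
            log.append((item, rule))
    return log


items = ["n1", "n2"]          # hypothetical NumericValue annotations
rules = ["rule1", "rule2"]    # hypothetical rules of the block
foreach_order = run_foreach(items, rules)
plain_order = run_rule_by_rule(items, rules)
```

In the first ordering a later rule already sees what earlier rules did to the
same annotation before any other annotation is touched, which is what makes
the block behave a bit like an FST.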


>
> 5)
> What exactly is the function of the NEAR() in your Rule 1? Is it there to 
> match only "3", "3-Million" and "3-Million" but not "3-"?


Yes.

(Actually, I would not use NEAR here)


> 6)
> I tried to play Rule 1 through in my head with "zweitausendeins" and 
> "dreimillionenzweitausendeins":
> This works well for the first example


This rule was maybe not a good example after all.

I have to check it in the context of the block, but AFAIR it would not
be applied for these examples in our rule set (but others).


Best,


Peter


> (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
> //value = 2
>
>   (Multiplicator{-> num.value = (2 * (POW(10,3)))}
> //value = 2000
> add2:NumericValue?{-> num.value = (2000 + 1), UNMARK(add2)}));
> //value = 2001
>
>
> But fails for the second:
>
> (num{IS(NumericValue)-> SHIFT(NumericValue,1,4)}
> //value = 3
>
>   (Multiplicator{-> num.value = (2 * (POW(10,6)))}
> //value = 300
> add2:NumericValue?{-> num.value = (300 + 2), UNMARK(add2)})
> //value = 302, after 1st iteration
>
>   (Multiplicator{-> num.value = (302 * (POW(10,3)))}
> //value = 302000
> add2:NumericValue?{-> num.value = (302000+ 1), UNMARK(add2)}));
> //value = 302001
>
> On 28.08.19, 13:48, "Peter Klügl"  wrote:
>
> Hi,
>
>
> we (Averbis) have an annotator which does exactly what you describe, but
> unfortunately I cannot share it. However, I can tell that the annotator
> is almost completely implemented in Ruta and uses no Ruta language
> extensions.
>
>
> If you want to learn more about language extensions, then there are
> ex