Re: Ruta - MARKFAST

2014-06-30 Thread Peter Klügl
On 30.06.2014 15:31, Peter Klügl wrote:
> On 30.06.2014 14:58, armin.weg...@bka.bund.de wrote:
>> Hi, Peter!
>>
>> I got that. I restricted MARKFAST to segments. It works almost
>> perfectly. How does MARKFAST match things? Using
>>
>> Document{-> MARKFAST(MyType, {"a", "b", "a b"})};

Well, on second thought, it is clear: the matching process only considers
the longest match. I don't think that returning all matches is currently
supported, but it should not be complicated to add the functionality. You
can open a feature request if you want.
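
The longest-match behavior can be illustrated with a small Python sketch. It is only a toy model that reproduces the result reported in this thread ("a b" and "b", but not "a"); it is not Ruta's implementation. The function name mark_fast and the whitespace tokenization are assumptions made for the illustration.

```python
def mark_fast(tokens, entries):
    """Toy model: for every start token, keep only the longest matching entry."""
    # Entries are stored as token tuples, e.g. "a b" -> ("a", "b").
    entry_set = {tuple(e.split()) for e in entries}
    max_len = max(len(e) for e in entry_set)
    matches = []
    for start in range(len(tokens)):
        longest = None
        for length in range(1, max_len + 1):
            candidate = tuple(tokens[start:start + length])
            if len(candidate) == length and candidate in entry_set:
                longest = candidate  # a longer hit replaces a shorter one
        if longest is not None:
            matches.append(" ".join(longest))
    return matches


print(mark_fast(["a", "b"], ["a", "b", "a b"]))  # ['a b', 'b']
```

At the first token the longer entry "a b" replaces the shorter "a", which is why "a" on its own is never reported.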

Peter

>> on
>>
>> a b
>>
>> yields
>>
>> "a b" and "b" but not "a".
>>
>> I would like to have "a" as well. Can this be done?
>>
>> By the way: I love Ruta.apply(). That is exactly what I needed.
>>
>> Thanks,
>> Armin



Re: Ruta - MARKFAST

2014-06-30 Thread Peter Klügl
On 30.06.2014 14:58, armin.weg...@bka.bund.de wrote:
> Hi, Peter!
>
> I got that. I restricted MARKFAST to segments. It works almost
> perfectly. How does MARKFAST match things? Using
>
> Document{-> MARKFAST(MyType, {"a", "b", "a b"})};

hehe... I didn't even remember that this is possible. I will open an
issue for string lists.

The normal application of MARKFAST is with word lists:

WORDLIST MyList = 'somelist.txt';
Document{-> MARKFAST(MyType, MyList)};

... where the file somelist.txt contains something like:

a
b
a b

Files with the extensions "twl" and "mtwl" are for compiled dictionaries.

Just to mention:
Using characters in the word list that are filtered when the dictionary
lookup is applied may cause unexpected behavior, because the algorithm
may choose the wrong subtree. This has happened once in our applications
so far.

Best,

Peter



>
> on
>
> a b
>
> yields
>
> "a b" and "b" but not "a".
>
> I would like to have "a" as well. Can this be done?
>
> By the way: I love Ruta.apply(). That is exactly what I needed.
>
> Thanks,
> Armin




AW: Ruta - MARKFAST

2014-06-30 Thread Armin.Wegner
Hi, Peter!

I got that. I restricted MARKFAST to segments. It works almost perfectly. 
How does MARKFAST match things? Using

Document{-> MARKFAST(MyType, {"a", "b", "a b"})};

on

a b

yields

"a b" and "b" but not "a".

I would like to have "a" as well. Can this be done?

By the way: I love Ruta.apply(). That is exactly what I needed.

Thanks,
Armin
 







Re: Ruta - MARKFAST

2014-06-30 Thread Peter Klügl
Hi,

On 30.06.2014 11:32, armin.weg...@bka.bund.de wrote:
> Hello!
>
> On which annotation type does MARKFAST work?

It is applied to the annotations matched by the rule element that
contains the action.

Document{-> MARKFAST(...)};
... causes a dictionary lookup on the complete document.

Sentence{CONTAINS(...) -> MARKFAST(...)};
... causes a separate dictionary lookup on each of the matched sentences
(i.e., no annotations that span sentence boundaries).
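
The difference between the two scopes can be sketched in Python. This is a toy model under stated assumptions, not Ruta code: the lookup helper, the example entries and the pre-tokenized sentences are all hypothetical, and for simplicity the helper reports every hit instead of only the longest one.

```python
def lookup(tokens, entries):
    """Return every (start, end) token span whose text is a dictionary entry."""
    entry_set = {tuple(e.split()) for e in entries}
    return [
        (start, end)
        for start in range(len(tokens))
        for end in range(start + 1, len(tokens) + 1)
        if tuple(tokens[start:end]) in entry_set
    ]


entries = ["New York", "York University"]
sentences = [["I", "like", "New", "York"], ["University", "starts", "soon"]]

# Document scope: one lookup over all tokens.
document = [tok for sent in sentences for tok in sent]
doc_hits = lookup(document, entries)

# Sentence scope: one independent lookup per sentence.
sentence_hits = [lookup(sent, entries) for sent in sentences]

print(doc_hits)       # [(2, 4), (3, 5)]
print(sentence_hits)  # [[(2, 4)], []]
```

With document scope, "York University" is found across the sentence boundary; with one lookup per sentence it cannot be found.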


> Can I restrict MARKFAST to a single annotation Type, say my own token type?

No, but there is an issue that includes this functionality.

UIMA-3775: Fast multi token dictionary matching on feature values

The idea is to apply the dictionary lookup to sequences of feature values
(e.g., lemmas). If the feature represents the covered text, then this
would also support your use case. The issue is not top priority right
now, but if you want, I can try to include it in the next release
(August).

> It would be nice to restrict a Ruta script to a set of annotations by
> giving that set of annotations explicitly, like
>
> Document{-> INPUT(Token, Organization, Location)};

UIMA Ruta follows a different strategy than, e.g., JAPE and its input
specification. The availability and visibility of annotations is not
type-based but coverage-based. This makes complex patterns easy to
specify, but it sometimes complicates things. If one type is set to
invisible (FILTERTYPE), then all annotations of this type and all
annotations of other types that they cover are invisible.

The MARKFAST action operates on the RutaStream, so its lookup is
sensitive to the filtering settings. For example, with the default
settings the lookup ignores whitespace, line breaks and markup. By
extending the set of filtered types, you can also change the behavior
of the dictionary lookup. However, keep in mind that annotations covered
by one of the filtered types are also not accessible to the dictionary.

> All other annotations should be ignored. Is there a way to do this in
> Ruta? Can this be done with FILTERTYPE and RETAINTYPE? How?

Yes, but it depends on the actual occurrences of types in your document.
The easiest way is to filter the types of the annotations that cover the
positions that should be skipped. It's not easy to give a generic
solution for this.

An example:
Your tokenizer creates annotations for words and numbers, but not for
punctuation marks, and you want to apply the dictionary lookup only for
sequences of token annotations skipping punctuation marks.

Document{-> FILTERTYPE(PM)};
Document{-> MARKFAST(...)};
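
The effect of filtering punctuation before the lookup can be sketched in Python. This is a simplified, hypothetical model: tokens are (text, type) pairs, the type names "W" and "PM" are stand-ins, and the coverage-based visibility described above is reduced to a plain per-token type check.

```python
def visible(tokens, filtered_types):
    """Drop every (text, type) token whose type is filtered."""
    return [text for text, kind in tokens if kind not in filtered_types]


def contains(stream, entry):
    """True if the entry's tokens occur contiguously in the visible stream."""
    words = entry.split()
    return any(stream[i:i + len(words)] == words
               for i in range(len(stream) - len(words) + 1))


tokens = [("Apache", "W"), (",", "PM"), ("UIMA", "W"), ("Ruta", "W")]

print(contains(visible(tokens, set()), "Apache UIMA"))   # False
print(contains(visible(tokens, {"PM"}), "Apache UIMA"))  # True
```

Once the punctuation token is filtered, the two words become adjacent in the visible stream, so the multi-token entry matches.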


There are plans to extend and modify the concept of accessibility and
visibility in UIMA Ruta sometime (>= 3.0.0). Any wishes and opinions are
welcome :-)



Best,

Peter


>
>
> Cheers,
> Armin
>




Ruta - MARKFAST

2014-06-30 Thread Armin.Wegner
Hello!

On which annotation type does MARKFAST work? Can I restrict MARKFAST to a 
single annotation type, say my own token type? It would be nice to restrict a 
Ruta script to a set of annotations by giving that set of annotations 
explicitly, like

Document{-> INPUT(Token, Organization, Location)};

All other annotations should be ignored. Is there a way to do this in Ruta? Can 
this be done with FILTERTYPE and RETAINTYPE? How?

Cheers,
Armin





Re: AW: Ruta - MARKFAST

2013-05-23 Thread Marshall Schor

On 5/23/2013 9:03 AM, armin.weg...@bka.bund.de wrote:
> Hello Jörn,
>
> absolutely right. But for now I'm still a nooby. That's why I'm asking so 
> much.

Sometimes, noobies make better contributions, because they write for other
noobies :-).  I would encourage you to contribute anyway.  You can mark up
your contribution with little tags like  etc. to indicate where you're not
sure and where whoever integrates your patch should pay more attention.

-Marshall

>
> Cheers,
> Armin



AW: Ruta - MARKFAST

2013-05-23 Thread Armin.Wegner
Hello Jörn,

absolutely right. But for now I'm still a nooby. That's why I'm asking so much.

Cheers,
Armin






Re: Ruta - MARKFAST

2013-05-23 Thread Jörn Kottmann

On 05/23/2013 01:19 PM, Peter Klügl wrote:
> That is the official documentation. An up-to-date version that describes
> the new features since 2.0.0 can be found in the trunk.
>
> I know that there are many passages and sections that need to be added or
> improved, but it is hard to find enough time for it.

Another way to improve the documentation is to contribute patches for it.
If you use a specific feature of Ruta and know it well enough, just take
10 minutes, write some documentation, open a Jira issue and attach the
patch to it.

Jörn


Re: AW: AW: Ruta - MARKFAST

2013-05-23 Thread Peter Klügl
Hi,

On 23.05.2013 13:06, armin.weg...@bka.bund.de wrote:
> Hello Peter,
>
> Now that I understand it, it's a nice feature.
>
> By the way, where can I find a good documentation of Ruta? I only know of 
> http://people.apache.org/~pkluegl/site/textmarker-current/tools.textmarker.book.html
>  

That is the official documentation. An up-to-date version that describes
the new features since 2.0.0 can be found in the trunk.

I know that there are many passages and sections that need to be added or
improved, but it is hard to find enough time for it.

There is ongoing work by others to improve the description of the Java
integration for use cases in part-of-speech tagging, and we are
planning to provide screencasts for the Ruta Workbench.

Are there any specific passages that should be improved or added? I
easily forget to add important information, since I implemented it.

> and http://tmwiki.informatik.uni-wuerzburg.de/. A more detailed description 
> would be appreciated.

This wiki describes the old version hosted at SourceForge and should no
longer be used as a reference.

Best,

Peter

> Thanks,
> Armin



AW: AW: Ruta - MARKFAST

2013-05-23 Thread Armin.Wegner
Hello Peter,

Now that I understand it, it's a nice feature.

By the way, where can I find good documentation for Ruta? I only know of 
http://people.apache.org/~pkluegl/site/textmarker-current/tools.textmarker.book.html
and http://tmwiki.informatik.uni-wuerzburg.de/. A more detailed description 
would be appreciated.

Thanks,
Armin





Re: AW: Ruta - MARKFAST

2013-05-22 Thread Peter Klügl
Hi,

Yes, this example won't work without changes, because the word list is
sensitive to white space, e.g., it distinguishes between "n.C." and
"n. C.". I know this sounds like a bug, but it is rather a feature.

To solve your problem, you could remove all spaces in your word list,
add "n.Chr." and "v.Chr." (without space) to your word list, or retain
the spaces before calling MARKFAST (Document{-> RETAINTYPE(SPACE)};).

The short explanation is that the action and the word list won't see
any spaces with the default filtering settings, so they check a
candidate like "n.Chr". However, in the trie there is no "h" on the
path that has no space before the "C".
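
The following Python sketch is a toy model of this explanation, not the actual Ruta trie. It assumes a character trie and a greedy walk without backtracking that prefers a direct child and only steps over a space stored in the trie when no direct child exists. The names build_trie and matches are hypothetical.

```python
def build_trie(entries):
    """Build a character trie; the key None marks the end of an entry."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def matches(trie, candidate):
    """Greedy walk without backtracking over a candidate whose spaces were filtered."""
    node = trie
    for ch in candidate:
        if ch in node:
            node = node[ch]          # a direct child is always preferred
        elif " " in node and ch in node[" "]:
            node = node[" "][ch]     # otherwise step over one space stored in the trie
        else:
            return False
    return None in node


short_list = build_trie(["nC.", "v. Chr."])
long_list = build_trie(["n. C.", "n.C.", "n. Chr."])
fixed_list = build_trie(["n. C.", "n.C.", "n. Chr.", "n.Chr."])

print(matches(short_list, "v.Chr."))  # True
print(matches(long_list, "n.Chr."))   # False: "n.C" is taken, which has no "h"
print(matches(fixed_list, "n.Chr."))  # True
```

In this model the entry "n.C." pulls the walk into a subtree that has no "h", which is why adding the space-free variant "n.Chr." to the word list repairs the match.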

Best,

Peter




AW: Ruta - MARKFAST

2013-05-22 Thread Armin.Wegner
Hi Peter,

Your example works perfectly fine. But try this as both word list and input 
document:

nach Christus
nach der Zeitenwende
n. C.
n.C.
nC.
n. Chr.
n. d. Z.
n.d.Z.
unserer Zeit
unserer Zeitrechnung
u. Z.
u.Z.
v. C.
v.C.
vC.
v. Chr.
v. d. Z.
v.d.Z.
vor Christus
vor der Zeitenwende
vor unserer Zeitrechnung
v. u. Z.
v.u.Z.

"n. Chr." and "v. Chr." are not recognized. Do you have the same result?

Cheers,
Armin






Re: Ruta - MARKFAST

2013-05-21 Thread Peter Klügl
Hi,

On 21.05.2013 15:49, armin.weg...@bka.bund.de wrote:
> Hello!
>
> Is there any possibility to match strings like
>
> nC.
> v. Chr.
>
> with MARKFAST?

Yes. Did you observe any problems? I just tested it with:

Wordlist:
nC.
v. Chr.

Input document:
nC.
v. Chr.
n C .
v . Chr.

Script:
PACKAGE uima.ruta.tests;
WORDLIST testList = 'test.txt';
DECLARE Test;
Document{->MARKFAST(Test, testList)};

... creates four annotations of type Test.

Best,

Peter



> Cheers,
> Armin



Ruta - MARKFAST

2013-05-21 Thread Armin.Wegner
Hello!

Is there any possibility to match strings like

nC.
v. Chr.

with MARKFAST?

Cheers,
Armin