On 2020-05-26 12:27, Kevin Brubeck Unhammer wrote:
Xavi Ivars <xavi.iv...@gmail.com> wrote:

* In trimming disadvantage number 1, we're stating that we're OK
having crappy monodixes because we *fix* that later on with trimming. I'm
sure that's where we are now, but as a project that focuses a lot on
providing free (as in speech) language resources that are later used for
many other use cases, I don't feel comfortable with that status. I think
we should aim to have dictionaries that are as correct as possible. And
if we did that, disadvantage number 1 would be smaller (even if it didn't
disappear completely).

This point seems like a distraction. No one puts errors in monodix on
purpose. We do fix errors in monodix (when we find them, and have
time). When we use monodix for other tasks than MT, we find and fix even
more. On the other hand, there's no point in manually going through
every monodix and bloody well searching for errors because there may be
some that may show up if you stop trimming – please spend your time on
something more useful.

But there may also be some confusion as to what is an error. There may
be things in monodixes that don't belong in "regular" dictionaries, but
do belong in monodix – because the goal is building MT systems, not
Dictionaries.

And if your monodix is to be used for other things than MT, you're just
gonna get many more such "weird" entries that all other use-cases need
to filter out. E.g. Giellatekno's Northern Saami analyser (used for MT,
spelling, grammar check etc.) contains several non-normative analyses,
"multiwords" and unusual taggings just for the grammar checker. These
are not included in the FSTs built for other use-cases, but are trimmed
out, mostly using tags (but also bidix, in the case of MT).


A better way of doing this kind of "lexicographic" work would be useful.
In .lexc-based analysers we mostly use comments, but they are very ad hoc.
Some examples:

! Use/MT            - Only use this in MT systems
! Src/Bible         - This word came from the Bible
! Err/Orth          - Orthographic error
! Dial/North        - Northern variant
! Use/kaz-kir       - Only use this in kaz-kir
! Use/Circ          - This causes a cycle
! Dir/LR            - Only analysis
! Dir/RL            - Only generation
! Use/MWE           - Multiword
! Der/Caus          - Derived form by causative
! Use/Arch          - Archaic form
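
As a rough illustration of how comment tags like these could be made less
ad hoc, here is a minimal Python sketch that reads the tags from the
trailing "!" comment of a lexc-style line and drops entries carrying
unwanted tags. The function names (split_entry, filter_entries) and the
sample entries are hypothetical, the tag prefixes are just the ones from
the list above, and the sketch assumes that "!" only ever starts a comment
(it ignores escaped "%!" in lexc). It is not how any existing Apertium or
Giellatekno tool works.

```python
import re

# Tag prefixes taken from the examples above; anything else is ignored.
TAG_RE = re.compile(r"\b(?:Use|Src|Err|Dial|Dir|Der)/[\w-]+")


def split_entry(line):
    """Split a lexc-style line into (entry, tags) at the first '!'.

    Assumption: '!' always starts a comment (escaped '%!' is not handled).
    """
    entry, _, comment = line.partition("!")
    return entry.strip(), set(TAG_RE.findall(comment))


def filter_entries(lines, exclude):
    """Keep entries that carry none of the tags in `exclude`."""
    kept = []
    for line in lines:
        entry, tags = split_entry(line)
        if entry and not (tags & exclude):
            kept.append(entry)
    return kept


# Hypothetical sample entries, for illustration only.
SAMPLE = [
    "kala:kala N ; ! Use/MT",
    "thou:thou Pron ; ! Use/Arch Src/Bible",
    "teh:the Det ; ! Err/Orth Dir/LR",
    "house:house N ;",
]

if __name__ == "__main__":
    # Build a "normative" lexicon: no archaic forms, no orthographic errors.
    for entry in filter_entries(SAMPLE, {"Use/Arch", "Err/Orth"}):
        print(entry)
```

A spellchecker build might exclude Err/Orth and Use/MT, while an MT build
would keep Use/MT; the point is only that the same tagged source could
feed both once the tags are machine-readable.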

Fran

