[Apertium-stuff] Secondary Tag Prefixes

Tino Didriksen Fri, 08 May 2020 07:52:09 -0700

For khannatanmai's GSoC project, secondary tags will be implemented in a
backwards compatible manner. That in itself is indisputable. But there is
a question of how the initial batch of secondary tags should look.


I feel they should be in the form of <sf:cdefg>, as in a very short textual
lower-case prefix, followed by :, followed by whatever value there is. Or
even an upper-case prefix, as in <S:cdefg> or <SF:cdefg>.

spectie wants symbol prefixes in the form of <%:cdefg>.

This much is fact:
- The tags must have something that identifies them as secondary, so that
tools that don't know specifically what to do with them can at least ignore
them. Hence, any tag containing : is considered secondary. So in theory
even <:cdefg> would be secondary - having an initial : would ease parsing.

- There must be something that identifies which type of secondary it is, so
there must be some sort of separator between the type and the value. : is
nice for this as well, which could lead to e.g. <:sf:cdefg> for surface
forms. But if we are going to require a separator anyway, the initial one
is superfluous so it is dropped, even if it slightly complicates parsing,
leading to <sf:cdefg>.

- Empty prefixes are hard to search for and hard to use in many programming
languages, where the empty string == null == 0. Thus, a prefix is required.

- Known prefixes must be registered on the wiki so that a given prefix is
used the same way across all languages.

- There will not be prefix aliases. Those would lead to many implementation
problems.

So, tags must be in the form of regex <.+:.+> at least. The question is
what should the first .+ match.
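
To make the rule concrete, here is a minimal Python sketch of how a tool might recognise a secondary tag and split it into prefix and value. The names SECONDARY_TAG, is_secondary and split_secondary are hypothetical, not part of any Apertium tool. The pattern is deliberately stricter than <.+:.+>: it assumes tags never contain nested angle brackets, and it splits on the first : so that values may themselves contain colons.

```python
import re

# Assumption: a tag is everything between one "<" and the next ">".
# Prefix: one or more characters that are not ":", "<" or ">".
# Value: one or more characters that are not "<" or ">" (":" is allowed).
SECONDARY_TAG = re.compile(r"<([^<>:]+):([^<>]+)>")


def is_secondary(tag):
    """Return True if the whole string is a single secondary tag."""
    return SECONDARY_TAG.fullmatch(tag) is not None


def split_secondary(tag):
    """Return (prefix, value) for a secondary tag, or None otherwise."""
    match = SECONDARY_TAG.fullmatch(tag)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With this pattern <sf:cdefg> and <%:cdefg> both count as secondary, <n> does not, and <:cdefg> is rejected because its prefix is empty.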

spectie likes symbols and finds those easier to read. I find the exact
opposite, that symbols are hard to read.

But the objective differences are:
- Symbols are harder to type, especially on foreign or compact keyboards.
- Symbols are harder to use in regexes.
- Symbols are not even remotely self-documenting, but textual ones are. One
can read <sf:cdefg> and reason that it means surface form, or <t:span> and
reason it means tag. One cannot reason about <%:cdefg> or <!:span>.
- Symbols are limited to the few that everyone can actually type - it's a
tiny namespace.
- Because symbols are limited, someone will eventually want to use a
secondary feature that only matters for a limited set of languages and thus
won't want to take up a symbol prefix, so they will use a textual prefix.
It would be best to not mix prefix styles, hence using textual prefixes
everywhere makes sense.

Examples:
- Lower-case prefixes:
отец<n><sg><gen><@subj><§agent><sf:отца><s:human><s:kin><t:a:ef31><t:span:fcd32>

- Upper-case prefixes:
отец<n><sg><gen><@subj><§agent><S:отца><E:human><E:kin><T:a:ef31><T:span:fcd32>

- Symbol prefixes:
отец<n><sg><gen><@subj><§agent><%:отца><£:human><£:kin><!:a:ef31><!:span:fcd32>
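
As a rough illustration of how a consumer could handle any of the three styles above, the following Python sketch separates an analysis into lemma, primary tags, and secondary tags grouped by prefix. The function name split_analysis is hypothetical, and the sketch assumes a single analysis with no escaped angle brackets. A tool that does not understand a given prefix can simply skip that group.

```python
import re
from collections import defaultdict

# Assumption: every tag is "<...>" with no nested or escaped angle brackets.
TAG = re.compile(r"<([^<>]+)>")


def split_analysis(analysis):
    """Split one analysis into (lemma, primary tags, secondary tags)."""
    lemma = analysis.split("<", 1)[0]
    primary = []
    secondary = defaultdict(list)
    for body in TAG.findall(analysis):
        # Split on the first ":" only, so values may contain ":" themselves.
        prefix, sep, value = body.partition(":")
        if sep and prefix and value:
            secondary[prefix].append(value)
        else:
            primary.append(body)
    return lemma, primary, dict(secondary)


lemma, primary, secondary = split_analysis(
    "отец<n><sg><gen><@subj><§agent><sf:отца><s:human><s:kin>"
    "<t:a:ef31><t:span:fcd32>"
)
```

For the lower-case example this yields the lemma отец, the primary tags n, sg, gen, @subj and §agent, and secondary values grouped under sf, s and t.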

From a technical and scientific basis, textual prefixes are just better.
And yet, spectie wants symbol prefixes because he likes them. I disagree.
Hence, this mail asking for opinions.

Do you language developers actually prefer symbol prefixes?

-- Tino Didriksen

