[Apertium-stuff] Tagset Standardization

Daniel Swanson Mon, 06 Mar 2023 12:36:26 -0800

Greetings, Apertiumers!

This morning I set out to change the Ancient Hebrew analyzer from
Latin script to Hebrew script (a task I don't wish upon anyone) and in
the process produced a search-and-replace tool that understands the
structure of several of our source files:
https://github.com/mr-martian/apertium-grep


This script could, without too much trouble, be expanded to cover the
rest of our source files, at which point I would like to propose that
we move towards greater standardization of our tagset:
https://wiki.apertium.org/wiki/List_of_symbols

At minimum, I would like to deal with some of the duplicate tags, such
as impf/imperf, rec/res, v/vblex, and pass/pasv.
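
To make the duplicate-tag cleanup concrete, here is a minimal sketch in
Python. It is not how apertium-grep works; it only rewrites tags in
stream-format text such as ^cantar<vblex><imperf>$, and it does not
handle tags as they are written in .dix or .lexc source files. Which
member of each pair is kept is an assumption made for illustration,
since the choice of canonical form is exactly what is up for discussion.

```python
import re

# Hypothetical canonical choices: the pairs come from the email, but which
# member of each pair survives is an assumption made for this sketch.
CANONICAL = {
    "imperf": "impf",
    "pasv": "pass",
}

# Matches one stream-format tag such as <vblex> and captures its name.
TAG = re.compile(r"<([^<>]+)>")


def normalize_tags(text: str, mapping: dict[str, str] = CANONICAL) -> str:
    """Rename every <tag> whose name is a key in mapping; keep the rest."""
    return TAG.sub(
        lambda m: "<" + mapping.get(m.group(1), m.group(1)) + ">", text
    )


print(normalize_tags("^cantar<vblex><imperf><p3><sg>$"))
# prints ^cantar<vblex><impf><p3><sg>$
```

Tags that are not in the mapping pass through unchanged, so a table
like this can be grown one agreed pair at a time.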

My preference would be that we also consider splitting compound tags,
such as the tense+mood tags (fti, fts, pii, pis) and maybe the possessor
and subject tags (px1sg, s_1sg). And if we wanted to go really crazy, we
could consider a broader rewrite like changing our tags to UD-style
feature-value pairs (so <sg> becomes <Number=Sing>), though I don't
imagine we actually want to go nearly that far.
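
A rough sketch of what splitting might look like on stream-format
analyses, again in Python and again not taken from apertium-grep. The
replacement names in SPLIT (fut, impf, ind, sbj) are placeholders
invented for this example, not a proposal. UD_STYLE shows the more
radical option; only <sg> becoming <Number=Sing> comes from the text
above, and the Tense/Mood values are assumed from the UD feature
inventory.

```python
import re

# Hypothetical split of the tense+mood tags; the names on the right are
# placeholders chosen for illustration, not agreed tags.
SPLIT = {
    "fti": ["fut", "ind"],
    "fts": ["fut", "sbj"],
    "pii": ["impf", "ind"],
    "pis": ["impf", "sbj"],
}

# Hypothetical UD-style rewrite; only sg -> Number=Sing is from the email.
UD_STYLE = {
    "sg": ["Number=Sing"],
    "fti": ["Tense=Fut", "Mood=Ind"],
}

# Matches one stream-format tag such as <fti> and captures its name.
TAG = re.compile(r"<([^<>]+)>")


def rewrite_tags(text: str, mapping: dict[str, list[str]]) -> str:
    """Replace each mapped <tag> with its sequence of new tags."""

    def expand(match: re.Match) -> str:
        new_tags = mapping.get(match.group(1))
        if new_tags is None:
            return match.group(0)  # unmapped tags are left untouched
        return "".join(f"<{tag}>" for tag in new_tags)

    return TAG.sub(expand, text)


print(rewrite_tags("^cantar<vblex><fti><p3><sg>$", SPLIT))
# prints ^cantar<vblex><fut><ind><p3><sg>$
print(rewrite_tags("^cantar<vblex><fti><p3><sg>$", UD_STYLE))
# prints ^cantar<vblex><Tense=Fut><Mood=Ind><p3><Number=Sing>$
```

Mapping each old tag to a list keeps one-to-one renames and one-to-many
splits in the same table.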

So, given that the effort involved in actually making the change is no
longer the limiting factor, what do we want our tagset to be?

Daniel


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
