Re: [Apertium-stuff] Tagset Standardization

Kevin Brubeck Unhammer Tue, 07 Mar 2023 03:07:54 -0800

Daniel Swanson
<awesomeevildu...@gmail.com> wrote:

> Greetings Apertiumers!
>
> This morning I set out to change the Ancient Hebrew analyzer from
> Latin script to Hebrew script (a task I don't wish upon anyone) and in
> the process produced a search-and-replace tool that understands the
> structure of several of our source files:
> https://github.com/mr-martian/apertium-grep


Awesome!

> This script could, without too much trouble, be expanded to cover the
> rest of our source files, at which point I would like to propose that
> we move towards greater standardization of our tagset:
> https://wiki.apertium.org/wiki/List_of_symbols
>
> At minimum, I would like to deal with some of the duplicate tags, like
> impf/imperf, rec/res, v/vblex, pass/pasv, etc.

That would be great! I'll put in a vote for pasv right now.
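
A minimal sketch of what merging duplicate tags could look like, in plain Python rather than as part of apertium-grep. The `CANONICAL` table and the `normalize_tags` helper are hypothetical; only pass -> pasv reflects the vote above, and the direction of the other three merges is picked arbitrarily for illustration.

```python
import re

# Hypothetical canonical choices. Only "pass" -> "pasv" has a vote in this
# thread; the other directions are arbitrary placeholders.
CANONICAL = {
    "pass": "pasv",
    "imperf": "impf",
    "res": "rec",
    "v": "vblex",
}

# One Apertium-style tag: anything between angle brackets.
TAG = re.compile(r"<([^<>]+)>")


def normalize_tags(analysis):
    """Replace each duplicate tag in an analysis string with its canonical form."""
    def swap(match):
        name = match.group(1)
        return "<" + CANONICAL.get(name, name) + ">"
    return TAG.sub(swap, analysis)


print(normalize_tags("^sung/sing<v><pp><pass>$"))
```

Matching on whole tags rather than substrings is what keeps <v> from also rewriting the inside of <vblex>.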

> My preference would be that we also consider splitting compound tags,
> like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> subject tags (px1sg, s_1sg).

It makes sense to split tense and mood, as well as number and person,
but I doubt it can be done automatically – it will require careful
changes to CG and transfer. Might make sense to try it on one language
pair along with the maintainer and see how it goes.

It would be very dangerous to turn <pxsg> into <px><sg> – that would
break lots of CG and transfer rules and possibly lead to more complexity
in tag matching, since you would always have to check for the existence
of <px> wherever you check for <sg>, etc.
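
To illustrate the extra check, here is a small sketch in plain Python (not CG or transfer syntax). The analyses and the helper names `is_singular_compound` and `is_singular_split` are made up for the example, and it assumes the split form would place the possessor's number directly after <px>.

```python
import re


def tags(analysis):
    """Return the tag names of an analysis string, in order."""
    return re.findall(r"<([^<>]+)>", analysis)


def is_singular_compound(analysis):
    # With compound tags, a bare <sg> can only be the word's own number.
    return "sg" in tags(analysis)


def is_singular_split(analysis):
    # After splitting, an <sg> right after <px> belongs to the possessor,
    # so every check for <sg> also has to look at the preceding tag.
    ts = tags(analysis)
    return any(
        t == "sg" and (i == 0 or ts[i - 1] != "px")
        for i, t in enumerate(ts)
    )


compound = "house<n><pl><nom><pxsg>"   # plural noun, singular possessor
split = "house<n><pl><nom><px><sg>"    # same reading after the split

print(is_singular_compound(compound))  # the simple check is still right
print("sg" in tags(split))             # the simple check now misfires
print(is_singular_split(split))        # the context-aware check is needed
```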

> And if we wanted to go really crazy we
> could consider a broader rewrite like changing our tags to UD-style
> feature-value pairs (so <sg> becomes <Number=Sing>), though I don't
> imagine we actually want to go nearly that far.

In principle I like it, but in practice I find it too noisy to be
worthwhile. Feature-value pairs are very useful if there are several
features that can have the same value, e.g. booleans Seen=true or
numbers Age=65, and for linguistics there are some use-cases like
SubjNumber vs ObjNumber, or <sg> vs <pxsg>. But for the majority of the
values for most languages, the value indicates its feature (only Voice
can have the value Passive, only Case can have the value Genitive etc.).
So we'd end up with longer lines and debug output and lots of docs
needing rewriting and – most importantly – people needing retraining,
only for the sake of avoiding sticking an "apertium-to-ud" script in a
pipeline.
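
As a rough sketch of what such a script could do: because most tags already imply their feature, a lookup table is enough. The table below is a small hypothetical sample, not an agreed mapping, and `to_ud_feats` is an invented name; the feature names and values (Number=Sing, Case=Gen, Voice=Pass, Number[psor]=Sing) follow UD conventions.

```python
import re

# Hypothetical sample mapping from Apertium tags to UD feature-value pairs.
TAG_TO_FEATURE = {
    "sg": ("Number", "Sing"),
    "pl": ("Number", "Plur"),
    "nom": ("Case", "Nom"),
    "gen": ("Case", "Gen"),
    "pasv": ("Voice", "Pass"),
    "pxsg": ("Number[psor]", "Sing"),
}


def to_ud_feats(analysis):
    """Build a UD-style FEATS string from the tags of one analysis."""
    feats = []
    for tag in re.findall(r"<([^<>]+)>", analysis):
        if tag in TAG_TO_FEATURE:
            feats.append("%s=%s" % TAG_TO_FEATURE[tag])
    # UD lists features alphabetically, joined by "|"; "_" means none.
    return "|".join(sorted(feats, key=str.lower)) or "_"


print(to_ud_feats("house<n><pl><gen><pxsg>"))
# prints: Case=Gen|Number=Plur|Number[psor]=Sing
```

Tags with no entry in the table, such as the part-of-speech tag <n>, are skipped here; a real converter would route them to the UPOS column instead.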


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff