Daniel Swanson <awesomeevildu...@gmail.com> wrote:

> Greetings Apertiumers!
>
> This morning I set out to change the Ancient Hebrew analyzer from
> Latin script to Hebrew script (a task I don't wish upon anyone) and in
> the process produced a search-and-replace tool that understands the
> structure of several of our source files:
> https://github.com/mr-martian/apertium-grep

Awesome!

> This script could, without too much trouble, be expanded to cover the
> rest of our source files, at which point I would like to propose that
> we move towards greater standardization of our tagset:
> https://wiki.apertium.org/wiki/List_of_symbols
>
> At minimum, I would like to deal with some of the duplicate tags, like
> impf/imperf, rec/res, v/vblex, pass/pasv, etc.

That would be great! I'll put in a vote for pasv right now.

> My preference would be that we also consider splitting compound tags,
> like the tense+mood (fti, fts, pii, pis) and maybe possessor and
> subject tags (px1sg, s_1sg).

It makes sense to split tense and mood, as well as number and person,
but I doubt it can be done automatically – it will require careful
changes to CG and transfer. It might make sense to try it on one
language pair along with the maintainer and see how it goes.

It would be very dangerous to turn <pxsg> into <px><sg> – that would
break lots of CG and transfer rules, and possibly lead to more
complexity in tag matching, since you would now always have to check
for the existence of <px> wherever you check for <sg>, etc.

> And if we wanted to go really crazy we
> could consider a broader rewrite like changing our tags to UD-style
> feature-value pairs (so <sg> becomes <Number=Sing>), though I don't
> imagine we actually want to go nearly that far.

In principle I like it, but in practice I find it too noisy to be
worthwhile. Feature-value pairs are very useful if there are several
features that can have the same value, e.g. booleans Seen=true or
numbers Age=65, and for linguistics there are some use-cases like
SubjNumber vs ObjNumber, or <sg> vs <pxsg>. But for the majority of the
values for most languages, the value indicates its feature (only Voice
can have the value Passive, only Case can have the value Genitive,
etc.).

So we'd end up with longer lines and debug output and lots of docs
needing rewriting and – most importantly – people needing retraining,
only for the sake of avoiding sticking an "apertium-to-ud" script in a
pipeline.
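
To illustrate the "apertium-to-ud" idea, here is a minimal sketch of
what such a filter could look like, assuming each tag maps to exactly
one feature=value pair. It is not an existing Apertium tool: the
function name to_ud and the mapping table are hypothetical, and the
handful of entries are only examples, not a checked tagset.

```python
import re

# Illustrative entries only (an assumption, not an official table); a
# real mapping would have to cover the whole List_of_symbols page and
# be verified per language.
TAG_TO_UD = {
    "sg": "Number=Sing",
    "pl": "Number=Plur",
    "gen": "Case=Gen",
    "pasv": "Voice=Pass",
    "pxsg": "Number[psor]=Sing",
}

# Matches one angle-bracketed tag such as <sg>.
TAG_RE = re.compile(r"<([^<>]+)>")


def to_ud(analysis: str) -> str:
    """Rewrite known <tag>s as <Feature=Value>; leave unknown tags as-is."""
    def repl(match: re.Match) -> str:
        tag = match.group(1)
        return "<" + TAG_TO_UD.get(tag, tag) + ">"
    return TAG_RE.sub(repl, analysis)


print(to_ud("house<n><sg><gen>"))
# prints: house<n><Number=Sing><Case=Gen>
```

Because the lookup is per tag, <sg> and <pxsg> stay distinct without
the short tags themselves having to spell out their feature.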

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff