Flammie A Pirinen <flam...@iki.fi> wrote:
> Hi all,
>
> I've written a handful of apertium-fin-* prototypes and I usually end up
> spending way too much time with all the useless subclasses of proper
> nouns we have (cogs, ants, als, tops, orgs, and to top all that,
> sometimes ms and fs for some extra (mis)gendering). Could we just get
> rid of those, or does someone have a good use for them? Most of the time
> it's very random anyway and we aren't really doing NERing or anything.
> I think if these are used in e.g. cg or whatever we should probably have
> a different way of introducing them that doesn't interfere with
> analysis-generation stuff, like we talked about in passing in the last
> apertium zoom meeting? Or is there some smart way to bypass them I
> haven't thought of (probably)

Genders are useful for anaphora resolution and in transfer, though only on person names. There are some place/org names from swe that have genders (originally from SALDO) which bled into other scandipairs. I'd be happy to remove those since they seem quite useless for us.

The <ant>, <cog> and <top> tags are used quite a bit in the nob disambiguator, but not in transfer. I tend to underspecify np's in bidix:

    <e> <p><l>Iran<s n="np"/></l><r>Iran<s n="np"/></r></p></e>
    <e> <p><l>Thiel<s n="np"/></l><r>Thiel<s n="np"/></r></p></e>
    <e> <p><l>Saruman<s n="np"/></l><r>Saruman<s n="np"/></r></p></e>
    <e> <p><l>Contras<s n="np"/></l><r>Contras<s n="np"/></r></p></e>

so just the monodixen need to be synced. If there is an actual bidix-relevant difference, e.g. some place name gets translated but not if it's a person name, then one can specify the tags for just that entry.

The remaining problem is when the analyser gives ^Saruman<np><al>$ and you try to send that into a generator that expects ^Saruman<np><ant>$. We could perhaps use the Giellatekno solution for that, where dixen have RL entries that just contain <np> (i.e. no cog/ant/al), and some transfer step cleans off the tags. Should be a fairly simple change, and it's tried and tested in giella-pairs. Since lttoolbox is used mostly for languages where np pardefs are small, adding the RL's is like max 10 extra lines; for languages requiring hfst it's probably a fairly simple twol or xfregex rule?
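
To make the "cleans off the tags" step concrete, here is a minimal Python sketch of what it would do to the stream. This is only an illustration and not how the giella-pairs implement it; in a real pair it would presumably live in a transfer rule. The function name strip_np_subclasses and the tag list are assumptions (the list is just the subclasses mentioned in this thread), and the sketch assumes the subclass tag comes directly after <np>. Gender tags are deliberately left alone.

```python
import re

# Proper-noun subclass tags named in this thread (an assumed, not exhaustive, list).
NP_SUBCLASSES = ("ant", "cog", "al", "top", "org")

# Match <np> followed directly by one or more subclass tags.
_NP_SUBCLASS_RE = re.compile(r"(<np>)(?:<(?:%s)>)+" % "|".join(NP_SUBCLASSES))


def strip_np_subclasses(stream: str) -> str:
    """Hypothetical helper: drop np subclass tags, keep <np> and everything else."""
    return _NP_SUBCLASS_RE.sub(r"\1", stream)


if __name__ == "__main__":
    print(strip_np_subclasses("^Saruman<np><al>$"))
    print(strip_np_subclasses("^Saruman<np><ant><m><sg>$"))
```

With the tags cleaned like this, the generator only needs the generation-only (RL) entries that take a bare <np>, so it no longer matters whether the analyser said <al> or <ant>.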
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff