Congrats on the release! And that documentation is impressive :)
> 1) We have a serious problem in the translation from Gascon into French.
> The basic issue is that some Gascon speakers use something called
> enunciatives and others do not. These enunciatives, when they are used, are
> found in every sentence and, what is worse, they are homographs with other
> words of very high frequency. At present, we take it for granted that
> Gascon sentences have an enunciative. The problem is that if they do not,
> the disambiguator tends to assign the enunciative function to homographs
> because, by definition, there must be at least one enunciative in every
> sentence.

(With the caveat that I have no idea what enunciatives are), one option
might be to set a variable in CG if you find evidence that the text doesn't
use enunciatives, and then for the remainder of the text remove enunciative
("enon") readings if the variable is set.

If every sentence of an enon speaker must have an enon, then finding a
sentence without one would be evidence that they don't speak enon:

    SETVARIABLE (non-enon) (1) (*) IF (NEGATE 0* (enon)) ;

If you know that "que" can't be enon before "xyzzy", you could prepend that
rule with

    "<que>" REMOVE (enon) IF (1 ("xyzzy")) ;

and so on, so that the SETVARIABLE rule is more likely to hit. Then just

    REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

which will keep removing enon readings for all following sentences of the
text being translated.

The variable will have to be reset at some point, especially if this is used
in a server (I can't remember if cg-proc already resets all variables on null
flush?) or for corpus runs. At the very least, unset it when a cohort is
unambiguously enon:

    REMVARIABLE (non-enon) IF (0C (enon)) ;

Testing it sounds challenging.

> 2) Occitan is very diverse: not only because of its six major dialects (+
> transition areas + regions outside the borders of France with other contact
> languages), but also because of the internal variation within each of them.
> The example of the Gascon enunciative is just one of the stuff that could
> be mentioned from Gascon alone. It would be interesting to use the system
> implemented for Nynorsk to produce sub-varieties.

Highly recommended. We have 52 preference choices now (that's 2^52 possible
combinations? which I believe may be higher than the number of Nynorsk
users), but with

* only one generator fst
* only one bidix fst

i.e. no compilation slowdown, and a cleaner Nynorsk dix – because we had to
clean up stuff in order to do this (previously the variants "løk" and "lauk"
were separate lemmas, now they're one lemma with a spelling pardef applied).
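
Going back to the enunciative idea under 1): since testing it in CG sounds hard, it may help to first check the intended control flow outside CG. Below is a rough Python simulation of the "sticky variable" logic, not CG-3 itself. Everything in it is an assumption made for illustration: the reading tags `enon`, `cnjsub` and `vblex`, the function names, and the simplification that the "xyzzy" context is matched on the word form rather than the lemma.

```python
from typing import List, Set, Tuple

Cohort = Tuple[str, Set[str]]  # (word form, possible readings); hypothetical layout


def remove_reading(readings: Set[str], tag: str) -> Set[str]:
    """Like CG REMOVE: drop `tag`, but never delete the last remaining reading."""
    if tag in readings and len(readings) > 1:
        return readings - {tag}
    return readings


def disambiguate(sentences: List[List[Cohort]]) -> List[List[Cohort]]:
    """Simulate the four rules, with `non_enon` persisting across sentences."""
    non_enon = False  # plays the role of the CG variable (non-enon)
    output = []
    for sentence in sentences:
        cohorts = [(word, set(readings)) for word, readings in sentence]

        # Context rule: "que" directly before "xyzzy" cannot be enon.
        for i in range(len(cohorts) - 1):
            word, readings = cohorts[i]
            if word == "que" and cohorts[i + 1][0] == "xyzzy":
                cohorts[i] = (word, remove_reading(readings, "enon"))

        if not any("enon" in readings for _, readings in cohorts):
            # No enon reading left anywhere: evidence of a non-enon text.
            non_enon = True
        elif any(readings == {"enon"} for _, readings in cohorts):
            # An unambiguous enon: the text uses enunciatives after all.
            non_enon = False

        if non_enon:
            cohorts = [(w, remove_reading(r, "enon")) for w, r in cohorts]
        output.append(cohorts)
    return output


if __name__ == "__main__":
    evidence = [("que", {"enon", "cnjsub"}), ("xyzzy", {"n"})]
    later = [("que", {"enon", "cnjsub"}), ("canta", {"vblex"})]
    for sentence in disambiguate([evidence, later]):
        print(sentence)
```

In this toy version, the first sentence loses its only enon reading through the context rule, which sets the flag, and the second sentence then has its enon reading removed as well. Run on the second sentence alone, the enon reading stays, which is the behaviour a real test corpus would need to cover in both directions.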