Re: [Apertium-stuff] New Occitan-French release

Kevin Brubeck Unhammer Mon, 31 Oct 2022 13:31:17 -0700

Congrats on the release!

And that documentation is impressive :)


> 1) We have a serious problem in the translation from Gascon into French.
> The basic issue is that some Gascon speakers use something called
> enunciatives and others do not. These enunciatives, when they are used, are
> found in every sentence and, what is worse, they are homographs with other
> words of very high frequency. At present, we take it for granted that
> Gascon sentences have an enunciative. The problem is that if they do not,
> the disambiguator tends to assign the enunciative function to homographs
> because, by definition, there must be at least one enunciative in every
> sentence.

(With the caveat that I have no idea what enunciatives are), one option
might be to set a variable in CG if you find evidence that the text
doesn't use enunciatives, and then for the remainder of the text remove
enunciative readings if the variable is set. Writing "enon" for the
enunciative tag: if every sentence from an enon speaker must have at
least one enon, then finding a sentence without one would be evidence
that the author doesn't use them:

  SETVARIABLE (non-enon) (1) (*) IF (NEGATE 0* (enon)) ;

If you know that "que" can't be enon before "xyzzy", you could prepend
that rule with

  "<que>" REMOVE (enon) IF (1 ("xyzzy")) ;

and so on, so that the rule is more likely to hit.

Then just

  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

which will keep removing enon readings in all following sentences of the
translation.

The variable will have to be reset at some point, especially when running
in a server (I can't remember whether cg-proc already resets all variables
on null flush) or for corpus runs. At the very least:

  REMVARIABLE (non-enon) IF (0C (enon)) ;

Testing it sounds challenging.
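
One way to sanity-check the intended behaviour before testing the real
grammar might be a toy simulation of the three rules outside CG. The
sketch below is plain Python, not CG-3, and makes several assumptions that
are mine rather than anything confirmed about the pair: the function name
apply_enon_rules is hypothetical, readings are modelled as plain strings,
"other" is a placeholder for whatever non-enon reading the homograph has,
and the rules run once per sentence in the order given above.

```python
from typing import List, Set

Cohort = Set[str]        # candidate readings of one token
Sentence = List[Cohort]  # one sentence = one CG window (assumption)


def apply_enon_rules(sentences: List[Sentence]) -> List[Sentence]:
    """Toy model of the SETVARIABLE / REMOVE / REMVARIABLE idea."""
    non_enon = False  # stands in for the CG variable "non-enon"
    result = []
    for sentence in sentences:
        # SETVARIABLE: no cohort in the sentence has an enon reading.
        if not any("enon" in cohort for cohort in sentence):
            non_enon = True
        # REMOVE: while the variable is set, drop enon readings,
        # but never remove the last remaining reading of a cohort.
        if non_enon:
            sentence = [(cohort - {"enon"}) or cohort for cohort in sentence]
        # REMVARIABLE: some cohort is unambiguously enon (like 0C).
        if any(cohort == {"enon"} for cohort in sentence):
            non_enon = False
        result.append(sentence)
    return result


text = [
    [{"other", "enon"}, {"verb"}],  # ambiguous, variable not yet set
    [{"noun"}, {"verb"}],           # no enon at all: variable gets set
    [{"other", "enon"}, {"verb"}],  # enon reading removed here
    [{"enon"}, {"verb"}],           # unambiguous enon: variable reset
    [{"other", "enon"}, {"verb"}],  # ambiguous again, left untouched
]
out = apply_enon_rules(text)
print(out[2][0], out[4][0])
```

Only the third sentence loses its enon reading in this run; whether that
ordering matches what cg-proc does across windows is exactly the part that
would need checking against real data.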

> 2) Occitan is very diverse: not only because of its six major dialects (+
> transition areas + regions outside the borders of France with other contact
> languages), but also because of the internal variation within each of them.
> The example of the Gascon enunciative is just one of the stuff that could
> be mentioned from Gascon alone. It would be interesting to use the system
> implemented for Nynorsk to produce sub-varieties.

Highly recommended. We have 52 preference choices now (that's 2^52
possible combinations? which I believe may be higher than the number of
Nynorsk users), but with

* only one generator fst
* only one bidix fst

i.e. no compilation slowdown, and a cleaner Nynorsk dix, because we had
to clean up stuff in order to do this (previously the variants "løk" and
"lauk" were separate lemmas; now they're one lemma with a spelling
pardef applied).


