Re: [Apertium-stuff] New Occitan-French release

2022-11-05 Thread Hèctor Alòs i Font
Message from Kevin Brubeck Unhammer on Fri, 4 Nov 2022 at 11:31:
>
> What if you do
>
> lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | …
>
> The first CG step would output a stream variable, so that what the next
> step sees is
>
> []
> ^que/que/que$
> [more text here]
>
> If the next step is CG, it's just
>
>  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;
>
> ie. remove enunciatives whenever the var is set.

I see. Yes, this is much easier than I thought. Thanks, Kevin (and Tino
for the second mail on the matter).

@Aure Séguier, I think this is the solution, and Tino has added
the syntax explanations. When you have time, you can write the rules
for enondetect.rlx, as you proposed. You'll do it better than me.
Adding this step to modes.xml for Gascon is trivial with Kevin's
system (for Languedocien it is not necessary). I don't think you need
many rules; it is more a matter of having a slightly wide window. Bearing
in mind that in Gascon texts that use enunciatives, they must be
found in every sentence, I don't think a very wide window is
needed to find cases that let us conclude without any doubt
whether or not we are dealing with a text with enunciatives.

Hèctor


___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Hèctor Alòs i Font
Message from Aure Séguier on Fri, 4 Nov 2022 at 15:00:

> Hi,
>
> I can help to write rules to tell whether or not a text contains enunciatives.
>
> As for recognising which variety of Occitan we are translating, we are
> currently developing a tool that can differentiate every dialect of
> Occitan, but it isn't very effective. Between Gascon and other dialects,
> it's OK, because Gascon is so different (except for Aranese, which is a
> subdialect of Gascon). But between Languedocien and other varieties
> (Provençal, Limousin...) there is a lot of confusion.
>
> At first, I was thinking about adding an "all dialects" dialect for the
> oc->fr direction. It would be useful when people don't know the dialect of
> a text, or for texts with many dialects (e.g. a website with articles in
> many dialects, like newspapers). Has that already been done
> for another language? Could it be done easily?
>

This is the way it works for Catalan and Portuguese. They use the v tag
instead of the alt tag in the dictionaries. The people who initially
developed Occitan in Apertium preferred not to do so. Occitan is too
diverse. Each variety already has a lot of very frequent homographs because
the spelling rules have nothing to distinguish them (unlike French,
Spanish, Italian, Catalan...). But when several varieties are added, the
problem is much bigger. Think of the Provençal article. If we know that the
text is Provençal, disambiguation is much easier. Or if we know that it is
Gascon with enunciatives, we also know what we can find, etc. I myself
immediately switch to a "Gascon" mode when I read it because its syntax is
quite different from the rest (+ enclitics, + concordance of verb
tenses...). This information is essential for correct disambiguation.


> Thanks
> Aura Séguier, responsabla de projèctes e desvolopaira
> Lo Congrès permanent de la lenga occitana
> Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
> T. +33 (0)5 32 00 00 64
> a.segu...@locongres.org
> www.locongres.org


Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Aure Séguier

Hi,

I can help to write rules to tell whether or not a text contains enunciatives.

As for recognising which variety of Occitan we are translating, we are
currently developing a tool that can differentiate every dialect of
Occitan, but it isn't very effective. Between Gascon and other dialects,
it's OK, because Gascon is so different (except for Aranese, which is a
subdialect of Gascon). But between Languedocien and other varieties
(Provençal, Limousin...) there is a lot of confusion.


At first, I was thinking about adding an "all dialects" dialect for the
oc->fr direction. It would be useful when people don't know the dialect
of a text, or for texts with many dialects (e.g. a website with articles
in many dialects, like newspapers). Has that already been
done for another language? Could it be done easily?


Thanks

Aura Séguier, responsabla de projèctes e desvolopaira
Lo Congrès permanent de la lenga occitana
Ciutat - Creem !, 5-7 rue de la Fontaine, 64000 Pau
T. +33 (0)5 32 00 00 64
a.segu...@locongres.org 
www.locongres.org 





Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Tino Didriksen
On Fri, 4 Nov 2022 at 08:22, Hèctor Alòs i Font wrote:

> 1) We need a first CG process that finds out whether the text has
> enunciatives. Probably it should return somehow 0 or 1. How?
> 2) Depending on this, we will have two slightly different pipes, but
> how? Should the syntax of the modes.xml be expanded to include a kind
> of "if-else"?
>
> More generally, it would be desirable to have a first step that
> recognises from which variety of Occitan we are translating.
> Currently, we force the user to say whether he is translating from
> Languedocien (called "Occitan" in Apertium and "Occitan Languedocien"
> in the translator of the Congrès Permanent de la Lenga Occitana). A
> user does not necessarily know it. When there are two possibilities,
> there is not too much of a problem: try one and, if it doesn't work
> too well, try the other. But when we have four or more variants, it
> will be less obvious. But, for now, the question is to differentiate
> between two Gascon "flavours".
>

We can have a program in the single-pass pipe that will hold on to whole
paragraphs at a time, do some analysis on them, and then emit a SETVAR
stream command (https://visl.sdu.dk/cg3/chunked/streamcmds.html#cmd-setvar)
or similar metadata before them.

CG can by itself do this with lookahead, but it's not optimized for that
task. But making a hold-for-analysis tool is very easy - we just need to
define how big a chunk is. For documents that pass through Transfuse (HTML,
docx, etc.), the division is roughly at the natural paragraph level. But
for corpus streams we may need to just hold X bytes at a time, or use a
combination of the two.
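
As a rough illustration of the hold-for-analysis idea, here is a minimal Python sketch. It is not an existing Apertium or CG-3 tool. It assumes chunks are separated by blank lines, that "analysis" is just a lookup of evidence words supplied by the caller, and that the stream commands are written as shown in the two constants; that marker text is my reading of the CG-3 stream-command documentation and should be checked against the cg-proc version in use. All function names are hypothetical.

```python
import re

# Assumed marker text, modelled on the CG-3 stream-command documentation.
SETVAR_NON_ENON = "[<STREAMCMD:SETVAR:non-enon>]"
REMVAR_NON_ENON = "[<STREAMCMD:REMVAR:non-enon>]"


def split_chunks(text):
    """Split a text into paragraph-sized chunks on blank lines."""
    return [c for c in re.split(r"\n\s*\n", text) if c.strip()]


def has_evidence(chunk, evidence_words):
    """Return True if any of the given evidence words occurs as a token."""
    tokens = set(re.findall(r"\w+", chunk.lower()))
    return not tokens.isdisjoint(evidence_words)


def annotate(text, evidence_words):
    """Hold each chunk, analyse it, then emit a stream command before it."""
    out = []
    for chunk in split_chunks(text):
        if has_evidence(chunk, evidence_words):
            marker = REMVAR_NON_ENON  # chunk looks like it uses enunciatives
        else:
            marker = SETVAR_NON_ENON  # no evidence found: set non-enon
        out.append(marker + "\n" + chunk)
    return "\n\n".join(out)
```

Which words count as reliable evidence is exactly the linguistic question left open in this thread, so the sketch takes them as a parameter instead of hard-coding any.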

-- Tino Didriksen


Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Kevin Brubeck Unhammer
What if you do

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | …

The first CG step would output a stream variable, so that what the next
step sees is

[]
^que/que/que$ 
[more text here]

If the next step is CG, it's just

 REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

ie. remove enunciatives whenever the var is set.

One can also unset it in the middle of the stream (if doing corpus
runs), so output of the enon-detector is

[]
^que/que/que$ 
[more text here]
[]
^que/que/que$
[more text here]

and the REMOVE:var-is-set rule will remove enunciatives in the first
part, not after seeing the REMVARIABLE.


Then the problem of looking several windows ahead is restricted to that
first enon-detector step.




Alternatively, if we assume all the input is in the same language and we
just don't know which one ahead of time, you could
do several passes, where one is a detector pipeline like

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin

that outputs the STREAMCMD and then Apy would grep for that, and insert
the STREAMCMD at the start of the call to the regular pipeline

lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …

That won't automatically work in modes files, and won't work for corpus
tests if the corpus has a mix, but OTOH you could use 'export
AP_SETVAR=non-enon' to force the regular pipeline to insert the
STREAMCMD at the start.
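
The "grep and insert" step of this two-pass idea could look roughly like the Python sketch below. It is only an illustration under assumptions: the regular expression encodes my guess at how a SETVAR stream command appears in the Apertium stream, and find_streamcmd and prepare_input are hypothetical helpers, not part of Apy.

```python
import re

# Assumed shape of a SETVAR stream command in the Apertium stream.
STREAMCMD_RE = re.compile(r"\[<STREAMCMD:SETVAR:[^>\]]+>\]")


def find_streamcmd(detector_output):
    """Return the first SETVAR stream command in the detector output, or None."""
    match = STREAMCMD_RE.search(detector_output)
    return match.group(0) if match else None


def prepare_input(text, detector_output):
    """Prepend the detected stream command, if any, to the text that is
    sent to the regular pipeline."""
    cmd = find_streamcmd(detector_output)
    return text if cmd is None else cmd + text
```

The detector output would come from running the first pipeline on the same text; the result of prepare_input would then be fed to the regular pipeline.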





Re: [Apertium-stuff] New Occitan-French release

2022-11-04 Thread Hèctor Alòs i Font
Message from Tino Didriksen on Thu, 3 Nov 2022 at 15:58:
>
> On Tue, 1 Nov 2022 at 11:45, Kevin Brubeck Unhammer  wrote:
>>
>> Hèctor Alòs i Font wrote:
>>
>> > As for your proposal, I do not yet have sufficient knowledge of CG to fully
>> > understand it. My idea would be to make a first pass through a whole text
>> > to understand if enunciatives are used in it (for example, recognising
>> > other, more infrequent, but more easily recognisable enunciatives). In the
>> > solution you propose, it seems that this knowledge is acquired
>> > progressively, as sentences are translated. I fear that "que" is so messy
>> > that at least the first sentences of a text would have the same problems as
>> > we have now when we translate a Gascon text without enunciatives.
>>
>> That should be possible too, though I'm not sure how feasible it is to
>> get CG to go that far into a text. By default, CG keeps a context of two
>> windows, but that's configurable. It should be possible (perhaps with
>> minor modifications to cg-proc) to read a bunch of sentences and use
>> Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning
>>
>> Tino, have you tried looking ahead several paragraphs, are there any
>> downsides? This should be a fairly simple rule file.
>
>
> The max I've seen in production is 9 windows, but there is no hard limit. 
> Just have to be careful of spanning tests, as they are going to look ahead 
> for every active window. A multi-pass system will perform better, and for 
> this particular task I'd say multi-pass is the correct approach.
>

So I thought, but then:

1) We need a first CG process that finds out whether the text has
enunciatives. Probably it should return somehow 0 or 1. How?
2) Depending on this, we will have two slightly different pipes, but
how? Should the syntax of the modes.xml be expanded to include a kind
of "if-else"?

More generally, it would be desirable to have a first step that
recognises from which variety of Occitan we are translating.
Currently, we force the user to say whether he is translating from
Languedocien (called "Occitan" in Apertium and "Occitan Languedocien"
in the translator of the Congrès Permanent de la Lenga Occitana). A
user does not necessarily know it. When there are two possibilities,
there is not too much of a problem: try one and, if it doesn't work
too well, try the other. But when we have four or more variants, it
will be less obvious. But, for now, the question is to differentiate
between two Gascon "flavours".

Hèctor




Re: [Apertium-stuff] New Occitan-French release

2022-11-03 Thread Tino Didriksen
On Tue, 1 Nov 2022 at 11:45, Kevin Brubeck Unhammer wrote:

> Hèctor Alòs i Font wrote:
>
> > As for your proposal, I do not yet have sufficient knowledge of CG to fully
> > understand it. My idea would be to make a first pass through a whole text
> > to understand if enunciatives are used in it (for example, recognising
> > other, more infrequent, but more easily recognisable enunciatives). In the
> > solution you propose, it seems that this knowledge is acquired
> > progressively, as sentences are translated. I fear that "que" is so messy
> > that at least the first sentences of a text would have the same problems as
> > we have now when we translate a Gascon text without enunciatives.
>
> That should be possible too, though I'm not sure how feasible it is to
> get CG to go that far into a text. By default, CG keeps a context of two
> windows, but that's configurable. It should be possible (perhaps with
> minor modifications to cg-proc) to read a bunch of sentences and use
> Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning
>
> Tino, have you tried looking ahead several paragraphs, are there any
> downsides? This should be a fairly simple rule file.
>

The max I've seen in production is 9 windows, but there is no hard limit.
Just have to be careful of spanning tests, as they are going to look ahead
for every active window. A multi-pass system will perform better, and for
this particular task I'd say multi-pass is the correct approach.

-- Tino Didriksen


Re: [Apertium-stuff] New Occitan-French release

2022-11-01 Thread Hèctor Alòs i Font
Message from Kevin Brubeck Unhammer on Tue, 1 Nov 2022 at 13:46:

> Hèctor Alòs i Font wrote:
>
> > Enunciatives are a kind of adverb placed just before verbs in main
> > clauses (although they can also be found in subordinate clauses). For
> > affirmative clauses, it works like the English reinforcement "do" in "I do
> > like", but it is syntactically compulsory for enunciative users, so it's
> > not seen as a reinforcement. The problem is that for affirmative clauses
> > the enunciative is "que", which can be cnjsub (=that), rel (=that, which),
> > prn.itg (=what, which) and a comparative (=than). Note that cnjsub, rel and
> > prn.itg are often right in front of the verb in Occitan too. For negative,
> > interrogative and exclamatory clauses other words can be used, but also
> > "que"... which makes the whole thing a big mess. (And there are more with
> > dubitative, emphatic, etc. meanings.)
> >
> > As for your proposal, I do not yet have sufficient knowledge of CG to fully
> > understand it. My idea would be to make a first pass through a whole text
> > to understand if enunciatives are used in it (for example, recognising
> > other, more infrequent, but more easily recognisable enunciatives). In the
> > solution you propose, it seems that this knowledge is acquired
> > progressively, as sentences are translated. I fear that "que" is so messy
> > that at least the first sentences of a text would have the same problems as
> > we have now when we translate a Gascon text without enunciatives.
>
> That should be possible too, though I'm not sure how feasible it is to
> get CG to go that far into a text. By default, CG keeps a context of two
> windows, but that's configurable. It should be possible (perhaps with
> minor modifications to cg-proc) to read a bunch of sentences and use
> Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning
>
> Tino, have you tried looking ahead several paragraphs, are there any
> downsides? This should be a fairly simple rule file.
>
> > This sounds perfect for Occitan. Is there documentation on the wiki?
>
> There is! See:
>
> https://wiki.apertium.org/wiki/Dialectal_or_standard_variation#Overlapping_variants


Thanks a lot, Kevin, especially for the new updates!

Best,

Hèctor


Re: [Apertium-stuff] New Occitan-French release

2022-11-01 Thread Kevin Brubeck Unhammer
Hèctor Alòs i Font wrote:

> Enunciatives are a kind of adverb placed just before verbs in main
> clauses (although they can also be found in subordinate clauses). For
> affirmative clauses, it works like the English reinforcement "do" in "I do
> like", but it is syntactically compulsory for enunciative users, so it's
> not seen as a reinforcement. The problem is that for affirmative clauses
> the enunciative is "que", which can be cnjsub (=that), rel (=that, which),
> prn.itg (=what, which) and a comparative (=than). Note that cnjsub, rel and
> prn.itg are often right in front of the verb in Occitan too. For negative,
> interrogative and exclamatory clauses other words can be used, but also
> "que"... which makes the whole thing a big mess. (And there are more with
> dubitative, emphatic, etc. meanings.)
>
> As for your proposal, I do not yet have sufficient knowledge of CG to fully
> understand it. My idea would be to make a first pass through a whole text
> to understand if enunciatives are used in it (for example, recognising
> other, more infrequent, but more easily recognisable enunciatives). In the
> solution you propose, it seems that this knowledge is acquired
> progressively, as sentences are translated. I fear that "que" is so messy
> that at least the first sentences of a text would have the same problems as
> we have now when we translate a Gascon text without enunciatives.

That should be possible too, though I'm not sure how feasible it is to
get CG to go that far into a text. By default, CG keeps a context of two
windows, but that's configurable. It should be possible (perhaps with
minor modifications to cg-proc) to read a bunch of sentences and use
Window Spanning tests https://visl.sdu.dk/cg3/single/#test-spanning

Tino, have you tried looking ahead several paragraphs, are there any
downsides? This should be a fairly simple rule file.

> This sounds perfect for Occitan. Is there documentation on the wiki?

There is! See:
https://wiki.apertium.org/wiki/Dialectal_or_standard_variation#Overlapping_variants






Re: [Apertium-stuff] New Occitan-French release

2022-10-31 Thread Hèctor Alòs i Font
Thanks a lot for the feedback, Kevin. Some comments added in-line.

Message from Kevin Brubeck Unhammer on Mon, 31 Oct 2022 at 23:31:

> Congrats on the release!
>
> And that documentation is impressive :)
>
> > 1) We have a serious problem in the translation from Gascon into French.
> > The basic issue is that some Gascon speakers use something called
> > enunciatives and others do not. These enunciatives, when they are used, are
> > found in every sentence and, what is worse, they are homographs with other
> > words of very high frequency. At present, we take it for granted that
> > Gascon sentences have an enunciative. The problem is that if they do not,
> > the disambiguator tends to assign the enunciative function to homographs
> > because, by definition, there must be at least one enunciative in every
> > sentence.
>
> (With the caveat that I have no idea what enunciatives are), one option
> might be to set a variable in CG if you find evidence that the text
> doesn't use enunciatives, and then for the remainder of the text remove
> enunciative readings if the variable is set. If every sentence of an
> enon speaker must have one enon, then finding a sentence without one
> would be evidence they don't speak enon:
>
>   SETVARIABLE (non-enon) (1) (*) IF (NEGATE 0* (enon)) ;
>
> If you know that "que" can't be enon before "xyzzy", you could prepend
> that rule with
>
>   "" REMOVE (enon) IF (1 ("xyzzy")) ;
>
> and so on, so that the rule is more likely to hit.
>
> Then just
>
>   REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;
>
> which will keep removing for all sentences of the translation.
>
> That will have to be reset at some point, especially if using in server
> (I can't remember if cg-proc already resets all variables on null
> flush?) or for corpus runs. At the very least
>
>   REMVARIABLE (non-enon) IF (0C (enon)) ;
>
> Testing it sounds challenging.
>


Enunciatives are a kind of adverb placed just before verbs in main
clauses (although they can also be found in subordinate clauses). For
affirmative clauses, it works like the English reinforcement "do" in "I do
like", but it is syntactically compulsory for enunciative users, so it's
not seen as a reinforcement. The problem is that for affirmative clauses
the enunciative is "que", which can be cnjsub (=that), rel (=that, which),
prn.itg (=what, which) and a comparative (=than). Note that cnjsub, rel and
prn.itg are often right in front of the verb in Occitan too. For negative,
interrogative and exclamatory clauses other words can be used, but also
"que"... which makes the whole thing a big mess. (And there are more with
dubitative, emphatic, etc. meanings.)

As for your proposal, I do not yet have sufficient knowledge of CG to fully
understand it. My idea would be to make a first pass through a whole text
to understand if enunciatives are used in it (for example, recognising
other, more infrequent, but more easily recognisable enunciatives). In the
solution you propose, it seems that this knowledge is acquired
progressively, as sentences are translated. I fear that "que" is so messy
that at least the first sentences of a text would have the same problems as
we have now when we translate a Gascon text without enunciatives.


>
> > 2) Occitan is very diverse: not only because of its six major dialects (+
> > transition areas + regions outside the borders of France with other contact
> > languages), but also because of the internal variation within each of them.
> > The example of the Gascon enunciative is just one of the things that could
> > be mentioned from Gascon alone. It would be interesting to use the system
> > implemented for Nynorsk to produce sub-varieties.
>
> Highly recommended. We have 52 preference choices now (that's 2^52
> possible combinations? which I believe may be higher than the number of
> Nynorsk users), but with
>
> * only one generator fst
> * only one bidix fst
>
> ie. no compilation slowdown, and a cleaner Nynorsk dix – because we had
> to clean up stuff in order to do this (previously variants "løk" and
> "lauk" were separate lemmas, now they're one lemma with a spelling
> pardef applied).
>

This sounds perfect for Occitan. Is there documentation on the wiki?

Best,
Hèctor





Re: [Apertium-stuff] New Occitan-French release

2022-10-31 Thread Kevin Brubeck Unhammer
Congrats on the release!

And that documentation is impressive :) 

> 1) We have a serious problem in the translation from Gascon into French.
> The basic issue is that some Gascon speakers use something called
> enunciatives and others do not. These enunciatives, when they are used, are
> found in every sentence and, what is worse, they are homographs with other
> words of very high frequency. At present, we take it for granted that
> Gascon sentences have an enunciative. The problem is that if they do not,
> the disambiguator tends to assign the enunciative function to homographs
> because, by definition, there must be at least one enunciative in every
> sentence.

(With the caveat that I have no idea what enunciatives are), one option
might be to set a variable in CG if you find evidence that the text
doesn't use enunciatives, and then for the remainder of the text remove
enunciative readings if the variable is set. If every sentence of an
enon speaker must have one enon, then finding a sentence without one
would be evidence they don't speak enon:

  SETVARIABLE (non-enon) (1) (*) IF (NEGATE 0* (enon)) ;

If you know that "que" can't be enon before "xyzzy", you could prepend
that rule with

  "" REMOVE (enon) IF (1 ("xyzzy")) ;

and so on, so that the rule is more likely to hit.

Then just

  REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

which will keep removing for all sentences of the translation.

That will have to be reset at some point, especially if using in server
(I can't remember if cg-proc already resets all variables on null
flush?) or for corpus runs. At the very least

  REMVARIABLE (non-enon) IF (0C (enon)) ;

Testing it sounds challenging.
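
To reason about how the three rules above interact across sentences, here is a small Python simulation. It is only a sketch of the control flow, not of real CG-3 semantics: each sentence is modelled as a list of cohorts, each cohort as a set of reading labels, and the labels other than "enon" are placeholders. It also assumes the rules run in the order given and that REMOVE never deletes the last reading of a cohort.

```python
def process(sentences):
    """Apply the three rules sentence by sentence, carrying the variable along."""
    non_enon = False
    result = []
    for sentence in sentences:
        cohorts = [set(c) for c in sentence]
        # SETVARIABLE: a sentence with no enon reading at all is evidence
        # that the text does not use enunciatives.
        if not any("enon" in c for c in cohorts):
            non_enon = True
        # REMOVE: while the variable is set, drop enon readings,
        # but never the last remaining reading of a cohort.
        if non_enon:
            cohorts = [c - {"enon"} if len(c) > 1 else c for c in cohorts]
        # REMVARIABLE: an unambiguous enon cohort unsets the variable.
        if any(c == {"enon"} for c in cohorts):
            non_enon = False
        result.append(cohorts)
    return result


example = [
    [{"enon", "cnjsub"}, {"vblex"}],  # variable not yet set: stays ambiguous
    [{"n"}, {"vblex"}],               # no enon reading: variable gets set
    [{"enon", "rel"}, {"vblex"}],     # enon reading is removed
    [{"enon"}, {"vblex"}],            # unambiguous enon: variable is unset
]
for cohorts in process(example):
    print(cohorts)
```

In this toy run the first sentence keeps its ambiguous reading because no evidence has been seen yet, so any sentences before the first piece of evidence are handled as they are today.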

> 2) Occitan is very diverse: not only because of its six major dialects (+
> transition areas + regions outside the borders of France with other contact
> languages), but also because of the internal variation within each of them.
> The example of the Gascon enunciative is just one of the things that could
> be mentioned from Gascon alone. It would be interesting to use the system
> implemented for Nynorsk to produce sub-varieties.

Highly recommended. We have 52 preference choices now (that's 2^52
possible combinations? which I believe may be higher than the number of
Nynorsk users), but with

* only one generator fst
* only one bidix fst

ie. no compilation slowdown, and a cleaner Nynorsk dix – because we had
to clean up stuff in order to do this (previously variants "løk" and
"lauk" were separate lemmas, now they're one lemma with a spelling
pardef applied).
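
For the arithmetic above, a quick check in Python, assuming each of the 52 preferences is an independent on/off choice (the message does not say so explicitly):

```python
# Number of on/off combinations for 52 independent binary preferences.
choices = 52
combinations = 2 ** choices
print(f"{combinations:,}")  # 4,503,599,627,370,496
```

That is about 4.5 quadrillion combinations.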




[Apertium-stuff] New Occitan-French release

2022-10-31 Thread Hèctor Alòs i Font
Hi,

A new version of the French-Occitan translator is ready to be packaged and
will hopefully soon be available on the Apertium site.

The previous version was the result of Claudi Balaguer's 2018 GSoC project. A
one-direction translator from French into Languedocien Occitan was
released. The Occitan dictionary was based on the bidirectional
Occitan-Catalan and Occitan-Spanish translators, which to date still
work as self-contained packages of their own, without using shared
dictionaries.

The current version is bidirectional and bidialectal: Languedocien and
Gascon. It has been done with the Congrès permanent de la lenga occitana,
the organisation in charge of the standardisation of the Occitan language.
The Congrès has made available its dictionaries and collaborated in the
development. Mention must also be made of Daniel Swanson, who has been
developing numerous new utilities that we have used. A version using
additional copyrighted dictionaries is available on the Congrès website:
https://revirada.locongres.com

The architecture of the translator is explained here:
https://wiki.apertium.org/wiki/Paire_Occitan-Fran%C3%A7ais (in French). In
short: it uses a multi-level transfer (8-10 transfer steps), lexical
selection and the separable module (bidix: c. 45,000 entries per dialect,
excluding proper nouns; c. 2,000 word selection rules; c. 1,200 multi-word
rules).

There has been no systematic evaluation of the quality of the translator.
Usability tests show that translations into the two variants of Occitan are
frankly good. In the other direction, from Occitan into French, quality is
good, but lower. The great variety within each of the Occitan variants is a
challenge.

The future of development is unclear, but there are three likely directions.

1) We have a serious problem in the translation from Gascon into French.
The basic issue is that some Gascon speakers use something called
enunciatives and others do not. These enunciatives, when they are used, are
found in every sentence and, what is worse, they are homographs with other
words of very high frequency. At present, we take it for granted that
Gascon sentences have an enunciative. The problem is that if they do not,
the disambiguator tends to assign the enunciative function to homographs
because, by definition, there must be at least one enunciative in every
sentence. The way to solve this could be:

a) automatically recognise whether the input text uses enunciatives, and

b) automatically select the translation with a
Gascon_with_enunciative-French or Gascon_without_enunciative-French mode.

Frankly, I don't have much idea how to do either. Ideas
welcome.

2) Occitan is very diverse: not only because of its six major dialects (+
transition areas + regions outside the borders of France with other contact
languages), but also because of the internal variation within each of them.
The example of the Gascon enunciative is just one of the things that could
be mentioned from Gascon alone. It would be interesting to use the system
implemented for Nynorsk to produce sub-varieties.

3) There is a desire to introduce two more varieties of Occitan, including
Provençal. But this is likely to involve a major overhaul of the system
used so far to manage the varieties.

The reason is that the current system makes massive use of the alt tag in
dictionaries to mark varieties. This is inherited from the first Occitan
translators developed some 15 years ago. This tag is similar to the v tag
used to manage the Catalan and Portuguese varieties, but is more
restrictive. The alt tag makes a dictionary entry visible only for the
variety under consideration, while the v tag makes the entry readable, but
not generable, for the other varieties as well. Alt is useful because the
diversity of Occitan is very large and so is its homography (which poses
very serious problems for morphological disambiguation). But alt is not very
well suited to dealing with transitional varieties. Moreover, it causes a lot
of duplication or near-duplication in dictionaries, which makes them less
readable and manageable. And this is with only two varieties: with four or
more it's going to be terrible. And let's not talk about the compilation
times, which are already too long, since we generate the current four
translators every time we type "make".
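
The difference between alt and v described above can be summarised in a small Python model. This is only a sketch of the behaviour as stated in this message, not of how the dictionary compiler implements it; the Entry class, the function names, the sample lemma and the variety labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    lemma: str
    variety: str  # variety the entry is marked for (hypothetical labels)
    attr: str     # "alt" or "v"


def can_analyse(entry, mode_variety):
    """alt: visible only in its own variety; v: readable from every variety."""
    if entry.attr == "alt":
        return entry.variety == mode_variety
    return True


def can_generate(entry, mode_variety):
    """Under both attributes, an entry is generated only for its own variety."""
    return entry.variety == mode_variety


alt_entry = Entry("example", "gascon", "alt")
v_entry = Entry("example", "gascon", "v")

print(can_analyse(alt_entry, "languedocien"))  # False
print(can_analyse(v_entry, "languedocien"))    # True
print(can_generate(v_entry, "languedocien"))   # False
```

In this model, switching an entry from alt to v changes only what the analyser accepts in the other varieties, which is why v adds homography to every mode.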

Kind regards,
Hèctor