Re: [Apertium-stuff] New Occitan-French release

Kevin Brubeck Unhammer Fri, 04 Nov 2022 01:31:25 -0700

What if you do

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin | cg-proc oci.rlx.bin | …


The first CG step would output a stream variable, so that what the next
step sees is

[<STREAMCMD:SETVARIABLE:non-enon>]
^que/que<enon>/que<itg>$ 
[more text here]

If the next step is CG, it's just

 REMOVE:var-is-set (enon) IF (0 (VAR:non-enon)) ;

i.e. remove enunciatives whenever the variable is set.

One can also unset it in the middle of the stream (if doing corpus
runs), so output of the enon-detector is

[<STREAMCMD:SETVARIABLE:non-enon>]
^que/que<enon>/que<itg>$ 
[more text here]
[<STREAMCMD:REMVARIABLE:non-enon>]
^que/que<enon>/que<itg>$
[more text here]

and the REMOVE:var-is-set rule will remove enunciatives in the first
part, but no longer once it has seen the REMVARIABLE.
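
To make the set/unset behaviour concrete, here is a toy Python
simulation of what the second step does with the stream above. It is
only a sketch of the semantics, not how cg-proc is implemented; the
function names are made up for illustration, and it assumes one
STREAMCMD or one cohort per line, as in the example.

```python
import re

# Patterns for the stream commands and for ^surface/reading/...$ cohorts.
SET_RE = re.compile(r"\[<STREAMCMD:SETVARIABLE:([^>]+)>\]")
REM_RE = re.compile(r"\[<STREAMCMD:REMVARIABLE:([^>]+)>\]")
COHORT_RE = re.compile(r"\^([^$]*)\$")


def _drop_readings(match, tag):
    """Remove readings containing `tag`, but never the last remaining one."""
    surface, *readings = match.group(1).split("/")
    kept = [r for r in readings if tag not in r]
    if not kept:  # mimic CG: REMOVE does not delete the final reading
        kept = readings
    return "^" + "/".join([surface] + kept) + "$"


def remove_enon_when_set(stream, var="non-enon", tag="<enon>"):
    """Toy stand-in for: REMOVE (enon) IF (0 (VAR:non-enon)) ;"""
    active = set()  # names of the variables currently set
    out = []
    for line in stream.splitlines():
        m = SET_RE.search(line)
        if m:
            active.add(m.group(1))
        m = REM_RE.search(line)
        if m:
            active.discard(m.group(1))
        if var in active:
            line = COHORT_RE.sub(lambda c: _drop_readings(c, tag), line)
        out.append(line)  # stream commands are passed through unchanged
    return "\n".join(out)
```

Fed the six-line stream above, the first "que" comes out as
^que/que<itg>$ while the second one keeps both readings.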


Then the problem of looking several windows ahead is restricted to that
first enon-detector step.


----

Alternatively, if we assume all the input is in the same language, and
we just don't know ahead of time which language it is, then you could
do several passes, where one is a detector pipeline like

lt-proc oci.automorf.bin | cg-proc enondetect.rlx.bin

that outputs the STREAMCMD; APy would then grep for that and insert
the STREAMCMD at the start of the call to the regular pipeline

lt-proc oci.automorf.bin | cg-proc oci.rlx.bin | …

That won't automatically work in modes files, and won't work for corpus
tests if the corpus has a mix, but OTOH you could use 'export
AP_SETVAR=non-enon' to force the regular pipeline to insert the
STREAMCMD at the start.
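
A rough Python sketch of that two-pass idea, in case it helps. This is
not existing APy code: run_pipeline and with_detected_variable are
hypothetical helpers, and the regular pipeline is left as a parameter
since its later steps are not spelled out here.

```python
import subprocess

STREAMCMD = "[<STREAMCMD:SETVARIABLE:non-enon>]"

# Detector pass, using the two commands from the pipeline shown earlier.
DETECTOR = [
    ["lt-proc", "oci.automorf.bin"],
    ["cg-proc", "enondetect.rlx.bin"],
]


def run_pipeline(commands, text):
    """Feed text through each command in turn, like a shell pipe."""
    for cmd in commands:
        text = subprocess.run(
            cmd, input=text, capture_output=True, text=True, check=True
        ).stdout
    return text


def with_detected_variable(text, detector_output):
    """Prepend the STREAMCMD to text if the detector pass emitted it."""
    if STREAMCMD in detector_output:
        return STREAMCMD + "\n" + text
    return text


def translate(text, regular_pipeline):
    """Pass 1 detects, pass 2 runs the regular pipeline on the (maybe) marked input."""
    detected = run_pipeline(DETECTOR, text)
    return run_pipeline(regular_pipeline, with_detected_variable(text, detected))
```

The point of keeping with_detected_variable separate is that the
"grep and insert" part can be checked without any of the binaries
installed.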
