Re: [Apertium-stuff] New Occitan-French release

Tino Didriksen Fri, 04 Nov 2022 02:02:36 -0700

On Fri, 4 Nov 2022 at 08:22, Hèctor Alòs i Font <hectora...@gmail.com>
wrote:


> 1) We need a first CG process that finds out whether the text has
> enunciatives. It should probably return 0 or 1 somehow. How?
> 2) Depending on this, we will have two slightly different pipes, but
> how? Should the syntax of the modes.xml be expanded to include a kind
> of "if-else"?
>
> More generally, it would be desirable to have a first step that
> recognises from which variety of Occitan we are translating.
> Currently, we force the user to say whether he is translating from
> Languedocien (called "Occitan" in Apertium and "Occitan Languedocien"
> in the translator of the Congrès Permanent de la Lenga Occitana). A
> user does not necessarily know it. When there are two possibilities,
> there is not too much of a problem: try one and, if it doesn't work
> too well, try the other. But when we have four or more variants, it
> will be less obvious. But, for now, the question is to differentiate
> between two Gascon "flavours".
>

We can have a program in the single-pass pipe that holds on to whole
paragraphs at a time, does some analysis on them, and then emits a SETVAR
stream command (https://visl.sdu.dk/cg3/chunked/streamcmds.html#cmd-setvar)
or similar metadata before them.

CG can by itself do this with lookahead, but it's not optimized for that
task. Making a hold-for-analysis tool is very easy, though; we just need to
define how big a chunk is. For documents that pass through Transfuse (HTML,
docx, etc.), the division is roughly at the natural paragraph level. But
for corpus streams we may need to just hold X bytes at a time, or use a
combination of the two.
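
To make the idea concrete, here is a rough sketch in Python of what such a
hold-for-analysis filter could look like. Everything in it is an assumption
for illustration: the function names, the variable name "enunciatives", the
byte limit, and above all the toy detector, which only measures how often
"que" occurs and is not a real analysis. The exact SETVAR syntax should be
checked against the stream commands page linked above.

```python
from typing import Callable, Iterable, Iterator, List


def toy_has_enunciatives(chunk: List[str]) -> bool:
    """Placeholder detector (hypothetical): flags chunks where 'que' is frequent."""
    words = " ".join(chunk).lower().split()
    return bool(words) and words.count("que") / len(words) > 0.05


def annotate(
    lines: Iterable[str],
    detect: Callable[[List[str]], bool] = toy_has_enunciatives,
    var: str = "enunciatives",   # hypothetical variable name
    max_bytes: int = 4096,       # hypothetical chunk size limit
) -> Iterator[str]:
    """Hold lines until a blank line or max_bytes, then emit metadata + chunk."""
    chunk: List[str] = []
    size = 0

    def flush() -> Iterator[str]:
        if chunk:
            value = 1 if detect(chunk) else 0
            # Assumed stream command syntax; verify against the CG-3 docs.
            yield f"<STREAMCMD:SETVAR:{var}={value}>"
            yield from chunk
            chunk.clear()

    for line in lines:
        if not line.strip():
            # Blank line: a natural paragraph boundary. Pass it through as-is.
            yield from flush()
            size = 0
            yield line
            continue
        chunk.append(line)
        size += len(line.encode("utf-8")) + 1  # +1 for the newline
        if size >= max_bytes:
            # No paragraph break in sight: fall back to the byte limit.
            yield from flush()
            size = 0
    yield from flush()


if __name__ == "__main__":
    import sys

    for out in annotate(l.rstrip("\n") for l in sys.stdin):
        print(out)
```

Later steps in the pipe could then branch on that variable rather than
needing an "if-else" in modes.xml, though how exactly that would be wired up
is still open.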

-- Tino Didriksen

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
