On Fri, 4 Nov 2022 at 08:22, Hèctor Alòs i Font <hectora...@gmail.com> wrote:
> 1) We need a first CG process that finds out whether the text has
> enunciatives. Probably it should return somehow 0 or 1. How?
> 2) Depending on this, we will have two slightly different pipes, but
> how? Should the syntax of the modes.xml be expanded to include a kind
> of "if-else"?
>
> More generally, it would be desirable to have a first step that
> recognises from which variety of Occitan we are translating.
> Currently, we force the user to say whether he is translating from
> Languedocien (called "Occitan" in Apertium and "Occitan Languedocien"
> in the translator of the Congrès Permanent de la Lenga Occitana). A
> user does not necessarily know it. When there are two possibilities,
> there is not too much of a problem: try one and, if it doesn't work
> too well, try the other. But when we have four or more variants, it
> will be less obvious. But, for now, the question is to differentiate
> between two Gascon "flavours".
>

We can have a program in the single-pass pipe that holds on to whole
paragraphs at a time, does some analysis on them, and then emits
https://visl.sdu.dk/cg3/chunked/streamcmds.html#cmd-setvar or similar
metadata before them. CG can do this by itself with lookahead, but it
is not optimized for that task.

But making a hold-for-analysis tool is very easy. We just need to
define how big a chunk is. For documents that pass through Transfuse
(HTML, docx, etc.), the division is roughly on a natural paragraph
level. But for corpus streams we may need to just hold X bytes at a
time. Or a combination thereof.

-- Tino Didriksen
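
A minimal sketch of the hold-for-analysis idea above, in Python. It is
not an existing Apertium or CG-3 tool. The names chunks, has_enunciatives
and annotate, the variable name "enunciative", the 4096-byte limit and
the crude "que" frequency test are all hypothetical placeholders. The
emitted line assumes the <STREAMCMD:SETVAR:name=value> form; check it
against the CG-3 stream command documentation linked above before
relying on it. Chunks end at a blank line or once the byte limit is
reached, which is one way to combine the two division strategies.

```python
import io
from typing import Callable, Iterator, TextIO


def chunks(stream: TextIO, max_bytes: int = 4096) -> Iterator[str]:
    """Yield chunks that end at a blank line or once max_bytes is reached."""
    buf = []
    size = 0
    for line in stream:
        buf.append(line)
        size += len(line.encode("utf-8"))
        if line.strip() == "" or size >= max_bytes:
            yield "".join(buf)
            buf, size = [], 0
    if buf:
        # Flush whatever is left at end of input.
        yield "".join(buf)


def has_enunciatives(text: str) -> int:
    """Placeholder detector (assumption): 1 if "que" is frequent, else 0."""
    words = text.lower().split()
    if not words:
        return 0
    return 1 if words.count("que") / len(words) > 0.05 else 0


def annotate(
    stream: TextIO,
    out: TextIO,
    detect: Callable[[str], int] = has_enunciatives,
    max_bytes: int = 4096,
) -> None:
    """Hold each chunk, analyse it, then write metadata followed by the chunk."""
    for chunk in chunks(stream, max_bytes):
        if chunk.strip():
            # Assumed SETVAR syntax; "enunciative" is a made-up variable name.
            out.write(f"<STREAMCMD:SETVAR:enunciative={detect(chunk)}>\n")
        out.write(chunk)
```

As a filter in a pipe this would be called as annotate(sys.stdin,
sys.stdout). A real detector would work on analysed text rather than raw
word counts, and later stages would branch on the variable instead of
needing an "if-else" in modes.xml.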
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff