Here are several practical examples. I tried to select them for their
variety, so the result is more of a wish list than a structured proposal.

Let's begin with "je la baise". Depending on the context this may be "I
kiss her" or "I fuck her". The context can tell us whether the register is
formal or colloquial. In this case anaphora resolution can also help: if
the pronoun refers to "hand", it can only be "kiss"; if it refers to a
person, the doubt persists.

Another kind of problem is the Arpitan words "chamô" ("camel"; plural
"camels") and "chamôs" ("chamois"; unchanged in the plural). Translating
into French yesterday, I got chamois in a Bible text from Exodus xD  I
solved it by deciding in a CG rule that all "chamôs" (with nothing singular
around them) are camels. (Similar cases in French: fil/fils, foi/fois,
cour/cours.)
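
The heuristic behind that rule can be sketched in Python (not CG syntax).
Everything here is an assumption for illustration: the function name, the
representation of context as sets of number tags, and the size of the
window are hypothetical and not taken from the actual rule.

```python
def chamos_reading(neighbour_numbers):
    """Choose a reading for "chamôs" from the number tags of nearby words.

    neighbour_numbers holds one set of possible number tags ("sg", "pl")
    per neighbouring word, e.g. [{"pl"}, {"sg", "pl"}].
    """
    # A neighbour that can only be singular points to the invariable noun.
    if any(tags == {"sg"} for tags in neighbour_numbers):
        return "chamois"
    # Otherwise fall back to the plural of "chamô".
    return "camel"


print(chamos_reading([{"pl"}, {"sg", "pl"}]))  # camel
print(chamos_reading([{"sg"}]))                # chamois
```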

In French there are plenty of words whose meaning depends on their
gender: livre, page, tour, etc. The problem is that the immediate
context often does not disambiguate: des livres, les pages, de tour,
etc. A similar but slightly different case is word pairs such as homicide
mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide m, etc.:
the one with gender "mf" is the person and the other is the act.

Other problems arise in lexical selection. For instance, as a rule, the
Catalan preposition "de" is translated as "de" in French, but if the
following word is a material, "en" must be selected (de fusta > en bois).
So in the Catalan2French lrx file we have a list of materials, just as we
have a list of countries, a list of musical instruments, a list of animals,
etc. I dream of a monolingual dictionary where we could get this kind of
information. It makes little sense to maintain these lists separately for
every language pair involving Catalan. This information should be in
apertium-cat and not in every apertium-cat-xxx lrx file.
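
A minimal sketch of that idea in Python, with everything hypothetical:
CAT_SEMANTIC_CLASSES stands in for semantic labels stored once in
apertium-cat, and select_de for a pair-specific rule that only consults
them. Neither the names nor the data layout exist in Apertium, and the
entries are only illustrative.

```python
# Hypothetical semantic classes that would live once in the monolingual
# Catalan data instead of in every pair's lrx file.
CAT_SEMANTIC_CLASSES = {
    "fusta": {"material"},
    "ferro": {"material"},
    "gat": {"animal"},
}


def select_de(next_lemma, classes=CAT_SEMANTIC_CLASSES):
    """Pick the French translation of Catalan "de" from the next lemma."""
    if "material" in classes.get(next_lemma, set()):
        return "en"  # de fusta > en bois
    return "de"      # default translation


print(select_de("fusta"))  # en
print(select_de("gat"))    # de
```

Any other pair with Catalan as the source could then reuse the same
classes and keep only its own selection rules.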

Moreover, if we had words not only with different kinds of semantic labels
but also marked as synonyms, it might be possible to translate an unknown
word using a word labelled as its synonym (if that one has a translation)
instead of outputting "unknown".
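
As a rough sketch of such a fallback in Python: BIDIX and SYNONYMS are
hypothetical stand-ins for a bilingual dictionary and for synonym sets in
the monolingual data, and translate is not an existing Apertium function.

```python
# Hypothetical data: a tiny bilingual dictionary and monolingual synonyms.
BIDIX = {"cotxe": "voiture"}
SYNONYMS = {"automòbil": ["cotxe", "vehicle"]}


def translate(lemma, bidix=BIDIX, synonyms=SYNONYMS):
    """Translate a lemma, falling back to its first translatable synonym."""
    if lemma in bidix:
        return bidix[lemma]
    for synonym in synonyms.get(lemma, []):
        if synonym in bidix:
            return bidix[synonym]
    return "*" + lemma  # still unknown: mark it with a leading asterisk


print(translate("automòbil"))  # voiture
print(translate("bicicleta"))  # *bicicleta
```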

Hèctor

Message from Francis Tyers <fty...@prompsit.com> on Mon, 15 June
2020 at 18:26:

> On 2020-06-15 15:02, Xavi Ivars wrote:
> > Hello,
> >
> > To decouple the conversation on how to store secondary information from
> > the use case I had in mind (which can be achieved regardless of how we
> > store and propagate that data), let me explain how I see this
> > functionality working, using some sort of "apertium pipeline
> > trace" (simplified, many tags missing).
> >
> > This is how we currently handle this "mango" issue in spa-cat:
> > changing the "lemma".
> >
> > This is how I envision it. The key points here are a monolingual module
> > that adds the data to the pipeline, and a bilingual module (probably
> > lex-tools?) that makes use of that information to decide the best
> > translation.
> >
> > Please don't look into the exact implementation: for some pieces I
> > don't know exactly which module would be doing the work. Also,
> > please don't take the "secondary tags" form as defining the
> > semantics: I'm using it just for readability in this example but,
> > again, that data could be persisted anywhere.
> >
> > This is why I thought Tanmai's work could be useful for this: if a
> > module can add this data to the stream, a module later in the pipeline
> > (probably apertium-lex-tools, or biltrans itself?) could use it to
> > decide what the right translation is.
> >
> > Does it make sense?
>
> Thanks Xavi for the ideas...
>
> What I've been thinking about is a module that would go after
> biltrans and before lexical selection. It would essentially reweight
> the possible translations based on a bag of words over a fixed
> window of words or "sentences" (delimited with '.').
>
> You could have source and target components, so e.g. you might
> say that "fruit" is a semantic field or domain which includes,
>
> "mango", "manzana", "plátano", "naranja", ...
>
> and
>
> "mango", "taronja", "poma"
>
> in Catalan. These would be in the monolingual pairs. The
> module would take both lists and the input
>
> ^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$
> ^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$
> ^y<cnjcoo>/i<cnjcoo>$
> ^manzana<n><f><pl>/poma<n><f><pl>$
>
> and try to maximise semantic coherence; then it could reweight,
> so e.g.
>
> ^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$
> ^mango<n><m><pl>/mango<n><m><pl><2.0>/mànec<n><m><pl><0.0>$
> ^y<cnjcoo>/i<cnjcoo>$
> ^manzana<n><f><pl>/poma<n><f><pl>$
>
> And pass it to the lexical selection module which will choose the
> one with the highest weight.
>
> This would mean a new module, but it would require only minor
> changes to the bilingual dictionary and lexical selection, and
> wouldn't have any effect on transfer.
>
> Given a few more examples I'm sure I could come up with a mockup of
> how it would work and we could go from there.
>
> Fran
>
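
A possible mockup of the reweighting step quoted above, in Python. This is
only a sketch under assumptions: the scoring (one point for every other
unit whose source lemma is in the same field, plus one for every other unit
with a target lemma in that field) is one guess that happens to reproduce
the weights in the example, the whole input is treated as a single window,
and the names SRC_FIELDS, TRG_FIELDS and reweight are hypothetical.

```python
# Semantic fields from the quoted example; a real module would read them
# from the monolingual data of each language.
SRC_FIELDS = {"fruit": {"mango", "manzana", "plátano", "naranja"}}
TRG_FIELDS = {"fruit": {"mango", "taronja", "poma"}}


def lemma(reading):
    """Return the lemma of a reading such as 'poma<n><f><pl>'."""
    return reading.split("<", 1)[0]


def parse(unit):
    """Split '^src/trg1/trg2$' into the source reading and its targets."""
    source, *targets = unit.strip()[1:-1].split("/")
    return source, targets


def reweight(units):
    """Weight the targets of ambiguous units by semantic-field coherence."""
    parsed = [parse(u) for u in units]
    output = []
    for i, (source, targets) in enumerate(parsed):
        if len(targets) < 2:
            output.append(units[i])  # nothing to choose: pass through
            continue
        others = parsed[:i] + parsed[i + 1:]
        scored = []
        for target in targets:
            score = 0.0
            for field, trg_lemmas in TRG_FIELDS.items():
                if lemma(target) not in trg_lemmas:
                    continue
                src_lemmas = SRC_FIELDS.get(field, set())
                for other_source, other_targets in others:
                    score += lemma(other_source) in src_lemmas
                    score += any(lemma(t) in trg_lemmas for t in other_targets)
            scored.append((score, target))
        scored.sort(key=lambda pair: -pair[0])  # stable: ties keep input order
        ranked = "/".join(f"{t}<{s:.1f}>" for s, t in scored)
        output.append(f"^{source}/{ranked}$")
    return output


units = [
    "^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$",
    "^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$",
    "^y<cnjcoo>/i<cnjcoo>$",
    "^manzana<n><f><pl>/poma<n><f><pl>$",
]
print("\n".join(reweight(units)))
# The second line printed is:
# ^mango<n><m><pl>/mango<n><m><pl><2.0>/mànec<n><m><pl><0.0>$
```

Unambiguous units pass through unchanged, so a later lexical selection
step would only need to read the weights where they are present.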