Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Xavi Ivars
>
> Just because we "can" add information, does not mean we "should".
>

Yes, I agree. But I think the "material" example that Hèctor raised (*for
instance, as a rule, Catalan preposition "de" is translated as "de" in
French, but if the following word is a material, "en" must be selected (de
fusta > en bois)*) is a good one, where transfer (an improved one, for
sure) would also benefit from having that information available.

Message from Francis Tyers on Mon, 15 June 2020 at 18:45:

> [...]


-- 
< Xavi Ivars >
< http://xavi.ivars.me >
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Francis Tyers

On 2020-06-15 17:38, Hèctor Alòs i Font wrote:

> [...]



These are excellent examples. I'm just about to go out, but will address
them when I get back. Thanks for the ideas.

Note that my suggestion was to include this information
in the monolingual packages.

Fran




Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Francis Tyers

On 2020-06-15 17:13, Xavi Ivars wrote:

> Message from Francis Tyers on Mon, 15 June 2020 at 17:26:
>
> > [...]
> >
> > And pass it to the lexical selection module which will choose the
> > one with the highest weight.
> >
> > This would mean a new module, but it would require only minor
> > changes to the bilingual dictionary and lexical selection, and
> > wouldn't have any effect on transfer.
> > [...]
>
> The difference between your approach and mine is that your proposal is
> extremely coupled to the order of the modules in the pipeline. The new
> module would write <2.0> and apertium-lex-tools would need to read and
> remove it from the pipeline.

Doing that part is really trivial.

> Ideally, I'd like to decouple setting the "domain" of a word from
> using it. If something just after the tagger, still as part of the
> "analysis" phase of the translation, puts that information in there,
> then it can be used by "lex-tools", but also by other modules that may
> need it. If we don't do this, multiple modules may need to read the
> "domain list" data to assign the right domain to a given word.


What are the other cases aside from lexical selection where the domain
list would be required? Are there examples of needing to do
morphological disambiguation or transfer differently depending on
semantic domain?

And if this information might help in disambiguation or transfer, would
it help substantially over implementing, e.g., word embeddings for the
tagger and lexical selection?

When I wrote the lexical selection component, I looked into doing
word-sense disambiguation on the source side. I didn't find any evidence
that it would substantially increase translation performance; in other
words, doing WSD without reference to the target language is usually more
trouble than it is worth. I'm open to being convinced, though,
with evidence...

Just because we "can" add information, does not mean we "should".

Fran




Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Hèctor Alòs i Font
Here come several practical examples. I tried to select them for their
variety. The result is more a wish list than something structured.

Let's begin with "je la baise". Depending on the context this may be "I
kiss her" or "I fuck her". The context can tell us whether we are in a
formal or a colloquial register. Another point is that in this case
anaphora resolution can also help us: if the pronoun refers to "hand",
it can only be "kiss"; if it refers to a person, the doubt persists.

Another kind of problem is the Arpitan words "chamô" ("camel"; plural
"chamôs", "camels") and "chamôs" ("chamois"; unchanged in plural). So,
translating into French, yesterday I got chamois in a Bible text from
Exodus xD  I solved it by deciding in a CG rule that all "chamôs" (with
nothing singular around them) are camels. (Similar cases in French:
fil/fils, foi/fois, cour/cours)

In French there are plenty of words with different meanings depending on
their gender: livre, page, tour, etc. The problem is that often the immediate
surrounding context does not disambiguate: des livres, les pages, de tour,
etc. A similar but slightly different case is the word pairs homicide
mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide m, etc.:
the one with the gender "mf" is a person and the other is the action.

Other problems come up in lexical selection. For instance, as a rule, the
Catalan preposition "de" is translated as "de" in French, but if the
following word is a material, "en" must be selected (de fusta > en bois). So
in the Catalan2French lrx file we have a list of materials, as we have a list
of countries, a list of musical instruments, a list of animals, etc. I dream
about a monolingual dictionary where we could get this kind of information.
It makes little sense to repeat these lists in the many language pairs that
use Catalan. This information should be in apertium-cat and not in every
apertium-cat-xxx lrx file.
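
As a rough illustration of the point above (this is Python, not actual lrx
syntax), the sketch below keeps one list of materials that any pair could
consult when choosing the French preposition. The name MATERIALS, every entry
in it other than "fusta", and the function name are assumptions made for the
example; in the proposal the list would ship once with apertium-cat.

```python
# Hypothetical semantic list that would live once in apertium-cat
# rather than being copied into every apertium-cat-xxx lrx file.
MATERIALS = {"fusta", "ferro", "vidre", "pedra", "cotó"}


def french_for_de(next_lemma, materials=MATERIALS):
    """Pick the French translation of Catalan "de" from the following lemma."""
    # Default rule: "de" stays "de"; before a material, "en" is selected.
    return "en" if next_lemma in materials else "de"


print(french_for_de("fusta"))   # material: de fusta > en bois
print(french_for_de("Girona"))  # not a material, so the default applies
```

Any other pair with Catalan as source could import the same set instead of
maintaining its own copy.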

Moreover, if we had words not only with different kinds of semantic labels,
but also marked as synonyms, maybe it'd be possible to give a translation
using a word labeled as a synonym (if it has a translation) instead of
"unknown".

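The synonym fallback could look roughly like the Python sketch below. The two
tables and their entries are invented for the example, as is the function
name; the asterisk prefix for an untranslated word is used here only as an
illustrative marker.

```python
# Hypothetical data: a tiny Catalan-French bilingual dictionary and a
# monolingual synonym table of the kind wished for above.
BIDIX = {"cotxe": "voiture", "casa": "maison"}
SYNONYMS = {"automòbil": ["cotxe"], "habitatge": ["llar", "casa"]}


def translate(lemma, bidix=BIDIX, synonyms=SYNONYMS):
    """Translate a lemma, falling back to the first synonym that has an entry."""
    if lemma in bidix:
        return bidix[lemma]
    for candidate in synonyms.get(lemma, []):
        if candidate in bidix:
            return bidix[candidate]
    # Nothing found: mark the word as unknown (illustrative marker).
    return "*" + lemma


print(translate("automòbil"))  # no direct entry, resolved through "cotxe"
print(translate("paraula"))    # no entry and no synonym, stays unknown
```
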
Hèctor

Message from Francis Tyers on Mon, 15 June 2020 at 18:26:

> [...]


Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Xavi Ivars
Message from Francis Tyers on Mon, 15 June 2020 at 17:26:

>
> [...]
>
> And pass it to the lexical selection module which will choose the
> one with the highest weight.
>
> This would mean a new module, but it would require only minor
> changes to the bilingual dictionary and lexical selection, and
> wouldn't have any effect on transfer.
> [...]


The difference between your approach and mine is that your proposal is
extremely coupled to the order of the modules in the pipeline. The new
module would write <2.0> and apertium-lex-tools would need to read and
remove it from the pipeline.

Ideally, I'd like to decouple setting the "domain" of a word from using it.
If something just after the tagger, still as part of the "analysis" phase of
the translation, puts that information in there, then it can be used by
"lex-tools", but also by other modules that may need it. If we don't do
this, multiple modules may need to read the "domain list" data to assign
the right domain to a given word.
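
A minimal Python sketch of that decoupling, under stated assumptions: the
`<dom:...>` tag is a made-up notation (not an existing Apertium tag), the
domain lists are invented, and the context scoring is deliberately naive. One
function annotates the tagger output once; any later module only needs to
read the tag, never the lists.

```python
import re

# Hypothetical monolingual domain lists (Spanish lemmas).
DOMAINS = {
    "fruit": {"mango", "manzana", "plátano", "naranja"},
    "tool": {"mango", "martillo", "sartén"},
}

LU = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")  # ^lemma<tag><tag>$
DOM_TAG = re.compile(r"<dom:([^>]+)>")          # made-up domain tag


def annotate(stream, domains=DOMAINS):
    """Append a <dom:...> tag to each lexical unit whose domain can be decided."""
    lemmas = [m.group(1) for m in LU.finditer(stream)]

    def tag(m):
        lemma, tags = m.group(1), m.group(2)
        candidates = [d for d, words in domains.items() if lemma in words]
        if not candidates:
            return m.group(0)
        # Score each candidate domain by how many other lemmas it also lists.
        scores = {d: sum(l in domains[d] for l in lemmas if l != lemma)
                  for d in candidates}
        best = max(candidates, key=scores.get)
        if len(candidates) > 1 and list(scores.values()).count(scores[best]) > 1:
            return m.group(0)  # tie between domains: leave unannotated
        return f"^{lemma}{tags}<dom:{best}>$"

    return LU.sub(tag, stream)


def domain_of(lexical_unit):
    """What a consumer (lex-tools, transfer, ...) would call: read the tag only."""
    m = DOM_TAG.search(lexical_unit)
    return m.group(1) if m else None


print(annotate("^querer<vblex>$ ^mango<n>$ ^y<cnjcoo>$ ^manzana<n>$"))
```

The consumer side never touches DOMAINS, which is the point of the proposal:
only the annotating module needs the "domain list" data.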

-- 
< Xavi Ivars >
< http://xavi.ivars.me >


Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

2020-06-15 Thread Francis Tyers

On 2020-06-15 15:02, Xavi Ivars wrote:

> Hello,
>
> To decouple conversations on how to store secondary information from
> the use case I had in mind (that can be achieved regardless of how we
> store and propagate that data), let me explain how I see this
> functionality working, using some sort of "apertium pipeline
> trace" (simplified, many tags missing).
>
> This is how we currently handle this "mango" issue in spa-cat:
> changing the "lemma".
>
> This is how I envision it. The key points here are: a monolingual module
> that adds the data to the pipeline, and a bilingual module (probably
> lex-tools?) that makes use of that information to decide the best
> translation.
>
> Please don't look into the exact implementation: there are pieces where
> I don't know exactly which module would be the one doing things. Also,
> please don't look at the "secondary tags" form to define the
> semantics: I'm using it just for readability in this example but,
> again, that data could be persisted anywhere.
>
> This is why I thought Tanmai's work could be useful for this: if a
> module can add this data to the stream, a module later in the pipeline
> (probably apertium-lex-tools, or biltrans itself?) could use it to
> decide what the right translation is.
>
> Does it make sense?


Thanks Xavi for the ideas...

What I've been thinking about is a module that would go after
biltrans and before lexical selection. It would essentially reweight
the possible translations based on a bag of words over a fixed
window of words or "sentences" (delimited with '.').

You could have source and target components, so e.g. you might
say that "fruit" is a semantic field or domain which includes

"mango", "manzana", "plátano", "naranja", ...

in Spanish, and

"mango", "taronja", "poma"

in Catalan. These would be in the monolingual packages. The
module would take both lists and the input

^querer/voler$
^mango/mànec/mango$
^y/i$
^manzana/poma$

And try and maximise semantic coherence, then it could reweight,
so e.g.

^querer/voler$
^mango/mango<2.0>/mànec<0.0>$
^y/i$
^manzana/poma$

And pass it to the lexical selection module which will choose the
one with the highest weight.

This would mean a new module, but it would require only minor
changes to the bilingual dictionary and lexical selection, and
wouldn't have any effect on transfer.
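
A minimal Python sketch of such a reweighting step, under stated assumptions:
the domain lists are the ones given above, the `<2.0>`/`<0.0>` weight notation
is copied from the example rather than from any existing module, and "semantic
coherence" is reduced to "some other source word in the window belongs to the
same domain". The function and variable names are made up for illustration.

```python
import re

# Domain lists from the example above; in the proposal these would come
# from the monolingual packages.
SRC_DOMAINS = {"fruit": {"mango", "manzana", "plátano", "naranja"}}
TRG_DOMAINS = {"fruit": {"mango", "taronja", "poma"}}

LU = re.compile(r"\^([^$]*)\$")  # one lexical unit: ^source/target/target$


def reweight(stream, src_domains=SRC_DOMAINS, trg_domains=TRG_DOMAINS):
    """Weight ambiguous translations by the domains seen in the window."""
    units = [m.group(1).split("/") for m in LU.finditer(stream)]
    out = []
    for i, (src, *targets) in enumerate(units):
        # Evidence comes from the other source words, not the word itself.
        context = {u[0] for j, u in enumerate(units) if j != i}
        active = {d for d, words in src_domains.items() if words & context}
        scores = [2.0 if any(t in trg_domains.get(d, ()) for d in active) else 0.0
                  for t in targets]
        if len(targets) > 1 and any(scores):
            # Highest weight first, so a selector can simply take the top one.
            ranked = sorted(zip(targets, scores), key=lambda p: -p[1])
            targets = [f"{t}<{w}>" for t, w in ranked]
        out.append("^" + "/".join([src, *targets]) + "$")
    return "\n".join(out)


print(reweight("^querer/voler$\n^mango/mànec/mango$\n^y/i$\n^manzana/poma$"))
```

Unambiguous units pass through untouched, and an ambiguous unit with no
supporting context is left unweighted so that the existing lexical selection
rules still decide.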

Given a few more examples I'm sure I could come up with a mockup of
how it would work and we could go from there.

Fran

