Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)
> Just because we "can" add information, does not mean we "should".

Yes, I agree. But I think the "material" example that Hèctor raised (*for instance, as a rule, the Catalan preposition "de" is translated as "de" in French, but if the following word is a material, "en" must be selected: de fusta > en bois*) is a good one, where transfer (an improved one, for sure) would also benefit from having that information available.

Message from Francis Tyers on Mon, 15 June 2020 at 18:45:

> On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> > Here come several practical examples. I tried to select them for their variety. The result is more a wish list than something structured.
> >
> > Let's begin with "je la baise". Depending on the context, this may be "I kiss her" or "I fuck her". The context can tell us whether we are dealing with formal or colloquial language. Another point is that anaphora resolution can also help us in this case: if the pronoun refers to "hand", it can only be "kiss"; if it refers to a person, the doubt persists.
> >
> > Another kind of problem is the Arpitan words "chamô" ("camel"; plural "camels") and "chamôs" ("chamois"; unchanged in the plural). So, translating into French, yesterday I got chamois in a Bible text from Exodus xD. I solved it by deciding in a CG rule that all "chamôs" (with nothing around them in the singular) are camels. (Similar cases in French: fil/fils, foi/fois, cour/cours.)
> >
> > In French there are plenty of words with different meanings depending on the gender: livre, page, tour, etc. The problem is that often the immediate surrounding context does not disambiguate: des livres, les pages, de tour, etc. A similar but slightly different case is word pairs such as homicide mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide m, etc.: the one with the gender "mf" is a person and the other is the action.
> >
> > Other problems come up in lexical selection. For instance, as a rule, the Catalan preposition "de" is translated as "de" in French, but if the following word is a material, "en" must be selected (de fusta > en bois). So in the Catalan2French lrx file we have a list of materials, just as we have a list of countries, a list of musical instruments, a list of animals, etc. I dream of a monolingual dictionary where we could get this kind of information. It is not useful to duplicate these lists across the many language pairs that use Catalan. This information should be in apertium-cat and not in every apertium-cat-xxx lrx file.
> >
> > Moreover, if we had words not only with different kinds of semantic labels but also marked as synonyms, maybe it would be possible to give a translation using a word labelled as a synonym (if it has a translation) instead of "unknown".
>
> These are excellent examples. I'm just about to go out, but will address them when I get back. Thanks for the ideas.
>
> Note that my suggestion was to include this information in the monolingual packages.
>
> Fran

--
Xavi Ivars
http://xavi.ivars.me
___
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)
On 2020-06-15 17:38, Hèctor Alòs i Font wrote:
> Here come several practical examples. I tried to select them for their variety. The result is more a wish list than something structured.
>
> Let's begin with "je la baise". Depending on the context, this may be "I kiss her" or "I fuck her". The context can tell us whether we are dealing with formal or colloquial language. Another point is that anaphora resolution can also help us in this case: if the pronoun refers to "hand", it can only be "kiss"; if it refers to a person, the doubt persists.
>
> Another kind of problem is the Arpitan words "chamô" ("camel"; plural "camels") and "chamôs" ("chamois"; unchanged in the plural). So, translating into French, yesterday I got chamois in a Bible text from Exodus xD. I solved it by deciding in a CG rule that all "chamôs" (with nothing around them in the singular) are camels. (Similar cases in French: fil/fils, foi/fois, cour/cours.)
>
> In French there are plenty of words with different meanings depending on the gender: livre, page, tour, etc. The problem is that often the immediate surrounding context does not disambiguate: des livres, les pages, de tour, etc. A similar but slightly different case is word pairs such as homicide mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide m, etc.: the one with the gender "mf" is a person and the other is the action.
>
> Other problems come up in lexical selection. For instance, as a rule, the Catalan preposition "de" is translated as "de" in French, but if the following word is a material, "en" must be selected (de fusta > en bois). So in the Catalan2French lrx file we have a list of materials, just as we have a list of countries, a list of musical instruments, a list of animals, etc. I dream of a monolingual dictionary where we could get this kind of information. It is not useful to duplicate these lists across the many language pairs that use Catalan. This information should be in apertium-cat and not in every apertium-cat-xxx lrx file.
>
> Moreover, if we had words not only with different kinds of semantic labels but also marked as synonyms, maybe it would be possible to give a translation using a word labelled as a synonym (if it has a translation) instead of "unknown".

These are excellent examples. I'm just about to go out, but will address them when I get back. Thanks for the ideas.

Note that my suggestion was to include this information in the monolingual packages.

Fran
Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)
On 2020-06-15 17:13, Xavi Ivars wrote:
> Message from Francis Tyers on Mon, 15 June 2020 at 17:26:
> > [...]
> > And pass it to the lexical selection module which will choose the one with the highest weight.
> >
> > This would mean a new module, but it would require only minor changes to the bilingual dictionary and lexical selection, and wouldn't have any effect on transfer.
> > [...]
>
> The difference between your approach and mine is that your proposal is tightly coupled to the order of the modules in the pipeline. The new module would write <2.0> and apertium-lex-tools would need to read it and remove it from the pipeline.

Doing that part is really trivial.

> Ideally, I'd like to decouple setting the "domain" of a word from using it. If something just after the tagger, still as part of the "analysis" phase of the translation, puts that information in there, then it can be used by "lex-tools", but also by other modules that may need it. If we don't do this, multiple modules may need to read the "domain list" data to assign the right domain to a given word.

What are the other cases, aside from lexical selection, where the domain list would be required? Are there examples of needing to do morphological disambiguation or transfer differently depending on semantic domain?

And if this information might help in disambiguation or transfer, would it help substantially over implementing, e.g., word embeddings for the tagger and lexical selection?

When I wrote the lexical selection component, I looked into doing word-sense disambiguation on the source side. I didn't find any evidence that it would substantially increase translation performance, i.e. doing WSD without reference to the target language is usually more trouble than it is worth.

Although I'm open to being convinced, with evidence... Just because we "can" add information, does not mean we "should".

Fran
Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)
Here come several practical examples. I tried to select them for their variety. The result is more a wish list than something structured.

Let's begin with "je la baise". Depending on the context, this may be "I kiss her" or "I fuck her". The context can tell us whether we are dealing with formal or colloquial language. Another point is that anaphora resolution can also help us in this case: if the pronoun refers to "hand", it can only be "kiss"; if it refers to a person, the doubt persists.

Another kind of problem is the Arpitan words "chamô" ("camel"; plural "camels") and "chamôs" ("chamois"; unchanged in the plural). So, translating into French, yesterday I got chamois in a Bible text from Exodus xD. I solved it by deciding in a CG rule that all "chamôs" (with nothing around them in the singular) are camels. (Similar cases in French: fil/fils, foi/fois, cour/cours.)

In French there are plenty of words with different meanings depending on the gender: livre, page, tour, etc. The problem is that often the immediate surrounding context does not disambiguate: des livres, les pages, de tour, etc. A similar but slightly different case is word pairs such as homicide mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide m, etc.: the one with the gender "mf" is a person and the other is the action.

Other problems come up in lexical selection. For instance, as a rule, the Catalan preposition "de" is translated as "de" in French, but if the following word is a material, "en" must be selected (de fusta > en bois). So in the Catalan2French lrx file we have a list of materials, just as we have a list of countries, a list of musical instruments, a list of animals, etc. I dream of a monolingual dictionary where we could get this kind of information. It is not useful to duplicate these lists across the many language pairs that use Catalan. This information should be in apertium-cat and not in every apertium-cat-xxx lrx file.

Moreover, if we had words not only with different kinds of semantic labels but also marked as synonyms, maybe it would be possible to give a translation using a word labelled as a synonym (if it has a translation) instead of "unknown".

Hèctor

Message from Francis Tyers on Mon, 15 June 2020 at 18:26:

> On 2020-06-15 15:02, Xavi Ivars wrote:
> > Hello,
> >
> > To decouple conversations on how to store secondary information from the use case I had in mind (which can be achieved regardless of how we store and propagate that data), let me explain how I see this functionality working, using some sort of "apertium pipeline trace" (simplified, many tags missing).
> >
> > This is how we currently handle this "mango" issue in spa-cat: changing the "lemma".
> >
> > This is how I envision it. The key points here are: a monolingual module that adds the data to the pipeline, and a bilingual module (probably lex-tools?) that makes use of that information to decide the best translation.
> >
> > Please don't look into the exact implementation: there are pieces where I don't know exactly which module would be the one doing things. Also, please don't take the "secondary tags" form as defining the semantics: I'm using it just for readability in this example but, again, that data could be persisted anywhere.
> >
> > This is why I thought Tanmai's work could be useful for this: if a module can add this data to the stream, a module later in the pipeline (probably apertium-lex-tools, or biltrans itself?) could use it to decide what the right translation is.
> >
> > Does it make sense?
>
> Thanks Xavi for the ideas...
>
> What I've been thinking about is a module that would go after biltrans and before lexical selection. It would essentially reweight the possible translations based on a bag of words over a fixed window of words or "sentences" (delimited with '.').
>
> You could have source and target components, so e.g. you might say that "fruit" is a semantic field or domain which includes,
>
> "mango", "manzana", "plátano", "naranja", ...
>
> and
>
> "mango", "taronja", "poma"
>
> In Catalan. These would be in the monolingual pairs. The module would take both lists and the input
>
> ^querer/voler$
> ^mango/mànec/mango$
> ^y/i$
> ^manzana/poma$
>
> And try and maximise semantic coherence, then it could reweight, so e.g.
>
> ^querer/voler$
> ^mango/mango<2.0>/mànec<0.0>$
> ^y/i$
> ^manzana/poma$
>
> And pass it to the lexical selection module which will choose the one with the highest weight.
>
> This would mean a new module, but it would require only minor changes to the bilingual dictionary and lexical selection, and wouldn't have any effect on transfer.
>
> Given a few more examples I'm sure I could come up with a mockup of how it would work and we could go from there.
>
> Fran
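The "de fusta > en bois" wish in the message above can be illustrated with a minimal sketch, assuming semantic labels are stored once per language and looked up by any pair-specific rule. Everything named here (`CAT_SEMANTIC_LABELS`, `has_label`, `translate_de`, and the two extra lemmas "ferro" and "vidre") is hypothetical and only for illustration; this is not the lrx rule format and not how apertium-cat stores data.

```python
# Hypothetical per-language label store: in the proposal this would live
# once in the monolingual package (apertium-cat), not in each pair.
CAT_SEMANTIC_LABELS = {
    "fusta": {"material"},   # wood (example from the thread)
    "ferro": {"material"},   # iron (illustrative addition)
    "vidre": {"material"},   # glass (illustrative addition)
}


def has_label(lemma, label, labels=CAT_SEMANTIC_LABELS):
    """Return True if the monolingual data marks `lemma` with `label`."""
    return label in labels.get(lemma, set())


def translate_de(next_lemma):
    """Choose the French translation of Catalan "de" from the next lemma.

    Default is "de"; "en" is selected when the following word is a material.
    """
    return "en" if has_label(next_lemma, "material") else "de"


if __name__ == "__main__":
    print(translate_de("fusta"))  # material, so "en" is selected
    print(translate_de("casa"))   # no label, so the default "de" is kept
```

The point of the design is only that a Catalan-to-French rule and a rule in any other pair with Catalan as source could share the same label lookup instead of each carrying its own list of materials.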
Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)
Message from Francis Tyers on Mon, 15 June 2020 at 17:26:

> [...]
>
> And pass it to the lexical selection module which will choose the one with the highest weight.
>
> This would mean a new module, but it would require only minor changes to the bilingual dictionary and lexical selection, and wouldn't have any effect on transfer.
> [...]

The difference between your approach and mine is that your proposal is tightly coupled to the order of the modules in the pipeline. The new module would write <2.0> and apertium-lex-tools would need to read it and remove it from the pipeline.

Ideally, I'd like to decouple setting the "domain" of a word from using it. If something just after the tagger, still as part of the "analysis" phase of the translation, puts that information in there, then it can be used by "lex-tools", but also by other modules that may need it. If we don't do this, multiple modules may need to read the "domain list" data to assign the right domain to a given word.

--
Xavi Ivars
http://xavi.ivars.me
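The decoupling argued for above can be sketched as two independent steps: one that annotates words with domains during analysis, and a later one that only reads those annotations. This is a minimal sketch under stated assumptions, not an existing Apertium module: the `Word` class, `annotate_domains`, `select_translation`, the domain list, and the choice of "mànec" as the default translation are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical Spanish domain list; only the annotation step reads it.
SPA_DOMAINS = {"fruit": {"mango", "manzana", "plátano", "naranja"}}


@dataclass
class Word:
    lemma: str
    domains: set = field(default_factory=set)


def annotate_domains(words, domain_lists):
    """Analysis-phase step: attach every domain whose list contains the lemma."""
    for word in words:
        word.domains = {
            name for name, lemmas in domain_lists.items() if word.lemma in lemmas
        }
    return words


def select_translation(index, words, candidates):
    """Later step: pick a translation using only annotations on the words.

    `candidates` maps a domain name to a translation, with None as the
    key of the default translation. The domain list itself is never read here.
    """
    context = set()
    for j, other in enumerate(words):
        if j != index:
            context |= other.domains
    # A domain counts only if another word in the context shares it.
    for domain in sorted(words[index].domains & context):
        if domain in candidates:
            return candidates[domain]
    return candidates[None]


if __name__ == "__main__":
    sentence = [Word("querer"), Word("mango"), Word("y"), Word("manzana")]
    annotate_domains(sentence, SPA_DOMAINS)
    print(select_translation(1, sentence, {"fruit": "mango", None: "mànec"}))
```

Any other consumer later in the pipeline could read `Word.domains` in the same way, which is the property the message asks for: the domain list is loaded in one place only.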
Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)
On 2020-06-15 15:02, Xavi Ivars wrote:
> Hello,
>
> To decouple conversations on how to store secondary information from the use case I had in mind (which can be achieved regardless of how we store and propagate that data), let me explain how I see this functionality working, using some sort of "apertium pipeline trace" (simplified, many tags missing).
>
> This is how we currently handle this "mango" issue in spa-cat: changing the "lemma".
>
> This is how I envision it. The key points here are: a monolingual module that adds the data to the pipeline, and a bilingual module (probably lex-tools?) that makes use of that information to decide the best translation.
>
> Please don't look into the exact implementation: there are pieces where I don't know exactly which module would be the one doing things. Also, please don't take the "secondary tags" form as defining the semantics: I'm using it just for readability in this example but, again, that data could be persisted anywhere.
>
> This is why I thought Tanmai's work could be useful for this: if a module can add this data to the stream, a module later in the pipeline (probably apertium-lex-tools, or biltrans itself?) could use it to decide what the right translation is.
>
> Does it make sense?

Thanks Xavi for the ideas...

What I've been thinking about is a module that would go after biltrans and before lexical selection. It would essentially reweight the possible translations based on a bag of words over a fixed window of words or "sentences" (delimited with '.').

You could have source and target components, so e.g. you might say that "fruit" is a semantic field or domain which includes,

"mango", "manzana", "plátano", "naranja", ...

and

"mango", "taronja", "poma"

In Catalan. These would be in the monolingual pairs. The module would take both lists and the input

^querer/voler$
^mango/mànec/mango$
^y/i$
^manzana/poma$

And try and maximise semantic coherence, then it could reweight, so e.g.

^querer/voler$
^mango/mango<2.0>/mànec<0.0>$
^y/i$
^manzana/poma$

And pass it to the lexical selection module which will choose the one with the highest weight.

This would mean a new module, but it would require only minor changes to the bilingual dictionary and lexical selection, and wouldn't have any effect on transfer.

Given a few more examples I'm sure I could come up with a mockup of how it would work and we could go from there.

Fran
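The reweighting step described above could look roughly like the following sketch. It rests on several assumptions that are not in the message: the stream is simplified to `^source/target1/target2$` units with no tags, the whole input string is treated as one window, and the scoring rule (one point for each neighbouring source lemma and one for each neighbouring unit with a target lemma that shares a domain with the candidate) is just one possible way to "maximise semantic coherence". The names `reweight`, `parse_stream`, and `domains_of` are hypothetical.

```python
import re

# Domain lists from the example; in the proposal they would come from
# the monolingual data for each language.
SOURCE_DOMAINS = {"fruit": {"mango", "manzana", "plátano", "naranja"}}  # Spanish
TARGET_DOMAINS = {"fruit": {"mango", "taronja", "poma"}}                # Catalan

LU_PATTERN = re.compile(r"\^([^$]*)\$")


def parse_stream(stream):
    """Split a simplified biltrans stream into (source, [targets]) pairs."""
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def domains_of(lemma, domain_lists):
    """Return the names of all domains whose list contains `lemma`."""
    return {name for name, lemmas in domain_lists.items() if lemma in lemmas}


def reweight(stream):
    """Attach a weight to each candidate of every ambiguous unit."""
    units = parse_stream(stream)
    out = []
    for i, (source, targets) in enumerate(units):
        if len(targets) < 2:
            # Unambiguous units pass through unchanged.
            out.append("^" + "/".join([source] + targets) + "$")
            continue
        scored = []
        for cand in targets:
            cand_domains = domains_of(cand, TARGET_DOMAINS)
            score = 0.0
            for j, (other_src, other_tgts) in enumerate(units):
                if j == i:
                    continue
                if cand_domains & domains_of(other_src, SOURCE_DOMAINS):
                    score += 1.0
                if any(cand_domains & domains_of(t, TARGET_DOMAINS)
                       for t in other_tgts):
                    score += 1.0
            scored.append((cand, score))
        # Highest weight first; the sort is stable for ties.
        scored.sort(key=lambda pair: -pair[1])
        weighted = "/".join(f"{cand}<{score:.1f}>" for cand, score in scored)
        out.append(f"^{source}/{weighted}$")
    return " ".join(out)


if __name__ == "__main__":
    print(reweight("^querer/voler$ ^mango/mànec/mango$ ^y/i$ ^manzana/poma$"))
```

On the example input, "manzana" and "poma" each contribute one point to the fruit reading of "mango", so this particular scoring rule yields the `<2.0>` and `<0.0>` weights used in the message. A real module would need a proper stream parser and a bounded window.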