Re: [Apertium-stuff] Semantics in Apertium (was Apertium's Wider Use & Secondary Tags)

Per Tunedal Tue, 16 Jun 2020 01:21:17 -0700

Hi all,
I liked your examples, Hèctor.

1. Synonyms might help with a problem in Swedish. As I've mentioned in the 
past, nouns with t-gender (neuter) in the singular indefinite form cannot be 
combined with adjectives that end in the letter "t" or the letter "d". That 
form is not used because it can neither be pronounced nor written. It is 
usually marked as "nonexistent" or "not used" in Swedish dictionaries and 
grammars.


Some examples:
A lion cannot be afraid! "ett (impossible form of rädd) lejon"
But two lions can: "två rädda lejon"

The same applies to e.g. ghosts (spöken) and children (barn). Normally n-gender 
(utrum) is used for animate nouns (living things) in Swedish, but some words 
have for some reason got the "wrong" gender. When encountering such words you 
have to replace the adjective with a synonym (or rephrase).

2. Genre might be useful for word selection in some cases. In the past I began 
adding info on genre in the Swedish wordlist, for future use.

When choosing a good synonym for "rädd" (afraid) as above, you don't have any 
exact match. Synonyms are e.g. "skrajsen" (have got the wind up), genre fam = 
colloquial/casual/informal (familier), or "skräckslagen" 
(terrified/terror-struck), genre neu = neutral (neutre) or maybe a bit formal; 
"rädd" is much more common.
(BTW the connotations differ between "rädd" and "skräckslagen"; the latter is 
stronger ...)
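
The fallback described above could be sketched roughly as follows. This is only an illustration, not existing Apertium functionality: the lexicon layout, the function name `neuter_form`, and the neuter forms "skrajset" and "skräckslaget" are assumptions made for the example.

```python
# Hypothetical mini-lexicon (not real Apertium data): for each adjective,
# the neuter singular indefinite form (None = form not in use) and a genre.
LEXICON = {
    "rädd": {"neuter_sg_indef": None, "genre": "neu"},
    "skrajsen": {"neuter_sg_indef": "skrajset", "genre": "fam"},
    "skräckslagen": {"neuter_sg_indef": "skräckslaget", "genre": "neu"},
}

# Hypothetical synonym list for adjectives with a missing form.
SYNONYMS = {"rädd": ["skrajsen", "skräckslagen"]}


def neuter_form(lemma, genre="neu"):
    """Return a usable neuter singular indefinite form for `lemma`.

    If the adjective itself has no such form, fall back to a synonym
    that has one, preferring synonyms whose genre matches `genre`.
    Returns None if nothing usable is found.
    """
    own_form = LEXICON[lemma]["neuter_sg_indef"]
    if own_form is not None:
        return own_form
    candidates = [
        s for s in SYNONYMS.get(lemma, [])
        if LEXICON[s]["neuter_sg_indef"] is not None
    ]
    # Stable sort: synonyms with the requested genre come first.
    candidates.sort(key=lambda s: LEXICON[s]["genre"] != genre)
    if not candidates:
        return None
    return LEXICON[candidates[0]]["neuter_sg_indef"]


print(neuter_form("rädd"))         # neutral text
print(neuter_form("rädd", "fam"))  # colloquial text
```

Note that the sketch ignores the difference in strength between the synonyms; a real solution would need some way to rank them on that as well.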

I used the following genres, inspired by Le Petit Robert, the Oxford Advanced 
Learner's Dictionary and Bonniers svenska ordbok:

neu = neutral (neutre)
sol = solemn (solennel)
fam = colloquial/casual/informal (familier)
pej = depreciatory/pejorative (dénigrant/péjoratif)
vulg = vulgar (vulgaire)
old = old-fashioned (vieilli/archaïque)
dial = dialectal (dialectal)

It might be a good idea to agree on which genres to use, and apply them to all 
languages.

3. In the past I began adding domain info to the Swedish wordlist as well. I 
hoped it might be useful for word selection.
I used e.g. c="domain:general style:fam" in the <e>-tag, as proposed by 
Francis. I have no opinion on the best way to add the info; I'm just eager to 
have the possibility to add it, and to use it.

It might be a good idea to agree on domains, as well.
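
To illustrate how such labels could be read back, here is a minimal sketch that extracts the key:value pairs from the c attribute of an <e> element. The sample entry and the function name `parse_labels` are made up for the example, and nothing in the current Apertium tools interprets the attribute this way as far as this thread shows.

```python
import xml.etree.ElementTree as ET


def parse_labels(entry_xml):
    """Parse c="key:value key:value" on an <e> element into a dict."""
    element = ET.fromstring(entry_xml)
    labels = {}
    for item in element.get("c", "").split():
        key, sep, value = item.partition(":")
        if sep:  # skip free-text comments that are not key:value pairs
            labels[key] = value
    return labels


# Hypothetical, simplified dictionary entry (no paradigm reference).
entry = '<e lm="skrajsen" c="domain:general style:fam"><i>skrajsen</i></e>'
print(parse_labels(entry))
```

Since c is otherwise a free-text comment field, anything that is not a key:value pair is simply skipped here.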

Yours,
Per Tunedal

On Mon, Jun 15, 2020, at 18:38, Hèctor Alòs i Font wrote:
> Here come several practical examples. I tried to select them for their 
> variety. The result is more a wish list than something structured.
> 
> Let's begin with "je la baise". Depending on the context this may be "I kiss 
> her" or "I fuck her". The context can tell us if we are in a formal or 
> colloquial type of language. Another issue is that in this case the anaphora 
> resolution can also help us: if the pronoun reference is "hand", it can only 
> be "kiss"; if it is a person, the doubt persists.
> 
> Another kind of problem is the Arpitan words "chamô" ("camel"; plural 
> "camels") and "chamôs" ("chamois"; unchanged in plural). So, translating into 
> French yesterday, I got chamois in a Bible text from Exodus xD I solved it by 
> deciding in a CG rule that all "chamôs" (with nothing singular around them) 
> are camels. (Similar cases in French: fil/fils, foi/fois, cour/cours)
> 
> In French there are plenty of words with different meanings depending on the 
> gender: livre, page, tour, etc. The problem is that often the immediate 
> surrounding context does not disambiguate: des livres, les pages, de tour, 
> etc. A similar but slightly different case is the word pairs homicide 
> mf/homicide m, féminicide mf/féminicide m, parricide mf/parricide m, etc.: the 
> one with the gender "mf" is a person and the other is the action.
> 
> Other problems come in lexical selection. For instance, as a rule, Catalan 
> preposition "de" is translated as "de" in French, but if the following word 
> is a material, "en" must be selected (de fusta > en bois). So in the 
> Catalan2French lrx file we have a list of materials, as we have a list of 
> countries, a list of musical instruments, a list of animals, etc. I dream 
> about a monolingual dictionary where we could get this kind of information. 
> It is not useful to repeat these lists in every language pair using Catalan. 
> This information should be in apertium-cat and not in every apertium-cat-xxx 
> lrx file.
> 
> Moreover, if we had words not only with different kinds of semantic labels, 
> but also marked as synonyms, maybe it'd be possible to give a translation 
> using a word labeled as synonym (if it has a translation) instead of 
> "unknown".
> 
> Hèctor
> 
> Missatge de Francis Tyers <[email protected]> del dia dl., 15 de juny 2020 
> a les 18:26:
>> El 2020-06-15 15:02, Xavi Ivars escribió:
>>  > Hello,
>>  > 
>>  > To decouple conversations on how to store secondary information from
>>  > the use case I had in mind (that can be achieved regardless of how we
>>  > store and propagate that data), let me explain how I see this
>>  > functionality working, but using some sort of "apertium pipeline
>>  > trace" (simplified, many tags missing)
>>  > 
>>  > This is how we currently handle this "mango" issue in spa-cat:
>>  > changing the "lemma".
>>  > 
>>  > This is how I envision it. The key points here are: monolingual module
>>  > that adds the data to the pipeline. Bilingual module (probably
>>  > lex-tools?) that makes use of that information to decide the best
>>  > translation.
>>  > 
>>  > Please don't look into the exact implementation: there are pieces where I
>>  > don't know exactly which module would be the one doing the things. Also,
>>  > please don't look at the "secondary tags" form to define the
>>  > semantics: I'm using it just for readability in this example but,
>>  > again, that data could be persisted anywhere.
>>  > 
>>  > This is why I thought Tanmai's work could be useful for this: if a
>>  > module can add this data to the stream, a module later in the pipeline
>>  > (probably apertium-lex-tools, or biltrans itself?) could use it to
>>  > decide what the right translation is.
>>  > 
>>  > Does it make sense?
>> 
>>  Thanks Xavi for the ideas...
>> 
>>  What I've been thinking about is a module that would go after
>>  biltrans and before lexical selection. It would essentially reweight
>>  the possible translations based on a bag of words over a fixed
>>  window of words or "sentences" (delimited with '.').
>> 
>>  You could have source and target components, so e.g. you might
>>  say that "fruit" is a semantic field or domain which includes,
>> 
>>  "mango", "manzana", "plátano", "naranja", ...
>> 
>>  and
>> 
>>  "mango", "taronja", "poma"
>> 
>>  in Catalan. These would be in the monolingual pairs. The
>>  module would take both lists and the input
>> 
>>  ^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$
>>  ^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$
>>  ^y<cnjcoo>/i<cnjcoo>$
>>  ^manzana<n><f><pl>/poma<n><f><pl>$
>> 
>>  And try and maximise semantic coherence, then it could reweight,
>>  so e.g.
>> 
>>  ^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$
>>  ^mango<n><m><pl>/mango<n><m><pl><2.0>/mànec<n><m><pl><0.0>$
>>  ^y<cnjcoo>/i<cnjcoo>$
>>  ^manzana<n><f><pl>/poma<n><f><pl>$
>> 
>>  And pass it to the lexical selection module which will choose the
>>  one with the highest weight.
>> 
>>  This would mean a new module, but it would require only minor
>>  changes to the bilingual dictionary and lexical selection, and
>>  wouldn't have any effect on transfer.
>> 
>>  Given a few more examples I'm sure I could come up with a mockup of
>>  how it would work and we could go from there.
>> 
>>  Fran
> 
> _______________________________________________
> Apertium-stuff mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>
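
The reweighting module Fran describes in the quoted message could be mocked up roughly like this. It is a sketch under several assumptions, not an existing Apertium module: the field lists are the ones from his example, the window is the whole input, and a translation's score is simply the number of source lemmas in the window that belong to a semantic field the translation also belongs to. The function names are invented.

```python
# Semantic field lists taken from the example in the quoted message.
FIELDS = {
    "fruit": {
        "src": {"mango", "manzana", "plátano", "naranja"},  # Spanish
        "trg": {"mango", "taronja", "poma"},                # Catalan
    }
}


def lemma(reading):
    """Strip the tags from a reading such as 'mango<n><m><pl>'."""
    return reading.split("<", 1)[0]


def parse_unit(unit):
    """Split '^src/trg1/trg2$' into (src, [trg1, trg2]).

    Naive: escaped slashes inside lemmas are not handled.
    """
    source, *targets = unit.strip()[1:-1].split("/")
    return source, targets


def reweight(units, fields=FIELDS):
    """Add weights to ambiguous units; leave unambiguous ones untouched."""
    parsed = [parse_unit(u) for u in units]
    window = [lemma(source) for source, _ in parsed]
    result = []
    for unit, (source, targets) in zip(units, parsed):
        if len(targets) < 2:
            result.append(unit)
            continue
        scored = []
        for target in targets:
            score = 0.0
            for field in fields.values():
                if lemma(target) in field["trg"]:
                    score += sum(1 for w in window if w in field["src"])
            scored.append((score, target))
        # Highest weight first; ties keep dictionary order (stable sort).
        scored.sort(key=lambda pair: -pair[0])
        weighted = "/".join(f"{t}<{s:.1f}>" for s, t in scored)
        result.append(f"^{source}/{weighted}$")
    return result


units = [
    "^querer<vblex><pri><p3><sg>/voler<vblex><pri><p3><sg>$",
    "^mango<n><m><pl>/mànec<n><m><pl>/mango<n><m><pl>$",
    "^y<cnjcoo>/i<cnjcoo>$",
    "^manzana<n><f><pl>/poma<n><f><pl>$",
]
for line in reweight(units):
    print(line)
```

With this scoring, "mango" and "manzana" both count towards the fruit field, so the Catalan "mango" gets 2.0 and "mànec" gets 0.0, as in the quoted example. A real module would need a bounded window and a less naive stream parser.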

