On Tue, 26 Jun 2012 at 15:00 +0200, Víctor M. Sánchez-Cartagena
wrote:
> Hi,
> 
> 
> I am developing a method to learn Apertium shallow-transfer rules from
> the translations of small chunks provided by non-expert users. In
> order to generalise the learned rules, I check the bilingual
> dictionary which, as far as I know, encodes only the lemma and the
> tags changed when translating from source language to target language.
> However, I have found the following entries in the Spanish-Catalan
> bilingual dictionary:
> 
> 
> <e>       <i>el<s n="det"/><s n="def"/><s n="f"/><s n="pl"/></i></e>
> <e>       <i>el<s n="det"/><s n="def"/><s n="f"/><s n="sg"/></i></e>
> <e>       <i>el<s n="det"/><s n="def"/><s n="m"/><s n="pl"/></i></e>
> <e>       <i>el<s n="det"/><s n="def"/><s n="m"/><s n="sg"/></i></e>
> 
> 
> I think that, as the gender and the number don't change, they could be
> written using only one entry:
> 
> 
> <e>       <i>el<s n="det"/><s n="def"/></i></e>
> 
> 
> It would be very useful for the approach I am developing to remove
> this redundancy from the bilingual dictionary. But, before making a
> commit, I want to be sure that I'm not breaking anything. Do you think
> the proposed change is correct?

Which strategy is taken depends on the language pair. Some language
pairs use a combination of both.

I usually encode all relevant data when building a bilingual dictionary,
to make later reuse easier. For example, in the Breton--French
dictionary, I put POS + gender on both sides even if the gender doesn't
change. That way you don't have to look up (possibly ambiguous) lemmas
in the morphology.

I also never use <i> in the bilingual dictionary, and would favour
rewriting entries with <i> to use <p><l></l><r></r></p>, again to make
reuse easier.
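
The rewrite from <i> to <p><l></l><r></r></p> could be automated. Below is a minimal Python sketch, not an existing Apertium tool: the function name expand_identity is made up for illustration, and it assumes each <e> element is parsed on its own with the standard library's xml.etree.ElementTree.

```python
import copy
import xml.etree.ElementTree as ET


def expand_identity(entry: ET.Element) -> ET.Element:
    """Return a copy of an <e> entry with every <i> child rewritten
    as <p><l>...</l><r>...</r></p>. (Hypothetical helper, not part of
    Apertium.)"""
    new_entry = ET.Element(entry.tag, dict(entry.attrib))
    new_entry.text = entry.text
    new_entry.tail = entry.tail
    for child in entry:
        if child.tag != "i":
            # Anything that is not an identity element is copied unchanged.
            new_entry.append(copy.deepcopy(child))
            continue
        pair = ET.SubElement(new_entry, "p")
        pair.tail = child.tail
        for side in ("l", "r"):
            # Both sides get the same lemma and tags as the <i> element.
            half = copy.deepcopy(child)
            half.tag = side
            half.tail = None
            pair.append(half)
    return new_entry


src = '<e><i>el<s n="det"/><s n="def"/></i></e>'
converted = expand_identity(ET.fromstring(src))
print(ET.tostring(converted, encoding="unicode"))
```

The sample entry is the collapsed one from the quoted question. Attributes on <e>, such as r="RL", are carried over unchanged.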

In the case of determiners, where they decline the same, I think the
following entry would be fine:

<e><p><l>el<s n="det"/><s n="def"/></l><r>el<s n="det"/><s n="def"/></r></p></e>

but in es-ca you have l', which can be mf/sg, so you would probably need
another entry:

<e r="RL"><p><l>el<s n="det"/><s n="def"/><s n="GD"/><s n="sg"/></l><r>el<s n="det"/><s n="def"/><s n="mf"/><s n="sg"/></r></p></e>

In any case, I think it is probably not a good idea to assume that the
bilingual dictionary only encodes "different" information. If there is
another way to find out which tags change, that would be better.
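
One possible way, sketched here in Python under assumptions, is to compare the tags on the two sides of each entry directly rather than relying on how the entry happens to be written. The function name tag_differences and the sample entries are illustrative only (the samples use the def tag from the quoted question), and the sketch only handles plain <i> and <p> entries.

```python
import xml.etree.ElementTree as ET


def tag_differences(entry: ET.Element):
    """Return (tags only on the left, tags only on the right) for one
    <e> entry. (Hypothetical helper, not part of Apertium.)"""
    if entry.find("i") is not None:
        # An identity entry has the same tags on both sides by definition.
        return [], []
    pair = entry.find("p")
    if pair is None or pair.find("l") is None or pair.find("r") is None:
        raise ValueError("entry has neither <i> nor a complete <p> pair")
    left = [s.get("n") for s in pair.find("l").iter("s")]
    right = [s.get("n") for s in pair.find("r").iter("s")]
    only_left = [t for t in left if t not in right]
    only_right = [t for t in right if t not in left]
    return only_left, only_right


# A <p> entry that repeats identical tags on both sides.
same = ('<e><p><l>el<s n="det"/><s n="def"/></l>'
        '<r>el<s n="det"/><s n="def"/></r></p></e>')
# A <p> entry where the gender tag differs between the sides.
differs = ('<e r="RL"><p><l>el<s n="det"/><s n="def"/><s n="GD"/><s n="sg"/></l>'
           '<r>el<s n="det"/><s n="def"/><s n="mf"/><s n="sg"/></r></p></e>')

print(tag_differences(ET.fromstring(same)))
print(tag_differences(ET.fromstring(differs)))
```

With this kind of check, an entry that spells out unchanged tags on both sides is treated the same as one that omits them.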

Fran



_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff