Re: [Apertium-stuff] Toponims and gender/number in (some) monolingual dictionaries

Tommi A Pirinen Wed, 31 May 2017 03:48:01 -0700

[Replies inline]
On Tue, 30 May 2017 20:27:19 +0200
Xavi Ivars <xavi.iv...@gmail.com> wrote:
> the Catalan monodix (and many other "big" monodixes,
> like the Spanish one) don't have gender and number information for
> toponyms.


Is there a linguistic reason for this decision, or were these entries
added automatically from data, with the values left open for that
reason? I don't know the languages, but the reason I'm asking is that if
toponyms don't participate in grammatical processes like agreement, it
may be reasonable to leave that part out of a monodix.

> On top of that, they're tagged (for historical reasons) as
> "loc", instead of top.

Surely that should just be search&replaced then.
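
A minimal sketch of that search and replace, in Python. It assumes the
toponym tag always appears as `<s n="loc"/>` directly after
`<s n="np"/>`; the function name is made up for illustration, and the
restriction to np is only there in case `loc` is used for something
else in the same dictionary.

```python
import re

# Assumption: the legacy tag is written exactly as <s n="loc"/> and
# follows <s n="np"/>, possibly with whitespace in between.
LOC_AFTER_NP = re.compile(r'(<s n="np"/>\s*)<s n="loc"/>')


def retag_loc_as_top(dix_text):
    """Return dix_text with np.loc rewritten to np.top; other uses of loc are kept."""
    return LOC_AFTER_NP.sub(r'\1<s n="top"/>', dix_text)
```

The `<sdef>` declarations at the top of the dictionary are not touched
by this, so a `top` symbol definition would still have to be added by
hand if it is missing.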

> That makes it harder to properly translate to other languages like
> French, or from other languages, like English: "França és un estat"
> should get translated as "*La* France est un État", where *La *is the
> feminine singular article, but that forces adding all the
> morphological information into the bidix. Similarly, "The United
> States" should be translated into Catalan as "*Els* Estats Units",
> where *Els* is the masculine plural one.

Well, this information should probably be in all monodixes and many
bidixes anyway: grammatical genders surely match only sporadically,
even between Romance languages, and even number doesn't always
match, e.g. in English one would say "The United States is..." and tag
it sg accordingly. In general, bidixes seem to be a collection of this
sort of gender etc. mapping information; propns don't make a big
difference to that.

> [...]

> After discussing it with Marc and Mikel, and also based on previous
> conversations with Hèctor, seems pretty clear that, in the long term,
> the best option will be update the monolingual dictionaries and
> propagate the changes to all other language pairs that require it.
> But that may be quite a lot of work in the short term.

Well, maintaining language data is hard work, but striving for
correctness of some sort is a worthy goal :-)

> So the questions here would be:
> 
>    - Is there any "easy" way to estimate the work that will need to
> be done to adapt apertium-cat (and apertium-spa) dependent pairs?
> (I'm not even sure what are the "dependent" packages...)

Wouldn't it be relatively easy? Since Apertium still uses a monolithic
svn, you can just go to your svn root of apertium and do some `find
-name '*cat*.dix' -exec grep -c `. Similarly, someone with good bash
skills can of course automate a lot of these changes. Do it like so:

1. Change stuff in apertium-cat
2. Get the lexemes from the diff
3. Grep for these lexemes in the other cat pairs
... Profit.
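
Steps 2 and 3 could be sketched roughly like this in Python. The
function names are hypothetical, and the sketch assumes that changed
monodix entries carry an `lm="..."` attribute and that a plain
substring match (like `grep -F`) is good enough for a first estimate.

```python
import re

# Assumption: each changed monodix entry has its lemma in an lm="..." attribute.
LM_ATTR = re.compile(r'\blm="([^"]+)"')


def lemmas_from_diff(diff_text):
    """Collect lemmas from the added and removed lines of a unified diff."""
    lemmas = set()
    for line in diff_text.splitlines():
        # Skip the ---/+++ file headers; context lines start with a space.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            lemmas.update(LM_ATTR.findall(line))
    return lemmas


def lines_mentioning(dix_text, lemmas):
    """Count, per lemma, the lines of another dictionary that contain it."""
    counts = {}
    for lemma in lemmas:
        hits = sum(1 for line in dix_text.splitlines() if lemma in line)
        if hits:
            counts[lemma] = hits
    return counts
```

Multiword lemmas are written with `<b/>` between the words inside a
dictionary entry, so they would need to be normalised before matching;
the sketch leaves that out.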

>    - Can you think of a smaller increment for this change, so we can
> have "slightly different" monodixes (even if it's just at build time)
> so we are not forced to propagate the changes immediately if we want
> to avoid all dependent packages to break?

I don't know if it helps much, but in a few places I've used hacks like

    <pardef n="anyg" c="when developing ignore genders until ambiguity">
      <e><p><l/><r><s n="f"/></r></p></e>
      <e><p><l/><r><s n="m"/></r></p></e>
      <e><p><l/><r><s n="mf"/></r></p></e>
      <e><p><l/><r><s n="mfn"/></r></p></e>
      <e><p><l/><r><s n="nt"/></r></p></e>
    </pardef>

    <pardef n="anynp" c="just ignore np semantic non-sense">
      <e><i><s n="al"/></i><par n="anyg"/></e>
      <e><i><s n="ant"/></i></e>
      <e><i><s n="ant"/><s n="f"/></i></e>
      <e><i><s n="ant"/><s n="m"/></i></e>
      <e><i><s n="ant"/></i><par n="anyg"/></e>
      <e><i><s n="cog"/></i></e>
      <e><i><s n="cog"/></i><par n="anyg"/></e>
      <e><i><s n="org"/></i><par n="anyg"/></e>
      <e><i><s n="top"/></i></e>
      <e><i><s n="top"/></i><par n="anyg"/></e>
      <e><p><l><s n="top"/></l><r><s n="ant"/></r></p><par n="anyg"/></e>
      <e><p><l><s n="ant"/></l><r><s n="top"/></r></p><par n="anyg"/></e>
      <e><p><l/><r><s n="al"/></r></p><par n="anyg"/></e>
      <e><p><l/><r><s n="ant"/></r></p><par n="anyg"/></e>
      <e><p><l/><r><s n="cog"/></r></p><par n="anyg"/></e>
      <e><p><l/><r><s n="org"/></r></p><par n="anyg"/></e>
      <e><p><l/><r><s n="top"/></r></p><par n="anyg"/></e>
    </pardef>

which provides a bit of protection against such changes, but only for
one direction of the translation pair; it breaks the other direction.
This can also be combined with changing things gradually and living with
unnecessarily ambiguous readings, but I'm not convinced that's not
more trouble than it's worth.


--
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
Entwickler.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.



_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
