Re: [Apertium-stuff] Proper noun classification considered harmful

Flammie A Pirinen Sat, 06 Feb 2021 02:51:31 -0800

Thank you all for a lively discussion. I'll summarise here and reply to
a few of the comments in a typical inline reply format. I think as a tl;dr we
agree to some extent that these rich np annotation tags are specific to
language pairs and steps in the pipeline, and should not hinder
unrelated bidixes and stuff...



On Tue, Feb 02, 2021 at 11:34:40AM +0100, Kevin Brubeck Unhammer wrote:
> 
> Genders are useful when anaphora resolving / in transfer, though only on
> person names. [...] 
> The <ant>, <cog> and <top> tags are used quite a bit in the nob
> disambiguator, but not in transfer.

I think there is an endless amount of lexical information that could be
recorded in tags because it is useful for disambiguation or for some
intermediate step, but most of it should not bother e.g. a bidix that
has no use for it, or all monodixes.
Traditionally a lot of this is hidden, for example in CG and other
formats, in lists/sets of lexemes.

I do think that gendering first names is getting old-fashioned and also
unreliable: most of the super common Finnish names I can think of are
used for both genders, either locally or internationally or both, for
example.

> I tend to underspecify np's in bidix:
> 
> <e> <p><l>Iran<s n="np"/></l><r>Iran<s n="np"/></r></p></e>
> <e> <p><l>Thiel<s n="np"/></l><r>Thiel<s n="np"/></r></p></e>
> <e> <p><l>Saruman<s n="np"/></l><r>Saruman<s n="np"/></r></p></e>
> <e> <p><l>Contras<s n="np"/></l><r>Contras<s n="np"/></r></p></e>

I find this would be the ideal; I even start with the <i> tag...

> so just the monodixen need to be synced.

That's unlikely to happen for all of apertium-langs...

> The remaining problem is when the analyser gives ^Saruman<np><al>$ and
> you try to send that into a generator that expects ^Saruman<np><ant>$.

Yes, and someone else is sure that it is ^Saruman<np><cog>$, and maybe
someone else helpfully says that it is "m" too, and... I'm not trying to
be funny: just within the last few weeks I had to encode something like
'Kristus' as al, ant, ant.m, and cog because of such variation in monodixes.
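
A toy illustration of the mismatch, in Python. This is only a sketch: real
Apertium uses compiled transducers, the two dictionaries below are
hypothetical stand-ins, and it assumes that the bidix lookup copies any
unmatched trailing tags through to the target side.

```python
from typing import Optional

# Hypothetical stand-ins for a compiled bidix and a target-language generator.
BIDIX = {"Saruman<np>": "Saruman<np>"}        # underspecified entry
GENERATOR = {"Saruman<np><ant>": "Saruman"}   # target monodix expects <ant>


def bidix_lookup(analysis: str) -> Optional[str]:
    """Match the longest entry that is a prefix of the analysis at a tag
    boundary, and carry the unmatched trailing tags over unchanged."""
    best = None
    for source in BIDIX:
        rest = analysis[len(source):]
        at_tag_boundary = rest == "" or rest.startswith("<")
        if analysis.startswith(source) and at_tag_boundary:
            if best is None or len(source) > len(best):
                best = source
    if best is None:
        return None
    return BIDIX[best] + analysis[len(best):]


def generate(analysis: str) -> str:
    """Exact-match generation; failures get a leading '#' in this toy model."""
    return GENERATOR.get(analysis, "#" + analysis)


print(generate(bidix_lookup("Saruman<np><ant>")))  # analyser agrees with generator
print(generate(bidix_lookup("Saruman<np><al>")))   # analyser says <al>: no match
```

The underspecified bidix entry matches both analyses, so the failure only
shows up at generation, where the tag sets of the two monodixes disagree.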

 
> We could perhaps use the Giellatekno solution for that, where dixen have
> RL entries that just contain <np> (ie., no cog/ant/al), and some
> transfer step cleans off the tags. Should be a fairly simple change, and
> it's tried and tested in giella-pairs. Since lttoolbox is used mostly
> for languages where np pardefs are small, adding the RL's is like max
> 10 extra lines; for languages requiring hfst it's probably a fairly
> simple twol or xfregex rule?
> 

Yeah, I think having an optionalising filter for those tags in the
generator would be an OK solution. That also allows using the tags if
there is some reason for it; I can perhaps see different paradigms for
the same lemma with different semantics happening...
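
A minimal sketch of such a clean-up step, written in Python over the
Apertium stream format rather than as a twol/xfregex rule or RL dix entries.
The tag set and the function name are assumptions made for illustration, and
the regex ignores escaped characters and multiword lemmas with a '#' part.

```python
import re

# Assumed set of np subclass and gender tags that the generator should not
# have to care about; adjust per language.
OPTIONAL_NP_TAGS = {"al", "ant", "cog", "top", "m", "f", "mf"}

# A simple lexical unit: ^lemma<tag1><tag2>...$
LEXICAL_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")
TAG = re.compile(r"<([^>]+)>")


def strip_np_subtags(stream: str) -> str:
    """Drop the optional tags from every lexical unit that carries <np>."""

    def clean(match: "re.Match[str]") -> str:
        lemma = match.group(1)
        tags = TAG.findall(match.group(2))
        if "np" not in tags:
            return match.group(0)  # leave other parts of speech untouched
        kept = [tag for tag in tags if tag not in OPTIONAL_NP_TAGS]
        return "^" + lemma + "".join(f"<{tag}>" for tag in kept) + "$"

    return LEXICAL_UNIT.sub(clean, stream)


for variant in ("<al>", "<ant>", "<ant><m>", "<cog>"):
    print(strip_np_subtags(f"^Kristus<np>{variant}$"))
```

All four variants come out as the same underspecified unit, so the generator
only needs an entry for plain <np>. A pair that does want the subclass tags
can simply skip this step.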



Héctor said:
> Let's see the example of New-York in
> French. The city is "New-York" without any article but the state in "le
> New-York". The prepositions used in both cases are different in some cases
> (which come to be often in Wikipedia texts). So, they have different
> behaviour in French. In principle, it makes sense to differentiate them in
> the monodix... although I have preferred not to innovate too much, and, as
> you suggest, I've used long def-lists in the transfer files.

This is actually a good example: in theory Finnish has a similar feature,
in that place names can prefer either the inner or the outer locative
case system. Does this mean that every monodix in apertium should contain
a tag on np.top's for <case_inner> or <case_outer>? Of course, language-specific
details are indeed best encoded in e.g. lists and sets.
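
As a rough sketch of the list-based approach, in Python: the set below
stands in for a def-list in a transfer file or a CG LIST, and the function
name, the relation labels and the short case tags are assumptions for
illustration only.

```python
# Hypothetical word list: Finnish place names that take the outer locative
# cases (Tampereella, Rovaniemellä, Vantaalla). Anything else defaults to
# the inner cases (Helsingissä).
OUTER_LOCATIVE_PLACES = {"Tampere", "Rovaniemi", "Vantaa"}

INNER_CASES = {"in": "ine", "from": "ela", "to": "ill"}
OUTER_CASES = {"in": "ade", "from": "abl", "to": "all"}


def locative_case(lemma: str, relation: str) -> str:
    """Pick a case tag for a place name from a list, not from a monodix tag."""
    cases = OUTER_CASES if lemma in OUTER_LOCATIVE_PLACES else INNER_CASES
    return cases[relation]


print(locative_case("Helsinki", "in"))  # inner: inessive
print(locative_case("Tampere", "in"))   # outer: adessive
```

The analyses themselves stay plain <np><top>; only the Finnish-side rules
need to know about the list.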


Bernard said:


> So, an alternative possibility should be to add extra files in language
> branch for when this language is the target language. These files (wordlists)
> could be used in transfer without making more complicated bidixes. So, the
> same file could be written once and used in a lot of language pairs. But if
> the wordlist is long, I don't know if that would degrade transfer speed
> performance compared to adding this information in any bidix for which it is
> useful.

This is what I do with omorfi: a lot of people implementing various
applications have needed different lexical tags, so these can optionally
be joined in to the analyses.
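
The general idea can be sketched in a few lines of Python. This is not
omorfi's actual code; the side list, the function name and the `with_extra`
switch are all hypothetical.

```python
from typing import Iterable

# Hypothetical side list of extra lexical tags, kept outside the core lexicon.
EXTRA_TAGS = {
    "Saruman": ["ant", "m"],
    "Iran": ["top"],
}


def analyse(lemma: str, base_tags: Iterable[str], with_extra: bool = False) -> str:
    """Build an analysis string, joining in the extra tags only on request."""
    tags = list(base_tags)
    if with_extra:
        tags += EXTRA_TAGS.get(lemma, [])
    return "^" + lemma + "".join(f"<{tag}>" for tag in tags) + "$"


print(analyse("Iran", ["np"]))                   # plain analysis
print(analyse("Iran", ["np"], with_extra=True))  # with the extra tags joined in
```

An application that has no use for the extra tags never sees them.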

-- 
Regards, Flammie <https://flammie.github.io>
(Please note, that I will often include my replies inline instead of
top or bottom of the mail)
