On 2020-06-14 19:40, Jonathan Washington wrote:
I could see another way to treat cases like mango¹/mango² or ат¹/ат².

If we were to eventually have a module that holds other arbitrary
information through the pipeline, you could have tags added in the
transducer that are immediately offloaded to the arbitrary information
storage, and are then accessible to disambiguation, lexical selection,
and bidix.

For example, you could have mango<n>[sem:fruit] and
mango<n>[sem:handle] (or whatever) returned by the transducer, with
the second part picked off by another module and sent through the
pipeline in some other format.
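
As a rough illustration of the "picked off by another module" step, here is a minimal Python sketch. The [key:value] bracket syntax follows the example above; the function name split_side_info and the idea of returning a plain dict are assumptions for illustration only, not an existing Apertium module or format.

```python
import re

# Assumed format: a reading followed by zero or more [key:value] blocks.
_SIDE_INFO = re.compile(r"\[([^\[\]:]+):([^\[\]]*)\]")


def split_side_info(analysis):
    """Split an analysis into (bare reading, dict of bracketed key:value pairs)."""
    info = {key: value for key, value in _SIDE_INFO.findall(analysis)}
    reading = _SIDE_INFO.sub("", analysis)
    return reading, info


if __name__ == "__main__":
    # The bare reading continues down the pipeline; the dict would go to
    # whatever side storage the hypothetical module provides.
    print(split_side_info("mango<n>[sem:handle]"))
```

The point is only that the morphological reading and the extra information travel separately after this step.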

This is just me thinking out loud.


These are all interesting ideas and questions.

I do not believe that the .lexc or .dix files are the right place for
this kind of information. They were designed for morphology, not
for lexicography.

Here is an example of doing semantic tagging in .lexc:

https://raw.githubusercontent.com/giellalt/lang-sme/develop/src/fst/stems/nouns.lexc

gearretnjárggahas+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:gearret#njárggahass JOHTOLAT ;
gearretriekti+N+Sem/Rule:gearret#riekºti ALBMI ;
gearretstohpu+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Build:gearret#stohºpu ALBMI ;
gearru+N+Sem/Dummytag:gearru ALBMI ;
gearsi+N+Sem/Dummytag:gearºsi GOAHTI-I ;
geasanas+N+Sem/Tool-music:geasanass JOHTOLAT ;
geasehandoarjja+N+Sem/Money:geasehan#doarºjag8 SEAMU ;
geasehanfanas+N+Sem/Veh:geasehan#fatnas MALIS ;
geasehangaska+N+Sem/Dummytag:geasehan#gasºka GOAHTI-A ;
geasehangollu+N+Sem/Money:geasehan#gollu GOAHTI-U ;

And here is an example from the CG:

https://github.com/giellalt/lang-sme/blob/develop/src/cg3/disambiguator.cg3

REMOVE:KillLocPl (Pl Loc) IF (0 (Sg Com) LINK 0 TOOL OR Sem/Clth OR Sem/Body OR Sem/Money OR NUMUNIT OR Sem/Sign OR Sem/Perc-emo OR Sem/Lang OR Sem/Domain - ("ealáhus") OR VEHICLE OR PROSEANTA OR Sem/Feat-phys OR Sem/State-sick OR ("lasáhus") OR ("tihttel") LINK NOT 0 ("čalbmi") OR ("suotna") OR Sem/Hum) (NEGATE -1 (Dem Pl Loc))(NEGATE 0 (Pl Loc) LINK *-1 OKTA BARRIER NOT-NPMOD) ;

Having a separate file would be an option.

Another question is, should this be source language, target language, or both?

In general, the idea in Apertium has been not to make distinctions unless
they are necessary for translation: the concept of the 'free ride'. We
generally don't need to distinguish the subsenses of estación when
translating from Spanish to Catalan, as estació will generally suffice.

If this is a translation issue, then the appropriate point in the pipeline is lexical selection, and the right unit to disambiguate is ^mango/manec/mango$.

If they are considered separate lexemes, then the right place for disambiguation is morphological disambiguation, ^mango/mango¹/mango²$, and a module for that might be something that reweights analyses using a bag-of-words technique.
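
To make the bag-of-words idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the cue-word table SENSE_CUES is hand-written (a real module would presumably learn weights from a corpus), and reweight is a hypothetical name, not an existing Apertium component.

```python
from collections import Counter

# Hypothetical cue words per lexeme, with made-up weights.
SENSE_CUES = {
    "mango¹": Counter({"fruta": 2, "maduro": 1, "comer": 1}),
    "mango²": Counter({"cuchillo": 2, "martillo": 1, "agarrar": 1}),
}


def reweight(analyses, context):
    """Order analyses by how strongly their cue words overlap with the context.

    The sort is stable, so analyses with equal scores keep their input order.
    """
    bag = Counter(word.lower() for word in context)

    def score(analysis):
        cues = SENSE_CUES.get(analysis, Counter())
        return sum(cues[word] * count for word, count in bag.items())

    return sorted(analyses, key=score, reverse=True)


if __name__ == "__main__":
    print(reweight(["mango¹", "mango²"], ["el", "mango", "del", "cuchillo"]))
```

With no cue words in the context the input order is returned unchanged, so the module would do nothing unless it has evidence.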

So maybe the answer is a secondary file that stores structured information
which any of the modules can read. The key could be a reading or a prefix
of a reading, and the file could contain e.g. embeddings (tuned for either
POS tagging or lexical selection) or subcategorisation information for
transfer, e.g. triples of verb + adposition + noun. It could also contain
things like sets of syntactic labels with weights/frequencies.
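
A minimal Python sketch of the "key is a reading or a prefix of a reading" lookup, under stated assumptions: the in-memory dict STORE stands in for the secondary file, its contents are invented placeholders, and lookup is a hypothetical helper. A key only matches at a tag boundary, so that a key like mango<n> cannot match part of a longer lemma or tag.

```python
# Hypothetical store: keys are readings or prefixes of readings,
# values are arbitrary structured information (placeholder data here).
STORE = {
    "mango<n>": {"sem": "fruit"},
    "mango<n><m><pl>": {"sem": "handle"},
}


def lookup(reading, store=STORE):
    """Return the entry for the longest key that prefixes `reading`, else None.

    A key matches only if the rest of the reading is empty or starts a new tag.
    """
    best = None
    for key in store:
        if not reading.startswith(key):
            continue
        rest = reading[len(key):]
        if rest and not rest.startswith("<"):
            continue  # the match ended in the middle of a lemma or tag
        if best is None or len(key) > len(best):
            best = key
    return store[best] if best is not None else None


if __name__ == "__main__":
    print(lookup("mango<n><m><sg>"))
```

Longest-match-wins is just one possible policy; merging the entries of all matching prefixes would be another.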

Fran

