On 2020-06-14 19:40, Jonathan Washington wrote:
> I could see another way to treat cases like mango¹/mango² or ат¹/ат².
> If we were to eventually have a module that holds other arbitrary
> information through the pipeline, you could have tags added in the
> transducer that are immediately offloaded to the arbitrary information
> storage, and are then accessible to disambiguation, lexical selection,
> and bidix.
> For example, you could have mango<n>[sem:fruit] and
> mango<n>[sem:handle] (or whatever) returned by the transducer, with
> the second part picked off by another module and sent through the
> pipeline in some other format.
> This is just me thinking out loud.
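
To make the offloading idea concrete, here is a rough sketch of how a
module could pick the bracketed part off a reading. The [key:value]
syntax is taken from the example above; the function name and the
returned dictionary are assumptions for illustration only, not an
existing Apertium module or format.

```python
import re

# Assumed side-information syntax: [key:value] appended to a reading.
SIDE_INFO = re.compile(r"\[([^\[\]:]+):([^\[\]]*)\]")

def split_reading(reading):
    """Split one transducer output into (plain reading, side information)."""
    # Collect every [key:value] block into a dictionary.
    info = {key: value for key, value in SIDE_INFO.findall(reading)}
    # Remove the blocks so the rest of the pipeline sees an ordinary reading.
    plain = SIDE_INFO.sub("", reading)
    return plain, info

plain, info = split_reading("mango<n>[sem:handle]")
print(plain, info)
```

The plain reading would continue through the pipeline unchanged, while
the dictionary would go to whatever storage the later modules consult.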
These are all interesting ideas and questions.
I do not believe that the .lexc or .dix files are the right place for
this kind of information. They were designed for morphology, not
for lexicography.
Here is an example of doing semantic tagging in .lexc:
https://raw.githubusercontent.com/giellalt/lang-sme/develop/src/fst/stems/nouns.lexc
gearretnjárggahas+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:gearret#njárggahass JOHTOLAT ;
gearretriekti+N+Sem/Rule:gearret#riekºti ALBMI ;
gearretstohpu+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Build:gearret#stohºpu ALBMI ;
gearru+N+Sem/Dummytag:gearru ALBMI ;
gearsi+N+Sem/Dummytag:gearºsi GOAHTI-I ;
geasanas+N+Sem/Tool-music:geasanass JOHTOLAT ;
geasehandoarjja+N+Sem/Money:geasehan#doarºjag8 SEAMU ;
geasehanfanas+N+Sem/Veh:geasehan#fatnas MALIS ;
geasehangaska+N+Sem/Dummytag:geasehan#gasºka GOAHTI-A ;
geasehangollu+N+Sem/Money:geasehan#gollu GOAHTI-U ;
And here is an example from the CG:
https://github.com/giellalt/lang-sme/blob/develop/src/cg3/disambiguator.cg3
REMOVE:KillLocPl (Pl Loc) IF (0 (Sg Com) LINK 0 TOOL OR Sem/Clth OR
Sem/Body OR Sem/Money OR NUMUNIT OR Sem/Sign OR Sem/Perc-emo OR Sem/Lang
OR Sem/Domain - ("ealáhus") OR VEHICLE OR PROSEANTA OR Sem/Feat-phys OR
Sem/State-sick OR ("lasáhus") OR ("tihttel") LINK NOT 0 ("čalbmi") OR
("suotna") OR Sem/Hum) (NEGATE -1 (Dem Pl Loc))(NEGATE 0 (Pl Loc) LINK
*-1 OKTA BARRIER NOT-NPMOD) ;
Having a separate file would be an option.
Another question is, should this be source language, target language, or
both?
In general, the idea in Apertium has been to not make distinctions
unless necessary for translation... the concept of the 'free ride'.
We generally don't need to distinguish the subsenses of estación when
translating from Spanish to Catalan as estació will generally suffice.
If this is a translation issue, then the appropriate point of the
pipeline is lexical selection and the right unit to disambiguate is
^mango/manec/mango$.
If they are considered separate lexemes, then the right place for
disambiguation is morphological disambiguation, ^mango/mango¹/mango²$,
and a module for that might look like something that reweights analyses
using a bag-of-words technique.
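
As a very rough sketch of what bag-of-words reweighting could look
like: each analysis gets a score from how many context words match a
set of cue words for it. The cue sets, the assignment of mango¹ to the
fruit and mango² to the handle, and the function name are all made up
for illustration; a real module would learn its weights from a corpus.

```python
from collections import Counter

# Hypothetical cue words per lexeme (assumption: mango¹ = fruit, mango² = handle).
CUES = {
    "mango¹": {"fruta", "comer", "maduro", "dulce"},
    "mango²": {"cuchillo", "sartén", "agarrar", "madera"},
}

def reweight(analyses, context_words, cues=CUES):
    """Rank analyses by the number of context tokens found in their cue set."""
    context = Counter(w.lower() for w in context_words)
    # A Counter returns 0 for unseen words, so missing cues cost nothing.
    scores = {a: sum(context[w] for w in cues.get(a, ())) for a in analyses}
    # sorted() is stable, so tied analyses keep their original order.
    ranked = sorted(analyses, key=lambda a: -scores[a])
    return ranked, scores

sentence = "el mango del cuchillo es de madera".split()
ranked, scores = reweight(["mango¹", "mango²"], sentence)
print(ranked, scores)
```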
So maybe the answer is a secondary file which stores structured
information that can be read by any of the modules. The key could be a
reading or a prefix of a reading, and the file could contain e.g.
embeddings (tuned for either POS tagging or lexical selection) or
subcategorisation information for transfer, e.g. triples of
verb + adposition + noun. It could also contain things like sets of
syntactic labels with weights/frequencies.
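
A minimal sketch of the lookup side of such a file, assuming it is
stored as JSON with readings or reading prefixes as keys. The entries,
field names and values below are invented purely for illustration;
nothing here is an existing Apertium format.

```python
import json

# Hypothetical file content: keys are readings or prefixes of readings.
SIDE_FILE = json.loads("""
{
  "mango<n>": {"sem": ["fruit", "handle"]},
  "mango<n><m><sg>": {"freq": 0.8},
  "pensar<vblex>": {"subcat": [["pensar", "en", "n"]]}
}
""")

def lookup(reading, table=SIDE_FILE):
    """Return the entry whose key is the longest prefix of `reading`, or None."""
    best = None
    for key in table:
        if reading.startswith(key) and (best is None or len(key) > len(best)):
            best = key
    return table[best] if best is not None else None

print(lookup("mango<n><m><pl>"))
print(lookup("pensar<vblex><pri><p1><sg>"))
```

Plain string prefixes keep the sketch short; a real implementation
would probably match on tag boundaries and use a trie rather than a
linear scan.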
Fran
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff