Re: [Apertium-stuff] Apertium's Wider Use & Secondary Tags

Francis Tyers Sun, 14 Jun 2020 12:12:02 -0700


On 2020-06-14 19:40, Jonathan Washington wrote:

I could see another way to treat cases like mango¹/mango² or ат¹/ат².


If we were to eventually have a module that holds other arbitrary
information through the pipeline, you could have tags added in the
transducer that are immediately offloaded to the arbitrary information
storage, and are then accessible to disambiguation, lexical selection,
and bidix.

For example, you could have mango<n>[sem:fruit] and
mango<n>[sem:handle] (or whatever) returned by the transducer, with
the second part picked off by another module and sent through the
pipeline in some other format.

This is just me thinking out loud.


These are all interesting ideas and questions.

I do not believe that the .lexc or .dix files are the right place for
this kind of information. They were designed for morphology, not
for lexicography.

Here is an example of doing semantic tagging in .lexc:

https://raw.githubusercontent.com/giellalt/lang-sme/develop/src/fst/stems/nouns.lexc

gearretnjárggahas+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Hum:gearret#njárggahass JOHTOLAT ;
gearretriekti+N+Sem/Rule:gearret#riekºti ALBMI ;
gearretstohpu+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Build:gearret#stohºpu ALBMI ;

gearru+N+Sem/Dummytag:gearru ALBMI ;
gearsi+N+Sem/Dummytag:gearºsi GOAHTI-I ;
geasanas+N+Sem/Tool-music:geasanass JOHTOLAT ;
geasehandoarjja+N+Sem/Money:geasehan#doarºjag8 SEAMU ;
geasehanfanas+N+Sem/Veh:geasehan#fatnas MALIS ;
geasehangaska+N+Sem/Dummytag:geasehan#gasºka GOAHTI-A ;
geasehangollu+N+Sem/Money:geasehan#gollu GOAHTI-U ;

And here is an example from the CG:

https://github.com/giellalt/lang-sme/blob/develop/src/cg3/disambiguator.cg3

REMOVE:KillLocPl (Pl Loc) IF (0 (Sg Com) LINK 0 TOOL OR Sem/Clth OR Sem/Body OR Sem/Money OR NUMUNIT OR Sem/Sign OR Sem/Perc-emo OR Sem/Lang OR Sem/Domain - ("ealáhus") OR VEHICLE OR PROSEANTA OR Sem/Feat-phys OR Sem/State-sick OR ("lasáhus") OR ("tihttel") LINK NOT 0 ("čalbmi") OR ("suotna") OR Sem/Hum) (NEGATE -1 (Dem Pl Loc)) (NEGATE 0 (Pl Loc) LINK *-1 OKTA BARRIER NOT-NPMOD) ;


Having a separate file would be an option.

Another question is, should this be source language, target language, or
both?

In general, the idea in Apertium has been to not make distinctions
unless necessary for translation... the concept of the 'free ride'. We
generally don't need to distinguish the subsenses of estación when
translating from Spanish to Catalan, as estació will generally suffice.

If this is a translation issue, then the appropriate point in the
pipeline is lexical selection, and the right unit to disambiguate is
^mango/manec/mango$.

If they are considered separate lexemes, then the right place for
disambiguation is morphological disambiguation, where the unit is
^mango/mango¹/mango²$, and a module for it might be something that
reweights analyses using a bag-of-words technique.
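
To make the reweighting idea concrete, here is a rough Python sketch. It
is not an existing Apertium module: the function name, the cue-word
table and the weighting formula are all made up for illustration, and it
assumes that lower weights are preferred. Which superscript is the fruit
and which is the handle is also just an assumption here.

```python
from collections import Counter

# Hypothetical cue words for each lexeme. Which superscript is the fruit and
# which is the handle is an assumption made only for this illustration.
CUES = {
    "mango¹": {"maduro", "dulce", "comer", "fruta"},
    "mango²": {"cuchillo", "sartén", "agarrar", "roto"},
}


def reweight(analyses, context_words, cues=CUES):
    """Return (analysis, weight) pairs, lowest weight (preferred) first."""
    bag = Counter(word.lower() for word in context_words)
    scored = []
    for analysis in analyses:
        # Count how many context tokens are cue words for this analysis.
        overlap = sum(bag[cue] for cue in cues.get(analysis, ()))
        # More overlap gives a lower weight; no overlap leaves weight 1.0.
        scored.append((analysis, 1.0 / (1.0 + overlap)))
    # sorted() is stable, so ties keep the order the analyser produced.
    return sorted(scored, key=lambda pair: pair[1])


ranked = reweight(["mango¹", "mango²"],
                  "el mango del cuchillo está roto".split())
```

In a real module the cue table would presumably be learned from a corpus
rather than written by hand, and the weights would have to be combined
with whatever the analyser already emits.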

So maybe a secondary file which stores structured information that can
be read by any of the modules. The key could be a reading or a prefix of
a reading, and the file could contain e.g. embeddings (tuned for either
POS tagging or lexical selection), subcategorisation information for
transfer, e.g. triples of verb + adposition + noun. It could also
contain things like sets of syntactic labels with weights/frequencies.
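
As a sketch of what "keyed by a reading or a prefix of a reading" could
mean in practice, the following Python looks up the longest tag-level
prefix of a reading in a small store. The JSON layout, the field names
(labels, sem, subcat) and the entries themselves are invented for the
example; nothing like this file format exists in Apertium as far as this
sketch assumes.

```python
import json

# Hypothetical secondary file; keys are readings or prefixes of readings.
STORE_JSON = """
{
  "mango<n>": {"labels": {"@SUBJ": 0.4, "@OBJ": 0.6}},
  "mango<n><sg>": {"sem": ["fruit", "handle"]},
  "hablar<vblex>": {"subcat": [["hablar", "de", "n"]]}
}
"""


def split_reading(reading):
    """Split 'lemma<t1><t2>' into ('lemma', ['t1', 't2'])."""
    lemma, sep, rest = reading.partition("<")
    tags = rest.rstrip(">").split("><") if sep else []
    return lemma, tags


def lookup(store, reading):
    """Return (key, entry) for the longest stored prefix, or None."""
    lemma, tags = split_reading(reading)
    # Try the full reading first, then drop one trailing tag at a time.
    for n in range(len(tags), -1, -1):
        key = lemma + "".join(f"<{tag}>" for tag in tags[:n])
        if key in store:
            return key, store[key]
    return None


store = json.loads(STORE_JSON)
exact = lookup(store, "mango<n><sg>")         # full reading is a key
backoff = lookup(store, "mango<n><pl>")       # falls back to "mango<n>"
verb = lookup(store, "hablar<vblex><inf>")    # falls back to "hablar<vblex>"
missing = lookup(store, "casa<n><sg>")        # no prefix stored
```

Backing off tag by tag means a module can store general information
once under a short key and override it only where a more specific
reading needs it.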

Fran


