Hey all,

I had this idea and would be interested in getting feedback. I think it 
could be nicely split into a series of GCI tasks...

Idea: Make a mode for the apertium-tagger that performs lexicalised 
unigram tagging. This would go in the pipeline after a constraint 
grammar, and resolve remaining ambiguity by just selecting the most 
frequent analysis for a given surface form. It could back off to the 
most frequent tag string in the event that the training data does not 
contain the surface form in question.
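
To make the idea concrete, here's a rough Python sketch of the core 
logic (the function and variable names are just made up for 
illustration, not a proposed interface):

    import re
    from collections import Counter, defaultdict

    def tags_only(analysis):
        # drop the lemma, keep the tag string,
        # e.g. "beer<n><sg>" -> "<n><sg>"
        return ''.join(re.findall(r'<[^>]+>', analysis))

    form_counts = defaultdict(Counter)  # surface form -> analysis counts
    tag_counts = Counter()              # tag string -> count

    def train(pairs):
        # pairs: (surface form, correct analysis) from a tagged corpus
        for form, analysis in pairs:
            form_counts[form.lower()][analysis] += 1
            tag_counts[tags_only(analysis)] += 1

    def pick(form, analyses):
        # among the analyses the CG left us, take the one seen most
        # often with this surface form; back off to the most frequent
        # tag string if the form (or pairing) is unseen
        seen = form_counts.get(form.lower())
        if seen is not None and any(seen[a] for a in analyses):
            return max(analyses, key=lambda a: seen[a])
        return max(analyses, key=lambda a: tag_counts[tags_only(a)])

At runtime it would just read the usual stream format, something like 
^la/la<det><def><f><sg>/la<prn><pro><p3><f><sg>$, and keep only the 
winning analysis.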

The benefits of this over using the existing apertium-tagger would be: 
it would be a _lot_ easier to train (no need for a .tsx file), no 
breakage when you add new multiwords or contractions, and no need to 
worry about tokenisation.

I envisage at least five GCI tasks:

1) Write a prototype in a programming language of your choice (e.g. 
python)
2) Come up with a data format for storing the model (a straw-man 
sketch follows the list)
3) Write a program to train a model from a tagged corpus (we already 
have tagged corpora in some form for at least English, Spanish, 
Catalan, Russian and Tatar)
4) Write a program to run the tagger on a text
5) Integrate the tagger into the apertium-tagger code (could be done 
like the SWPOST one).
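
For (2), one possibility (just a straw man; both the layout and the 
counts here are invented) would be a plain tab-separated counts file 
that the trainer in (3) emits and the tagger in (4) loads:

    la    la<det><def><f><sg>          15440
    la    la<prn><pro><p3><f><sg>       1231
    *     <det><def><f><sg>            20933

with * as a reserved marker for the backoff tag-string counts.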

Another thing, if this gets done, would be to see whether it could be 
trained in a similar way to the bigram tagger / lexical selection 
module, using TL information.

Downside: this would achieve something similar to adding weighted FST 
support to lttoolbox and having lttoolbox output analyses by weight, 
which I think would be the more desirable approach.

Any thoughts?

Fran
