Hey all, I had this idea and would be interested in getting feedback. I think it could be nicely split into a series of GCI tasks...
Idea: Make a mode for the apertium-tagger that performs lexicalised unigram tagging. This would go in the pipeline after a constraint grammar and resolve any remaining ambiguity by simply selecting the most frequent analysis for a given surface form. It could back off to the most frequent tag string when the training data does not contain the surface form.

The benefits over the existing apertium-tagger would be: it would be a _lot_ easier to train (no need for a .tsx file), no breakage when you add new multiwords or contractions, and no need to worry about tokenisation.

I envisage at least five GCI tasks:

1) Write a prototype in a programming language of your choice (e.g. Python)
2) Come up with a data format for storing the model
3) Write a program to train a model from a tagged corpus (we have this now, in some form, for at least English, Spanish, Catalan, Russian and Tatar)
4) Write a program to run the tagger on a text
5) Integrate the tagger into the apertium-tagger code (this could be done like the SWPOST one)

A further task, if those get done, would be to see whether it could be trained in a similar way to the bigram tagger / lexical selection module, using TL information.

Downside: this would achieve something similar to adding weighted FST support to lttoolbox and getting lttoolbox to output analyses ordered by weight, which I think would be the more desirable approach.

Any thoughts ?

Fran
------------------------------------------------------------------------------
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
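For concreteness, here is a minimal sketch of what the prototype (task 1) might look like. All names, the (surface form, analysis) training-pair format, and the choice to reuse the surface form as the lemma during backoff are assumptions for illustration, not a spec:

```python
from collections import Counter, defaultdict

def tags_only(analysis):
    """Strip the lemma from an analysis, e.g. 'dog<n><pl>' -> '<n><pl>'."""
    return analysis[analysis.index("<"):]

def train(tagged_corpus):
    """Build a lexicalised unigram model from (surface form, analysis) pairs:
    per-form analysis counts, plus global tag-string counts for backoff."""
    form_counts = defaultdict(Counter)
    tagstring_counts = Counter()
    for form, analysis in tagged_corpus:
        form_counts[form][analysis] += 1
        tagstring_counts[tags_only(analysis)] += 1
    return form_counts, tagstring_counts

def tag(form, form_counts, tagstring_counts):
    """Pick the most frequent analysis for this surface form; for unseen
    forms, back off to the globally most frequent tag string (here the
    surface form stands in for the lemma, which is only a guess)."""
    if form in form_counts:
        return form_counts[form].most_common(1)[0][0]
    return form + tagstring_counts.most_common(1)[0][0]
```

Training (task 3) would then just be running train() over a tagged corpus and serialising the two count tables in whatever model format task 2 settles on, e.g.:

```python
corpus = [("dogs", "dog<n><pl>"), ("dogs", "dog<n><pl>"),
          ("dogs", "dog<vblex><pri><p3><sg>"), ("cat", "cat<n><sg>")]
form_counts, tagstring_counts = train(corpus)
tag("dogs", form_counts, tagstring_counts)   # most frequent analysis
tag("runs", form_counts, tagstring_counts)   # unseen: backoff tag string
```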