A minor update on: "Note that for level (1) one does not really need to store counts. One can simply store the winning lexical form for each surface form."

I would never store the winning lexical form for unambiguous surface forms.
I would never store it where there is no clear winner (i.e., where one form does not win by a certain number of times).

This would make the list shorter
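
A minimal sketch of that pruning, assuming per-surface-form counters and a hypothetical margin for what counts as a clear win:

    from collections import Counter

    MARGIN = 2.0   # hypothetical: the winner must beat the runner-up by this factor

    def winners(counts):
        # counts: dict mapping each surface form to a Counter over its lexical forms
        table = {}
        for surface, forms in counts.items():
            if len(forms) < 2:                  # unambiguous: nothing to store
                continue
            (best, n1), (_, n2) = forms.most_common(2)
            if n1 >= MARGIN * n2:               # only keep clear winners
                table[surface] = best
        return table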

Mikel

On Sun, 26 Oct 2014 at 7:00 PM, Mikel Forcada <m...@dlsi.ua.es> wrote:
Fran, folks, here's the feedback I promised.

As I said, this is a great idea, particularly to round off the work done by a constraint grammar, and I think the breakdown into GCI tasks could work, except perhaps for the integration into the current tagger. From a tagged training corpus we could collect counts at various levels to use as fallbacks:

(1) Complete lexical forms: cantar.vblex.ifi.1.pl
(2) Lemma-less counts:      *.vblex.ifi.1.pl
(3) Category only:          *.vblex.*

The last two levels can be determined without any need for a configuration file.

That way, for an unknown word we can use some more general counts. These general counts could be obtained from untagged corpora using naïve fractional counting, as was done in SWPOST when no context was taken into account.
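
For illustration only, a minimal sketch of that naïve fractional counting over an analysed but untagged corpus, where each of a token's n analyses receives 1/n of a count (the names here are just illustrative):

    from collections import defaultdict

    def fractional_counts(corpus):
        # corpus: iterable of tokens, each a list of lemma-less tag strings
        # such as ['vblex.ifi.1.pl', 'n.m.sg']; each analysis gets 1/n of a count
        counts = defaultdict(float)
        for analyses in corpus:
            if not analyses:
                continue
            share = 1.0 / len(analyses)
            for tags in analyses:
                counts[tags] += share
        return counts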

Note that for level (1) one does not really need to store counts. One can simply store the winning lexical form for each surface form. Note also that for levels (2) and (3) one does not really need to store counts. An ordered list by decreasing frequency could be enough: the first form found would win.
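
As a rough sketch (the concrete names, example entries and matching are just illustrative, not a fixed design), the three levels could then look something like this:

    # Level (1): winning lexical form per (ambiguous) surface form
    level1 = {'cantàrem': 'cantar.vblex.ifi.1.pl'}   # illustrative entry

    # Levels (2) and (3): tag strings ordered by decreasing frequency;
    # the first one matching a remaining analysis wins
    level2 = ['vblex.pri.3.sg', 'vblex.ifi.1.pl', 'vblex.inf']
    level3 = ['n', 'vblex', 'adj']

    def disambiguate(surface, analyses):
        # analyses: lexical forms (lemma.tags strings) left after the CG for this word
        if surface in level1 and level1[surface] in analyses:
            return level1[surface]
        for tags in level2:                      # lemma-less fallback
            for a in analyses:
                if a.split('.', 1)[1] == tags:
                    return a
        for cat in level3:                       # category-only fallback
            for a in analyses:
                if a.split('.')[1] == cat:
                    return a
        return analyses[0]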

I would keep this separate from FSTs.

Cheers

Mikel

On Sun, 26 Oct 2014 at 4:07 PM, Francis Tyers <fty...@prompsit.com> wrote:
Hey all,

I had this idea and would be interested in getting feedback. I think it could be nicely split into a series of GCI tasks...

Idea: Make a mode for the apertium-tagger that performs lexicalised unigram tagging. This would go in the pipeline after a constraint grammar, and resolve remaining ambiguity by just selecting the most frequent analysis for a given surface form. It could back off to the most frequent tag string when a surface form does not appear in the training data.
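
A rough sketch of the core of such a prototype (function names and data layout are just illustrative, not a settled design):

    from collections import Counter, defaultdict

    def train(tagged_corpus):
        # tagged_corpus: iterable of (surface, lexical_form) pairs from a
        # hand-tagged corpus, e.g. ('cantàrem', 'cantar.vblex.ifi.1.pl')
        by_surface = defaultdict(Counter)
        by_tags = Counter()
        for surface, lexform in tagged_corpus:
            by_surface[surface][lexform] += 1
            by_tags[lexform.split('.', 1)[1]] += 1   # lemma-less tag string
        return by_surface, by_tags

    def tag(surface, analyses, by_surface, by_tags):
        # analyses: the readings left for this surface form after the CG
        if surface in by_surface:
            best = max(analyses, key=lambda a: by_surface[surface][a])
            if by_surface[surface][best] > 0:
                return best
        # back off to the most frequent tag string seen in training
        return max(analyses, key=lambda a: by_tags[a.split('.', 1)[1]])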

The benefit of this over using the existing apertium-tagger would be: it would be a _lot_ easier to train (no need for a .tsx file), there would be no breakage when you add new multiwords or contractions, and no need to worry about tokenisation.

I envisage at least five GCI tasks:

1) Write a prototype in a programming language of your choice (e.g. python)
2) Come up with a data format for storing the model
3) Write a program to train a model from a tagged corpus (we have this now in some way for at least English, Spanish, Catalan, Russian and Tatar)
4) Write a program to run the tagger on a text
5) Integrate the tagger into the apertium-tagger code (could be done like the SWPOST one).

Another thing, if it gets done, would be to see whether it could be trained in a similar way to the bigram tagger/lexical selection module, using TL information.

Downside: this would achieve something similar to adding weighted FST support to lttoolbox and getting lttoolbox to output analyses by weight, which I think would be more desirable.

Any thoughts?

Fran

------------------------------------------------------------------------------
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff