A minor update on: "Note that for level (1) one does not really need to store counts. One can simply store the winning lexical form for each surface form."

I would never store the winning lexical form for unambiguous surface forms.
I would never store it where there is no clear winner (i.e., where one form does not win by a certain number of times).

This would make the list shorter
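
A minimal sketch of that pruning, assuming per-surface-form counters and a hypothetical margin for what counts as a clear win:

    from collections import Counter

    MARGIN = 2.0   # hypothetical: the winner must beat the runner-up by this factor

    def winners(counts):
        # counts: dict mapping each surface form to a Counter over its lexical forms
        table = {}
        for surface, forms in counts.items():
            if len(forms) < 2:                  # unambiguous: nothing to store
                continue
            (best, n1), (_, n2) = forms.most_common(2)
            if n1 >= MARGIN * n2:               # only keep clear winners
                table[surface] = best
        return table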

Mikel

On Sun, 26 Oct 2014 at 7:00 PM, Mikel Forcada <m...@dlsi.ua.es> wrote:
Fran, folks, here's the feedback I promised.

As I said, this is a great idea, particularly to round off the work done by a constraint grammar, and I think the breakdown into GCI tasks could work, except perhaps for the integration into the current tagger. From a tagged training corpus we could collect counts at various levels to use as fallbacks:

(1) Complete lexical forms: cantar.vblex.ifi.1.pl
(2) Lemma-less counts:      *.vblex.ifi.1.pl
(3) Category only:          *.vblex.*

The last two levels can be determined without any need for a configuration file.

That way, for an unknown word we can use some more general counts. These general counts could be obtained from untagged corpora using naïve fractional counting, as was done in SWPOST when no context was taken into account.
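
For illustration only, a minimal sketch of that naïve fractional counting over an analysed but untagged corpus, where each of a token's n analyses receives 1/n of a count (the names here are just illustrative):

    from collections import defaultdict

    def fractional_counts(corpus):
        # corpus: iterable of tokens, each a list of lemma-less tag strings
        # such as ['vblex.ifi.1.pl', 'n.m.sg']; each analysis gets 1/n of a count
        counts = defaultdict(float)
        for analyses in corpus:
            if not analyses:
                continue
            share = 1.0 / len(analyses)
            for tags in analyses:
                counts[tags] += share
        return counts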

Note that for level (1) one does not really need to store counts. One can simply store the winning lexical form for each surface form. Note also that for levels (2) and (3) one does not really need to store counts. An ordered list by decreasing frequency could be enough: the first form found would win.
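
As a rough sketch (the concrete names, example entries and matching are just illustrative, not a fixed design), the three levels could then look something like this:

    # Level (1): winning lexical form per (ambiguous) surface form
    level1 = {'cantàrem': 'cantar.vblex.ifi.1.pl'}   # illustrative entry

    # Levels (2) and (3): tag strings ordered by decreasing frequency;
    # the first one matching a remaining analysis wins
    level2 = ['vblex.pri.3.sg', 'vblex.ifi.1.pl', 'vblex.inf']
    level3 = ['n', 'vblex', 'adj']

    def disambiguate(surface, analyses):
        # analyses: lexical forms (lemma.tags strings) left after the CG for this word
        if surface in level1 and level1[surface] in analyses:
            return level1[surface]
        for tags in level2:                      # lemma-less fallback
            for a in analyses:
                if a.split('.', 1)[1] == tags:
                    return a
        for cat in level3:                       # category-only fallback
            for a in analyses:
                if a.split('.')[1] == cat:
                    return a
        return analyses[0]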

I would keep this separate from FSTs.

Cheers

Mikel

On Sun, 26 Oct 2014 at 4:07 PM, Francis Tyers <fty...@prompsit.com> wrote:
Hey all,

I had this idea and would be interested in getting feedback. I think it could be nicely split into a series of GCI tasks...

Idea: Make a mode for the apertium-tagger that performs lexicalised unigram tagging. This would go in the pipeline after a constraint grammar, and resolve remaining ambiguity by just selecting the most frequent analysis for a given surface form. It could back off to the most frequent tag string when a surface form does not appear in the training data.
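
A rough sketch of the core of such a prototype (function names and data layout are just illustrative, not a settled design):

    from collections import Counter, defaultdict

    def train(tagged_corpus):
        # tagged_corpus: iterable of (surface, lexical_form) pairs from a
        # hand-tagged corpus, e.g. ('cantàrem', 'cantar.vblex.ifi.1.pl')
        by_surface = defaultdict(Counter)
        by_tags = Counter()
        for surface, lexform in tagged_corpus:
            by_surface[surface][lexform] += 1
            by_tags[lexform.split('.', 1)[1]] += 1   # lemma-less tag string
        return by_surface, by_tags

    def tag(surface, analyses, by_surface, by_tags):
        # analyses: the readings left for this surface form after the CG
        if surface in by_surface:
            best = max(analyses, key=lambda a: by_surface[surface][a])
            if by_surface[surface][best] > 0:
                return best
        # back off to the most frequent tag string seen in training
        return max(analyses, key=lambda a: by_tags[a.split('.', 1)[1]])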

The benefit of this over using the existing apertium-tagger would be: it would be a _lot_ easier to train (no need for a .tsx file), there would be no breakage when you add new multiwords or contractions, and no need to worry about tokenisation.

I envisage at least five GCI tasks:

1) Write a prototype in a programming language of your choice (e.g. python)
2) Come up with a data format for storing the model
3) Write a program to train a model from a tagged corpus (we have this now in some way for at least English, Spanish, Catalan, Russian and Tatar)
4) Write a program to run the tagger on a text
5) Integrate the tagger into the apertium-tagger code (could be done like the SWPOST one).

Another thing, if it gets done, would be to see whether it could be trained in a similar way to the bigram tagger/lexical selection module, using TL information.

Downside: this would achieve something similar to adding weighted FST support to lttoolbox and getting lttoolbox to output analyses by weight, which I think would be more desirable.

Any thoughts?

Fran

------------------------------------------------------------------------------
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff