A minor update on:
"Note that for level (1) one does not really need to store counts. One
can simply store the winning lexical form for each surface form."
I would never store winning lexical forms for unambiguous surface forms.
I would never store winning lexical forms where there is no clear winner
(i.e., one that wins at least a certain number of times).
This would make the list shorter.
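A minimal sketch of this pruning, assuming a dict from surface form to
per-analysis counts (all names here are hypothetical, just to illustrate):

```python
# Keep a winning lexical form only for ambiguous surface forms
# where the winner beats the runner-up by a minimum margin.
def prune_winners(counts, min_margin=2):
    """counts: {surface_form: {lexical_form: count}}"""
    winners = {}
    for surface, analyses in counts.items():
        if len(analyses) < 2:
            continue  # unambiguous: nothing needs to be stored
        ranked = sorted(analyses.items(), key=lambda kv: -kv[1])
        if ranked[0][1] - ranked[1][1] >= min_margin:
            winners[surface] = ranked[0][0]
    return winners
```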
Mikel
On Sun, 26 Oct 2014 at 7:00 PM, Mikel Forcada <m...@dlsi.ua.es>
wrote:
Fran, folks, here's the feedback I promised.
As I said, this is a great idea, particularly to round off the work done
by a constraint grammar, and I think the breakdown into GCI tasks could
work, except perhaps for the integration into the current tagger.
From a tagged corpus we could collect counts at various levels, used as
fallbacks:
(1) Complete lexical forms: cantar.vblex.ifi.1.pl
(2) Lemma-less counts: *.vblex.ifi.1.pl
(3) Category only: *.vblex.*
The last two levels can be determined without any need for a
configuration file.
So for an unknown word we can use more general counts.
These general counts could be obtained from untagged corpora using
naïve fractional counting, as was done in SWPOST when no context was
taken into account.
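Naïve fractional counting, as I mean it here, could be sketched like
this: each analysis of an ambiguous token contributes 1/n to its count
(function and variable names are hypothetical):

```python
from collections import defaultdict

def fractional_counts(corpus):
    """corpus: list of tokens, each a list of possible lexical forms.
    Each analysis gets an equal share 1/n of the token's count."""
    counts = defaultdict(float)
    for analyses in corpus:
        share = 1.0 / len(analyses)
        for lexical_form in analyses:
            counts[lexical_form] += share
    return counts
```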
Note that for level (1) one does not really need to store counts. One
can simply store the winning lexical form for each surface form.
Note also that for levels (2) and (3) one does not really need to
store counts. A list ordered by decreasing frequency could be enough:
the first form found would win.
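The three-level fallback could look something like the following sketch
(purely illustrative; the count tables and the dotted lexical-form
format are assumed, not a fixed design):

```python
def backoff_lookup(analyses, lex_counts, tag_counts, cat_counts):
    """Pick among competing analyses of one surface form, trying:
    (1) full lexical form, (2) lemma-less tag string, (3) category."""
    def tags(a):  # cantar.vblex.ifi.1.pl -> vblex.ifi.1.pl
        return a.split(".", 1)[1]
    def cat(a):   # cantar.vblex.ifi.1.pl -> vblex
        return a.split(".")[1]
    for counts, key in ((lex_counts, lambda a: a),
                        (tag_counts, tags),
                        (cat_counts, cat)):
        scored = [(counts.get(key(a), 0), a) for a in analyses]
        if any(score for score, _ in scored):
            return max(scored)[1]
    return analyses[0]  # nothing known at any level: keep the first
```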
I would keep this separate from FSTs.
Cheers
Mikel
On Sun, 26 Oct 2014 at 4:07 PM, Francis Tyers
<fty...@prompsit.com> wrote:
Hey all,
I had this idea and would be interested in getting feedback. I think
it
could be nicely split into a series of GCI tasks...
Idea: Make a mode for the apertium-tagger that performs lexicalised
unigram tagging. This would go in the pipeline after a constraint
grammar, and resolve remaining ambiguity by just selecting the most
frequent analysis for a given surface form. It could back off to the
most frequent tag string in the event that the training data does not
contain all the surface forms.
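As a sketch, the core of such a lexicalised unigram tagger might look
like this (the data structures and the lemma-stripping convention are
assumptions for illustration):

```python
def tag(surface, analyses, form_counts, tagstr_counts):
    """Choose the most frequent analysis seen for this surface form;
    back off to the most frequent tag string if the form is unseen."""
    seen = form_counts.get(surface)
    if seen:
        return max(analyses, key=lambda a: seen.get(a, 0))
    strip = lambda a: a.split(".", 1)[1]  # drop the lemma
    return max(analyses, key=lambda a: tagstr_counts.get(strip(a), 0))
```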
The benefit of this over using the existing apertium-tagger would be:
it would be a _lot_ easier to train (no need for a .tsx file), no
breakage when you add new multiwords or contractions, and no need to
worry about tokenisation.
I envisage at least five GCI tasks:
1) Write a prototype in a programming language of your choice (e.g.
python)
2) Come up with a data format for storing the model
3) Write a program to train a model from a tagged corpus (we have
this
now in some way for at least English, Spanish, Catalan, Russian and
Tatar)
4) Write a program to run the tagger on a text
5) Integrate the tagger into the apertium-tagger code (could be done
like the SWPOST one).
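Task (3) could start from something as simple as the sketch below,
assuming one tab-separated `surface<TAB>lexical_form` pair per line in
the tagged corpus (the format is hypothetical, just to make the idea
concrete):

```python
from collections import defaultdict

def train(lines):
    """Count (surface form, lexical form) pairs from a tagged corpus,
    one tab-separated pair per line."""
    model = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        surface, lexical = line.split("\t")
        model[surface][lexical] += 1
    return model
```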
Another thing, if it gets done, would be to see if it could be trained
in a similar way to the bigram tagger/lexical selection module, using
TL information.
Downside: this would achieve something similar to adding weighted FST
support to lttoolbox and getting lttoolbox to output analyses by
weight, which I think would be more desirable.
Any thoughts?
Fran
------------------------------------------------------------------------------
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff