2009/3/24 Alan Gauld <alan.ga...@btinternet.com>

> Hi,
> That was an interesting post, but I'm not sure what you want help with.
> Is it the word splitting?
> Is it writing the POS tagger?
> Is it comparing the POS tagger to the standard?
> Or all of these?
>
> Alan G.
>
> "Emad Nawfal (عماد نوفل)" <emadnaw...@gmail.com> wrote in message
> news:652641e90903240835o610d013dsd6a81f4675c47...@mail.gmail.com...
>
> Evaluating Swahili Part of Speech Tagging. How can I write a Python script
> for that?
> # The information provided herein about Swahili may not be accurate
> # it is just intended to illustrate the problem
>
> Hi Tutors,
> I would appreciate it if you gave me ideas about how to tackle this
> problem.
>
> Assigning POS tags to words is a major step in many linguistic analyses.
> POS tags give the grammatical category of words, for example:
>
> The        Determiner
> man        Noun
> who        RelativePronoun
> came       Verb
> to         Preposition
> us         AccusativePluralPronoun
> is         CopulaPresent
> an         Determiner
> engineer   Noun
>
> What we usually do is train a Part-of-Speech Tagger, and then test it on an
> already tagged (gold standard) test set. After running the tagger, we get
> something like this:
>
> The        Determiner                Determiner
> man        Noun                      PresentVerb
> who        RelativePronoun           RelativePronoun
> came       Verb                      Verb
> to         Preposition               Preposition
> us         AccusativePluralPronoun   AccusativePluralPronoun
> is         CopulaPresent             CopulaPresent
> an         Determiner                Determiner
> engineer   Noun                      Noun
>
> As can be seen from above, the POS tagger assigned the wrong Part of Speech
> to the word "man". Because the lines are aligned, it is easy to calculate
> the POS tagger accuracy: 8 out of 9 tags are correct (88.9%).
>
> Swahili is a morphologically complex language. The same sentence above is
> usually written as:
>
> theman whocametous isanengineer
>
> This means that we should run a word segmenter before running the POS
> tagger. The word segmenter of course makes mistakes which will affect the
> accuracy of the POS tagger.
> We get an output like the following where the second word (sic) is
> ill-segmented:
>
> # Segmenter + POS Tagger output file
> the        Determiner
> whocame    Noun
> to         Preposition
> us         AccusativePluralPronoun
> is         CopulaPresent
> an         Determiner
> engineer   Noun
>
> Now, how can I measure the accuracy of this output file against the gold
> standard file below, given that the line alignment is lost every time the
> segmenter makes a mistake, which happens at the rate of 15 per 1000 words:
>
> # Gold Standard File
> The        Determiner
> man        Noun
> who        RelativePronoun
> to         Preposition
> us         AccusativePluralPronoun
> is         CopulaPresent
> an         Determiner
> engineer   Noun
>
> Please note that the output file is usually in the range of 100,000 words
>

Hi Alan,

Comparing the POS tagger output to the standard is what I want. I can do it
if I combine the segments into words and the segment tags into complex tags,
which is possible. But I'm wondering whether this can be done using just the
segments.
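One way the comparison might be done on the segments alone is to give every
segment its character offsets in the unsegmented text, so that a predicted
segment counts as correct only if the gold file has a segment with the same
start, end, and tag. The sketch below is an illustration under assumptions,
not a tested solution for the real data: the names read_tagged, spans, and
evaluate are made up for this example, both files are assumed to hold one
"segment tag" pair per line and to cover exactly the same characters, and the
toy data adds "man" to the segmenter output quoted above so that this holds.

```python
def read_tagged(lines):
    """Parse 'segment tag' lines, skipping blank lines and '#' comments."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        segment, tag = line.split()
        pairs.append((segment, tag))
    return pairs


def spans(pairs):
    """Turn (segment, tag) pairs into a set of (start, end, tag) triples."""
    result = set()
    start = 0
    for segment, tag in pairs:
        end = start + len(segment)
        result.add((start, end, tag))
        start = end
    return result


def evaluate(gold, predicted):
    """Return (precision, recall, f1) over segments matched by span and tag."""
    # Assumption: both files spell out the same text, ignoring case.
    gold_text = "".join(seg for seg, _ in gold).lower()
    pred_text = "".join(seg for seg, _ in predicted).lower()
    if gold_text != pred_text:
        raise ValueError("gold and predicted files cover different text")
    gold_spans = spans(gold)
    pred_spans = spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy data adapted from the example in this thread (hypothetical, see above).
GOLD = """
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
"""

PREDICTED = """
# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
"""

gold = read_tagged(GOLD.splitlines())
predicted = read_tagged(PREDICTED.splitlines())
precision, recall, f1 = evaluate(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.875 recall=0.778 f1=0.824
```

On this toy data 7 of the 8 predicted segments match a gold span and tag, so
precision is 7/8, recall is 7/9, and F1 is 14/17. A segmentation mistake only
costs the segments it touches, because the offsets line up again as soon as
the two files agree on a boundary. Building two sets should stay cheap for a
file of about 100,000 words.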
--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
--------------------------------------------------------
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor