Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

# The information provided herein about Swahili may not be accurate;
# it is just intended to illustrate the problem.
# The first message had an error. Sorry for that.
Hi Tutors,

I would appreciate it if you gave me ideas about how to tackle this problem.

Assigning POS tags to words is a major step in many linguistic analyses. POS tags give the grammatical category of words, for example:

The        Determiner
man        Noun
who        RelativePronoun
came       Verb
to         Preposition
us         AccusativePluralPronoun
is         CopulaPresent
an         Determiner
engineer   Noun

What we usually do is train a Part-of-Speech tagger and then test it on an already tagged (gold standard) test set. After running the tagger, we get something like this, with the gold tag in the second column and the tagger's tag in the third:

The        Determiner                Determiner
man        Noun                      PresentVerb
who        RelativePronoun           RelativePronoun
came       Verb                      Verb
to         Preposition               Preposition
us         AccusativePluralPronoun   AccusativePluralPronoun
is         CopulaPresent             CopulaPresent
an         Determiner                Determiner
engineer   Noun                      Noun

As can be seen from the above, the POS tagger assigned the wrong part of speech to the word "man". Because every output line corresponds to a gold standard line, the tagger accuracy is easy to calculate: 8 out of 9 tags are correct (88.9%).

Swahili is a morphologically complex language. The same sentence above is usually written as:

theman whocametous isanengineer

This means that we should run a word segmenter before running the POS tagger. The word segmenter of course makes mistakes, which will affect the accuracy of the POS tagger.
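For reference, the aligned case described above can be sketched in a few lines of Python. This is only an illustration of the 8-out-of-9 calculation; the list name `rows` and the in-memory layout (word, gold tag, tagger tag) are assumptions made for the example, not part of any existing tool.

```python
# Each row is (word, gold_tag, tagger_tag), copied from the English example.
# The in-memory layout is an assumption made for this sketch.
rows = [
    ("The", "Determiner", "Determiner"),
    ("man", "Noun", "PresentVerb"),
    ("who", "RelativePronoun", "RelativePronoun"),
    ("came", "Verb", "Verb"),
    ("to", "Preposition", "Preposition"),
    ("us", "AccusativePluralPronoun", "AccusativePluralPronoun"),
    ("is", "CopulaPresent", "CopulaPresent"),
    ("an", "Determiner", "Determiner"),
    ("engineer", "Noun", "Noun"),
]

# Count the rows where the tagger agrees with the gold standard.
correct = sum(1 for _word, gold_tag, tagger_tag in rows if gold_tag == tagger_tag)
accuracy = correct / len(rows)

print("%d out of %d correct (%.1f%%)" % (correct, len(rows), 100 * accuracy))
```

This only works while the two files stay line-aligned, which is exactly what breaks once a segmenter is involved.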
We get an output like the following, where the second word (sic) is ill-segmented:

# Segmenter + POS Tagger output file
the        Determiner
man        Noun
whocame    Noun
to         Preposition
us         AccusativePluralPronoun
is         CopulaPresent
an         Determiner
engineer   Noun

Now, how can I measure the accuracy of this output file against the gold standard file below, given that the line alignment is lost every time the segmenter makes a mistake, which happens at a rate of about 15 per 1000 words?

# Gold Standard File
The        Determiner
man        Noun
who        RelativePronoun
came       Verb
to         Preposition
us         AccusativePluralPronoun
is         CopulaPresent
an         Determiner
engineer   Noun

Please note that the output file is usually in the range of 100,000 words.

--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington
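One possible way to compare the two files without line alignment is to locate every token by its character offsets in the unsegmented text, then count a predicted token as correct only if the gold standard has a token with the same offsets and the same tag. The sketch below is a minimal illustration of that idea, not an established tool. It assumes that both files contain exactly the same characters once the word boundaries are removed (only token lengths are used, so "The" versus "the" does not matter) and that each file holds one "word tag" pair per line. The function names `read_tagged`, `spans` and `evaluate` are hypothetical. The gold list is the nine-token analysis of the English example.

```python
def read_tagged(path):
    """Read (word, tag) pairs, assuming one 'word tag' pair per line.

    The file layout is an assumption; blank lines and '#' lines are skipped.
    """
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            word, tag = line.split()  # expects exactly two fields
            pairs.append((word, tag))
    return pairs


def spans(pairs):
    """Turn (word, tag) pairs into a set of (start, end, tag) triples.

    Offsets count characters in the text with word boundaries removed.
    """
    result = set()
    position = 0
    for word, tag in pairs:
        end = position + len(word)
        result.add((position, end, tag))
        position = end
    return result


def evaluate(gold, predicted):
    """Return (precision, recall, f1) for joint segmentation and tagging."""
    gold_spans = spans(gold)
    predicted_spans = spans(predicted)
    correct = len(gold_spans & predicted_spans)
    precision = correct / len(predicted_spans) if predicted_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * correct / (len(gold_spans) + len(predicted_spans)) if correct else 0.0
    return precision, recall, f1


gold = [
    ("The", "Determiner"), ("man", "Noun"), ("who", "RelativePronoun"),
    ("came", "Verb"), ("to", "Preposition"), ("us", "AccusativePluralPronoun"),
    ("is", "CopulaPresent"), ("an", "Determiner"), ("engineer", "Noun"),
]
predicted = [
    ("the", "Determiner"), ("man", "Noun"), ("whocame", "Noun"),
    ("to", "Preposition"), ("us", "AccusativePluralPronoun"),
    ("is", "CopulaPresent"), ("an", "Determiner"), ("engineer", "Noun"),
]

precision, recall, f1 = evaluate(gold, predicted)
print("precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))
```

On this example 7 predicted tokens match the gold standard, out of 8 predicted and 9 gold tokens, so precision is 87.5% and recall is 77.8%. A single segmentation error costs the tokens it touches but does not shift everything after it, and the comparison is a single pass over each file, which should be manageable for files of around 100,000 words. For real data, the two lists would come from something like `read_tagged("gold.txt")` and `read_tagged("output.txt")`, where both file names are placeholders.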
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor