[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

# The information provided herein about Swahili may not be accurate
# it is just intended to illustrate the problem

Hi Tutors,
I would appreciate it if you gave me ideas about how to tackle this problem.

Assigning POS tags to words is a major step in many linguistic analyses. POS tags give the grammatical category of words, for example:

The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

What we usually do is train a Part-of-Speech Tagger, and then test it on an already tagged (gold standard) test set. After running the tagger, we get something like this (word, gold tag, tagger output):

The Determiner Determiner
man Noun PresentVerb
who RelativePronoun RelativePronoun
came Verb Verb
to Preposition Preposition
us AccusativePluralPronoun AccusativePluralPronoun
is CopulaPresent CopulaPresent
an Determiner Determiner
engineer Noun Noun

As can be seen above, the POS tagger assigned the wrong part of speech to the word "man". Because the two columns are aligned line by line, it is easy to calculate the POS tagger accuracy: 8 out of 9 tags are correct (88.9%).

Swahili is a morphologically complex language. The same sentence above is usually written as:

theman whocametous isanengineer

This means that we should run a word segmenter before running the POS tagger. The word segmenter of course makes mistakes, which will affect the accuracy of the POS tagger.
We get an output like the following, where the second word (sic) is ill-segmented:

# Segmenter + POS Tagger output file
the Determiner
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Now, how can I measure the accuracy of this output file against the gold standard file below, given that the line alignment is lost every time the segmenter makes a mistake, which happens at a rate of 15 per 1000 words?

# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Please note that the output file is usually in the range of 100,000 words.

--
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.محمد الغزالي
No victim has ever been more repressed and alienated than the truth
Emad Soliman Nawfal
Indiana University, Bloomington
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

# The information provided herein about Swahili may not be accurate
# it is just intended to illustrate the problem
# The first message had an error. Sorry for that

Hi Tutors,
I would appreciate it if you gave me ideas about how to tackle this problem.

Assigning POS tags to words is a major step in many linguistic analyses. POS tags give the grammatical category of words, for example:

The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

What we usually do is train a Part-of-Speech Tagger, and then test it on an already tagged (gold standard) test set. After running the tagger, we get something like this (word, gold tag, tagger output):

The Determiner Determiner
man Noun PresentVerb
who RelativePronoun RelativePronoun
came Verb Verb
to Preposition Preposition
us AccusativePluralPronoun AccusativePluralPronoun
is CopulaPresent CopulaPresent
an Determiner Determiner
engineer Noun Noun

As can be seen above, the POS tagger assigned the wrong part of speech to the word "man". Because the two columns are aligned line by line, it is easy to calculate the POS tagger accuracy: 8 out of 9 tags are correct (88.9%).

Swahili is a morphologically complex language. The same sentence above is usually written as:

theman whocametous isanengineer

This means that we should run a word segmenter before running the POS tagger. The word segmenter of course makes mistakes, which will affect the accuracy of the POS tagger.
We get an output like the following, where the second word (sic) is ill-segmented:

# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Now, how can I measure the accuracy of this output file against the gold standard file below, given that the line alignment is lost every time the segmenter makes a mistake, which happens at a rate of 15 per 1000 words?

# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Please note that the output file is usually in the range of 100,000 words.

--
Emad Soliman Nawfal
Indiana University, Bloomington
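One possible way to approach the question above is to stop comparing line numbers and compare character offsets instead. The sketch below is not from the thread. It assumes that both files hold one "token tag" pair per line, and that the segmenter only inserts boundaries without changing any characters, so the gold and output files spell out the same text. Each token is mapped to its (start, end) offsets in that text; an output token counts as correct only if the gold file has a token with the same offsets and the same tag. The function names (`read_tagged`, `spans`, `evaluate`) are made up for illustration.

```python
def read_tagged(lines):
    """Parse 'token tag' lines into (token, tag) pairs.

    Blank lines and lines starting with '#' are skipped. A line with
    anything other than two whitespace-separated fields raises ValueError.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        token, tag = line.split()
        pairs.append((token, tag))
    return pairs


def spans(tagged):
    """Return ({(start, end): tag}, total_length) for (token, tag) pairs."""
    table = {}
    position = 0
    for token, tag in tagged:
        table[(position, position + len(token))] = tag
        position += len(token)
    return table, position


def evaluate(gold, output):
    """Score segmentation + tagging output against the gold standard."""
    gold_spans, gold_length = spans(gold)
    output_spans, output_length = spans(output)
    if gold_length != output_length:
        # Assumption violated: the two files do not cover the same text.
        raise ValueError("gold and output differ in total character count")

    # Tokens whose boundaries agree with the gold segmentation.
    matched = sum(1 for span in output_spans if span in gold_spans)
    # Of those, tokens that also carry the gold tag.
    correct = sum(1 for span, tag in output_spans.items()
                  if gold_spans.get(span) == tag)

    precision = correct / len(output_spans)
    recall = correct / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return {"matched_segments": matched, "correct_tags": correct,
            "precision": precision, "recall": recall, "f1": f1}


gold = read_tagged("""
# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
""".splitlines())

output = read_tagged("""
# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun
""".splitlines())

scores = evaluate(gold, output)
print(scores["correct_tags"], len(output), len(gold))    # 7 8 9
print(round(scores["precision"], 3), round(scores["recall"], 3),
      round(scores["f1"], 3))                            # 0.875 0.778 0.824
```

Because a segmentation error changes the number of tokens, a single accuracy figure is no longer well defined; precision (correct tokens over output tokens) and recall (correct tokens over gold tokens) are reported instead. The work is one pass over each file plus dictionary lookups, so a 100,000-word file should not be a problem. If the segmenter also normalizes or alters characters, the offsets drift and this sketch would need a different alignment step.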
Re: [Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
2009/3/24 Emad Nawfal (عماد نوفل) emadnaw...@gmail.com:
> Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
> # The information provided herein about Swahili may not be accurate
> # it is just intended to illustrate the problem

Hello, Mr. Emad!
Have you checked NLTK (Natural Language Toolkit, http://www.nltk.org), a Python package for linguistics applications? Maybe they have something already implemented. I really liked their tutorials on Python and on using Python for linguistics. Very good explanations.
Re: [Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
2009/3/24 Emad Nawfal (عماد نوفل) emadnaw...@gmail.com
> 2009/3/24 Eduardo Vieira eduardo.su...@gmail.com
>> 2009/3/24 Emad Nawfal (عماد نوفل) emadnaw...@gmail.com:
>>> Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?
>>> # The information provided herein about Swahili may not be accurate
>>> # it is just intended to illustrate the problem
>> Hello, Mr. Emad!
>> Have you checked the NLTK (Natural Language Toolkit - http://www.nltk.org ) a Python package for Linguistics applications? Maybe they have something already implemented. I actually liked a lot their tutorials about python and using pythons for Linguistics. Very good explanations.
> I have checked the NLTK, and it does not seem to have something like this. Thanks for the suggestion though

Thanks James,
I'm using the TnT POS Tagger, and I treat it as a black box; otherwise I would have to write my own, which is a huge task. The segmenter I use is home-grown, and it is supposedly the best available.

I used to evaluate on whole words, and this was easy. After the segmentation and tagging, I combined the various segments of each word, and this eliminated the discrepancy in alignment. For example, I would have an output like this (gold word and tag, then output word and tag):

the+man Det+Noun the+man Det+Noun
who+came+to+us tag whocame+to+us wrongTag

It is easy to do it this way if you use a WORD_END_DELIMITER, but this is very tedious, and you have to recalculate the segment accuracy. I'm looking for something smarter than this.

--
Emad Soliman Nawfal
Indiana University, Bloomington
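For reference, the whole-word evaluation described in this message could look roughly like the sketch below. It is an illustration only, under assumptions the thread does not confirm: that the word-end marker appears as its own token in both streams, and that it is spelled `WORD_END_DELIMITER`. The helper names `join_words` and `word_accuracy` are hypothetical.

```python
WORD_END = "WORD_END_DELIMITER"  # assumed marker; its real form is not shown in the thread


def join_words(tagged, word_end=WORD_END):
    """Recombine (segment, tag) pairs into (word, tags) pairs joined with '+'."""
    words = []
    segments, tags = [], []
    for token, tag in tagged:
        if token == word_end:
            if segments:
                words.append(("+".join(segments), "+".join(tags)))
            segments, tags = [], []
        else:
            segments.append(token)
            tags.append(tag)
    if segments:  # last word if the stream does not end with a marker
        words.append(("+".join(segments), "+".join(tags)))
    return words


def word_accuracy(gold, output):
    """Fraction of words whose segmentation and tags both match the gold."""
    gold_words = join_words(gold)
    output_words = join_words(output)
    if len(gold_words) != len(output_words):
        raise ValueError("word counts differ; a word-end marker is missing")
    correct = sum(1 for g, o in zip(gold_words, output_words) if g == o)
    return correct / len(gold_words)


END = (WORD_END, None)
gold = [("the", "Det"), ("man", "Noun"), END,
        ("who", "RelPron"), ("came", "Verb"), ("to", "Prep"), ("us", "Pron"), END,
        ("is", "Cop"), ("an", "Det"), ("engineer", "Noun"), END]
output = [("the", "Det"), ("man", "Noun"), END,
          ("whocame", "Noun"), ("to", "Prep"), ("us", "Pron"), END,
          ("is", "Cop"), ("an", "Det"), ("engineer", "Noun"), END]

print(join_words(output)[1])        # ('whocame+to+us', 'Noun+Prep+Pron')
print(word_accuracy(gold, output))  # 2 of 3 words match
```

As the message notes, this scores whole words only: one bad segment marks the entire word wrong, so segment-level accuracy has to be computed separately.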