[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

2009-03-24 Thread عماد نوفل
Evaluating Swahili Part of Speech Tagging. How can I write a Python script
for that?
# The information provided herein about Swahili may not be accurate
# it is just intended to illustrate the problem

Hi Tutors,
I would appreciate it if you gave me ideas about how to tackle this problem.


Assigning POS tags to words is a major step in many linguistic analyses.
POS tags give the grammatical category of words, for example:

The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

What we usually do is train a Part-of-Speech Tagger, and then test it on an
already tagged (gold standard) test set. After running the tagger, we get
something like this:

The Determiner Determiner
man Noun PresentVerb
who RelativePronoun RelativePronoun
came Verb Verb
to Preposition Preposition
us AccusativePluralPronoun AccusativePluralPronoun
is CopulaPresent CopulaPresent
an Determiner Determiner
engineer Noun Noun

As can be seen above, the POS tagger assigned the wrong part of speech
to the word "man". Because the lines are aligned, it is easy to calculate
the POS tagger accuracy: 8 out of 9 tags are correct (88.9%).

Swahili is a morphologically complex language. The same sentence above is
usually written as:

theman whocametous isanengineer

This means that we should run a word segmenter before running the POS
tagger. The word segmenter of course makes mistakes, which will affect the
accuracy of the POS tagger.
We get an output like the following, where the second word (sic) is
ill-segmented:

# Segmenter + POS Tagger output file
the Determiner
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Now, how can I measure the accuracy of this output file against the gold
standard file below, given that the line alignment is lost every time the
segmenter makes a mistake, which happens at a rate of 15 per 1,000 words?

# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Please note that the output file is usually in the range of 100,000 words

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.محمد
الغزالي
No victim has ever been more repressed and alienated than the truth

Emad Soliman Nawfal
Indiana University, Bloomington

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

2009-03-24 Thread عماد نوفل
Evaluating Swahili Part of Speech Tagging. How can I write a Python script
for that?
# The information provided herein about Swahili may not be accurate
# it is just intended to illustrate the problem
# The first message had an error. Sorry for that

Hi Tutors,
I would appreciate it if you gave me ideas about how to tackle this problem.

Assigning POS tags to words is a major step in many linguistic analyses.
POS tags give the grammatical category of words, for example:

The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

What we usually do is train a Part-of-Speech Tagger, and then test it on an
already tagged (gold standard) test set. After running the tagger, we get
something like this:

The Determiner Determiner
man Noun PresentVerb
who RelativePronoun RelativePronoun
came Verb Verb
to Preposition Preposition
us AccusativePluralPronoun AccusativePluralPronoun
is CopulaPresent CopulaPresent
an Determiner Determiner
engineer Noun Noun

As can be seen above, the POS tagger assigned the wrong part of speech
to the word "man". Because the lines are aligned, it is easy to calculate
the POS tagger accuracy: 8 out of 9 tags are correct (88.9%).
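
For the aligned case, the count can be sketched in a few lines of Python. This is only an illustration: the function name `tagging_accuracy` and the in-memory list of (word, gold tag, predicted tag) triples are assumptions, not part of any existing tool.

```python
def tagging_accuracy(rows):
    """Return (correct, total, accuracy) for aligned (word, gold, predicted) triples."""
    rows = list(rows)
    correct = sum(1 for _word, gold, predicted in rows if gold == predicted)
    return correct, len(rows), correct / len(rows)


# The example sentence: gold tag first, tagger output second.
rows = [
    ("The", "Determiner", "Determiner"),
    ("man", "Noun", "PresentVerb"),
    ("who", "RelativePronoun", "RelativePronoun"),
    ("came", "Verb", "Verb"),
    ("to", "Preposition", "Preposition"),
    ("us", "AccusativePluralPronoun", "AccusativePluralPronoun"),
    ("is", "CopulaPresent", "CopulaPresent"),
    ("an", "Determiner", "Determiner"),
    ("engineer", "Noun", "Noun"),
]

correct, total, accuracy = tagging_accuracy(rows)
print(f"{correct} of {total} correct ({accuracy:.1%})")
```

This only works because line i of the gold file and line i of the tagger output describe the same word.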

Swahili is a morphologically complex language. The same sentence above is
usually written as:

theman whocametous isanengineer

This means that we should run a word segmenter before running the POS
tagger. The word segmenter of course makes mistakes, which will affect the
accuracy of the POS tagger.
We get an output like the following, where the second word (sic) is
ill-segmented:

# Segmenter + POS Tagger output file
the Determiner
man Noun
whocame Noun
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Now, how can I measure the accuracy of this output file against the gold
standard file below, given that the line alignment is lost every time the
segmenter makes a mistake, which happens at a rate of 15 per 1,000 words?

# Gold Standard File
The Determiner
man Noun
who RelativePronoun
came Verb
to Preposition
us AccusativePluralPronoun
is CopulaPresent
an Determiner
engineer Noun

Please note that the output file is usually in the range of 100,000 words
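
One possible approach is to compare character offsets instead of line numbers. If the segmenter only moves boundaries and never changes characters, both files spell out the same character string, so each segment can be identified by its (start, end) span in that string. The sketch below rests on several assumptions: the names `read_pairs`, `char_spans` and `evaluate` are hypothetical, each file is assumed to hold one `segment tag` pair per line with `#` comment lines, and the sample gold list assumes all nine segments of the example sentence, including `came Verb`. It reports precision and recall rather than a single accuracy figure, because the two files can contain different numbers of segments.

```python
def read_pairs(path):
    """Read 'segment tag' lines, skipping blank lines and '#' comments."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            segment, tag = line.split()
            pairs.append((segment, tag))
    return pairs


def char_spans(pairs):
    """Map each segment's (start, end) character span to its tag."""
    spans = {}
    position = 0
    for segment, tag in pairs:
        end = position + len(segment)
        spans[(position, end)] = tag
        position = end
    return spans


def evaluate(gold_pairs, output_pairs):
    """Score the output against the gold standard by character span."""
    gold_text = "".join(segment for segment, _ in gold_pairs).lower()
    output_text = "".join(segment for segment, _ in output_pairs).lower()
    if gold_text != output_text:
        # Assumption: case may differ, but the characters themselves may not.
        raise ValueError("the two files do not spell out the same text")
    gold = char_spans(gold_pairs)
    output = char_spans(output_pairs)
    same_span = gold.keys() & output.keys()  # correctly segmented
    same_tag = [span for span in same_span if gold[span] == output[span]]
    return {
        "seg_precision": len(same_span) / len(output),
        "seg_recall": len(same_span) / len(gold),
        "tag_precision": len(same_tag) / len(output),
        "tag_recall": len(same_tag) / len(gold),
    }


gold_pairs = [
    ("The", "Determiner"), ("man", "Noun"), ("who", "RelativePronoun"),
    ("came", "Verb"), ("to", "Preposition"),
    ("us", "AccusativePluralPronoun"), ("is", "CopulaPresent"),
    ("an", "Determiner"), ("engineer", "Noun"),
]
output_pairs = [
    ("the", "Determiner"), ("man", "Noun"), ("whocame", "Noun"),
    ("to", "Preposition"), ("us", "AccusativePluralPronoun"),
    ("is", "CopulaPresent"), ("an", "Determiner"), ("engineer", "Noun"),
]

scores = evaluate(gold_pairs, output_pairs)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On this example, 7 of the 8 output segments match a gold segment with the same tag, and 7 of the 9 gold segments are recovered. Each file is read once and the spans are compared through dictionary lookups, so the work grows linearly with file size, which should be manageable for a file of about 100,000 words. With real files, `evaluate(read_pairs(gold_path), read_pairs(output_path))` would be the call, where both paths are placeholders.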

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.محمد
الغزالي
No victim has ever been more repressed and alienated than the truth

Emad Soliman Nawfal
Indiana University, Bloomington



Re: [Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

2009-03-24 Thread Eduardo Vieira
2009/3/24 Emad Nawfal (عماد نوفل) emadnaw...@gmail.com:
 Evaluating Swahili Part of Speech Tagging. How can I write a Python script
 for that?
 # The information provided herein about Swahili may not be accurate
 # it is just intended to illustrate the problem

Hello, Mr. Emad! Have you checked NLTK (Natural Language Toolkit,
http://www.nltk.org ), a Python package for linguistics applications?
Maybe they have something already implemented. I actually liked their
tutorials about Python and using Python for linguistics a lot. Very
good explanations.


Re: [Tutor] Evaluating Swahili Part of Speech Tagging. How can I write a Python script for that?

2009-03-24 Thread عماد نوفل
2009/3/24 Emad Nawfal (عماد نوفل) emadnaw...@gmail.com



 2009/3/24 Eduardo Vieira eduardo.su...@gmail.com

 2009/3/24 Emad Nawfal (عماد نوفل) emadnaw...@gmail.com:
  Evaluating Swahili Part of Speech Tagging. How can I write a Python
 script
  for that?
  # The information provided herein about Swahili may not be accurate
  # it is just intended to illustrate the problem
 
 Hello, Mr. Emad! Have you checked NLTK (Natural Language Toolkit,
 http://www.nltk.org ), a Python package for linguistics applications?
 Maybe they have something already implemented. I actually liked their
 tutorials about Python and using Python for linguistics a lot. Very
 good explanations.



 I have checked NLTK, and it does not seem to have anything like this.
 Thanks for the suggestion, though.

 --
 لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.محمد
 الغزالي
 No victim has ever been more repressed and alienated than the truth

 Emad Soliman Nawfal
 Indiana University, Bloomington
 



Thanks James,
I'm using the TnT POS tagger, and I treat it as a black box; otherwise I
would have to write my own, which is a huge task.
The segmenter I use is home-grown, and it is supposedly the best available.
I used to evaluate on whole words, and this was easy. After segmentation
and tagging, I recombined the segments of each word, and this
eliminated the discrepancy in alignment. For example, I would have an output
like this:

the+man Det+Noun the+man Det+Noun
who+came+to+us tag whocame+to+us wrongTag
It is easy to do it this way if you use a WORD_END_DELIMITER, but this is
very tedious, and you have to recalculate the segment accuracy.
I'm looking for something smarter than this.
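
As a rough sketch of the word-level comparison described above, the snippet below scores rows of the form (gold word, gold tags, output word, output tags), with `+` joining the segments of each word. The names `segment_spans` and `score_words` and the abbreviated tag names in the sample rows are hypothetical, chosen only for illustration. Because each segment is stored with its offsets inside the word, the segment-level counts come out of the same pass and do not need a separate run.

```python
def segment_spans(word, tags):
    """Return the set of (start, end, tag) triples for a '+'-joined word."""
    segments = word.split("+")
    tag_list = tags.split("+")
    if len(segments) != len(tag_list):
        raise ValueError(f"{word!r} and {tags!r} have different lengths")
    spans = set()
    position = 0
    for segment, tag in zip(segments, tag_list):
        spans.add((position, position + len(segment), tag))
        position += len(segment)
    return spans


def score_words(rows):
    """Count fully correct words and matching segments in one pass."""
    words = words_correct = hits = gold_total = output_total = 0
    for gold_word, gold_tags, output_word, output_tags in rows:
        gold = segment_spans(gold_word, gold_tags)
        output = segment_spans(output_word, output_tags)
        words += 1
        words_correct += gold == output  # same boundaries and same tags
        hits += len(gold & output)
        gold_total += len(gold)
        output_total += len(output)
    return {
        "word_accuracy": words_correct / words,
        "segment_precision": hits / output_total,
        "segment_recall": hits / gold_total,
    }


# Hypothetical tags standing in for "tag" and "wrongTag" in the example.
rows = [
    ("the+man", "Det+Noun", "the+man", "Det+Noun"),
    ("who+came+to+us", "RelPro+Verb+Prep+Pro",
     "whocame+to+us", "Noun+Prep+Pro"),
]

result = score_words(rows)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Here one of the two words is fully correct, while 4 segments match out of 5 in the output and 6 in the gold standard, so a single segmentation error costs the whole word at word level but only part of it at segment level.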

-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.محمد
الغزالي
No victim has ever been more repressed and alienated than the truth

Emad Soliman Nawfal
Indiana University, Bloomington
