On 12/19/06, Lars Aronsson <[EMAIL PROTECTED]> wrote:
I'm currently trying to improve the Swedish dictionary, which is
maintained by a friend of mine, so I'm looking for ways to compare
the quality of different dictionaries, and various methods used
for maintaining them. The naive approach would be to complain
"the dictionary doesn't contain words X, Y, and Z", to which the
reply would be "so, add them". However, this is a never-ending
task. The more words I add, the more I discover to be missing.
Just adding words to a dictionary is not as important as the
spell checker's ability to help its users avoid mistakes. But
how is this measured? When I compare Swedish to some other
closely related languages, I find their dictionaries are much
larger than the Swedish one, and this is one useful measure for
me. But that doesn't help me to compare the quality of the
Swedish dictionary to the one for Persian. On the page
http://wiki.services.openoffice.org/wiki/Translation_Statistics
there is an indication that 57% of the GUI for OpenOffice.org is
translated to Arabic and 62% to Persian. We could add a column in
that table to tell us the quality (from 0 to 100%) of the spelling
dictionary for each language, but how would we measure this?
Hi Lars,
I'm cross-posting to the lingucomponent list, where this topic is a
better fit, and I'd encourage any follow-up messages to move to
lingucomponent since that is where most of the dictionary
maintainers/devs hang out.
Anyway, this is a great question and something I've thought a lot
about as part of my work on developing language technology for
minority languages.
There are a few naive metrics one can use. I usually set things up this way:
(1) Given any text, you can split the words into "valid" (V) and
"invalid" (I) words, independent of your spellchecker. I define
"valid" to mean "a word that a human proofreader would correct *in the
given context*". So, given a sequence of characters like "sed", it
might be valid in one place ("Did you know sed is Turing complete?")
and not in another ("Did you know Turing sed that?"). There are no
hairs to split in most cases, of course -- something like
"misssspelling" is surely invalid (except, of course, in the message
I'm typing right now, where I definitely would not want it corrected!)
(2) Next, when you run a spellchecker on your text, it splits the same
words into "accepted" (A) words or "flagged" (F) words. So the whole
text consists of words I label as
AV=accepted and valid
AI=accepted and invalid
FV=flagged and valid
FI=flagged and invalid
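In code, the bookkeeping in (2) is just a four-way tally. Here is a
minimal sketch with stand-in predicates -- `accepts` plays the role of
the spellchecker and `is_valid` the role of the human judge; both
names, and the toy word lists, are hypothetical:

```python
from collections import Counter

def tally(words, accepts, is_valid):
    """Count AV/AI/FV/FI over a word sequence."""
    counts = Counter()
    for w in words:
        a = "A" if accepts(w) else "F"   # accepted or flagged
        v = "V" if is_valid(w) else "I"  # valid or invalid
        counts[a + v] += 1
    return counts

# Toy example: the checker's word list vs. the truly valid words.
wordlist = {"the", "cat", "sat"}             # what the checker accepts
truth = {"the", "cat", "sat", "mat"}         # what a human would accept
words = ["the", "cat", "sat", "mat", "catt"]
print(dict(tally(words, wordlist.__contains__, truth.__contains__)))
# {'AV': 3, 'FV': 1, 'FI': 1}
```

Here "mat" is a flagged-but-valid word (a dictionary gap) and "catt"
is flagged and invalid (a genuine catch).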
(3) With this notation, the standard metrics are "recall": R=AV/V
(i.e. what fraction of the valid words do you recognize) and
"precision": P=AV/A (i.e. what fraction of the recognized words are
actually valid). You see these a lot in evaluating search engines
(relevant vs. irrelevant documents) and spam filters (spam vs.
non-spam). Since they work against each other, it is useful to
combine them into a single "F-score": F=2PR/(P+R)
You can also write down recall/precision for the spellchecker's
performance at flagging invalid words: R=FI/I, P=FI/F, but I prefer
the approach above.
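The formulas in (3), in both directions, with purely illustrative
counts (not measurements of any real dictionary):

```python
# Illustrative counts only for the four categories from (2).
AV, AI, FV, FI = 9500, 50, 150, 300

V = AV + FV    # valid words
I = AI + FI    # invalid words
A = AV + AI    # accepted words
Fl = FV + FI   # flagged words

# Metrics for recognizing valid words:
R = AV / V                       # recall
P = AV / A                       # precision
F_score = 2 * P * R / (P + R)    # harmonic mean of P and R

# The alternative view: performance at flagging invalid words.
R_flag = FI / I
P_flag = FI / Fl

print(f"valid-word view:    R={R:.4f} P={P:.4f} F={F_score:.4f}")
print(f"flagging view:      R={R_flag:.4f} P={P_flag:.4f}")
```

Since the F-score is a harmonic mean, it always falls between
precision and recall, punishing whichever of the two is weaker.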
(4) To estimate these, run your spellchecker on, say, 100,000 words of
text and note which of the flagged words are valid or invalid. You
then have FV and FI, and you know the total number of accepted words
(A = 100,000 - F). The tricky part is to estimate AI. These words
are the problem children in the world of spellcheckers. One source
of AI words is misspellings in the word list, but hopefully with
care and proofreading these can be avoided or eliminated. The hard
ones are things like "right", which is invalid (usually) in the
context "right of passage" (should be "rite") but is so commonly
valid that it clearly cannot be removed from the word list. Other
cases, like the obscure word "yor" in English, should clearly not be
included, since it is most likely a misspelling of a common word.
The precision/recall measures give you a disciplined,
mathematical way to decide between including/excluding a given word,
and I've found it very useful for Irish and some other languages.
In any case, you might make the optimistic assumption that AI is very
close to 0, so precision is 100%, and recall is just A/(A+FV) - this
is a simple quality measure, easy to compute.
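Putting (4) and the optimistic assumption together with made-up
numbers: suppose the 100,000-word run flags 800 words, and
hand-checking finds 150 of them valid (FV) and 650 invalid (FI):

```python
# Made-up numbers for illustration: 100,000-word corpus, 800 flagged
# words, of which hand-checking finds 150 valid and 650 invalid.
N = 100_000
flagged = 800
FV, FI = 150, 650

A = N - flagged   # accepted words: A = 100,000 - F

# Optimistic assumption: AI ~ 0, so every accepted word is valid,
# precision is 100%, and recall reduces to A / (A + FV).
recall = A / (A + FV)
print(f"A={A}, estimated recall={recall:.5f}")
```

Note this only requires hand-checking the flagged words (800 here),
not the whole corpus, which is what makes it cheap to compute.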
There are other measures one can use, for instance evaluating the
quality of the suggestions made by the spellchecker, but I'll leave
that for another time, since it is more the responsibility of the
language-independent engine than of the dictionary.
-Kevin
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]