Re: [MarkLogic Dev General] Spell checking an entire document

Walter Underwood Tue, 22 Mar 2011 12:03:55 -0700

I was thinking of a lexicon, too. Build a lexicon from known-good text and 
check the OCR text against it.


A different approach is to use corpus statistics. Text with errors will have 
more singleton words than will correct text. Compile the word frequencies for 
known good text, organize by rank, and find the slope of the log-log line. Text 
with errors should have a longer tail and thus a flatter slope. Or you could 
just count the number of distinct words that only occur once or twice in the 
corpus. You could probably use the frequency information from a lexicon to do 
that.
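
As a rough illustration of the two measurements described above, here is a 
minimal Python sketch. It is not MarkLogic code: the function names 
(word_counts, zipf_slope, rare_word_share) and the simple regex tokenizer are 
assumptions made for this example, and in practice the counts could come from 
lexicon frequency information instead. The slope is an ordinary least-squares 
fit of log(frequency) against log(rank).

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word (hypothetical simple tokenizer)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a line")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def rare_word_share(counts, max_count=2):
    """Fraction of distinct words that occur at most max_count times."""
    if not counts:
        return 0.0
    rare = sum(1 for n in counts.values() if n <= max_count)
    return rare / len(counts)
```

Running both functions on known-good text and on the OCR text gives two pairs 
of numbers to compare. How large a difference is meaningful would have to be 
calibrated on your own corpus.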

http://en.wikipedia.org/wiki/Zipf's_law

wunder
==
Walter Underwood
[email protected]<mailto:[email protected]>

On Mar 22, 2011, at 11:41 AM, Danny Sokolsky wrote:

Hi Greg,

Checking each word in an OCR'ed element seems reasonable to me. It depends on 
how many words you have, but I would guess this performs decently up to a 
point (given that you do need to check every word). Did you try this and find 
that it did not work well?

Another idea is to create a word lexicon, then check each word in the word 
lexicon against a dictionary. That way you check each distinct word only once, 
no matter how often it occurs. The words that are not in the dictionary then 
give you an upper bound on the number of possibly misspelled words. I have not 
tried this, but it seems like it should work.
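
To make the idea concrete, here is a minimal Python sketch of checking each 
distinct word once and turning the result into a rough error rate. It is an 
illustration under assumptions, not MarkLogic code: ocr_error_estimate is a 
hypothetical name, the Counter stands in for a word lexicon, and the 
membership test against a plain set stands in for a dictionary check such as 
spell:is-correct.

```python
import re
from collections import Counter


def ocr_error_estimate(text, dictionary):
    """Check each distinct word once; return (unknown words, token error rate).

    `dictionary` is any container of lowercase known-good words (an assumption
    for this sketch; a real dictionary lookup would replace the `in` test).
    """
    # Count occurrences so that each distinct word is looked up only once.
    counts = Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))
    unknown = {word: n for word, n in counts.items() if word not in dictionary}
    total = sum(counts.values())
    # Weight unknown words by how often they occur to get a per-token rate.
    rate = sum(unknown.values()) / total if total else 0.0
    return unknown, rate
```

The rate is an upper bound in the sense described above: proper names and 
rare but valid words that are missing from the dictionary count as unknown 
even though they are not OCR errors.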

-Danny

-----Original Message-----
From: [email protected]<mailto:[email protected]> [mailto:[email protected]] On Behalf Of Murray, Gregory
Sent: Tuesday, March 22, 2011 7:50 AM
To: General MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Spell checking an entire document

The spell:is-correct() function checks a single word against a given 
dictionary. It's clearly intended for "Did you mean?" kinds of features. My use 
case is different: I want to run a spell check across an entire element, namely 
an element containing the uncorrected OCR text from an entire digitized book, 
so I can get a rough idea of the error rate.

Can anyone suggest a good approach? Do I need to tokenize all the words in the 
element and then loop over them, checking each word one by one with 
spell:is-correct? Or is there a better way?

Thanks,
Greg


Gregory Murray
Digital Library Application Developer
Princeton Theological Seminary Library
[email protected]<mailto:[email protected]>

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
