I was thinking of a lexicon, too. Build a lexicon from known-good text and use that as a check.
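A rough sketch of that lexicon check, written in Python rather than XQuery purely for illustration. The function names, the tokenizing regex, and the idea of reporting the share of tokens missing from the lexicon are all assumptions of this sketch, not anything MarkLogic provides.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters/digits, with optional
# internal apostrophes. Real OCR text may need something more careful.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return WORD_RE.findall(text.lower())


def build_lexicon(known_good_text):
    """Collect the set of distinct words seen in known-good text."""
    return set(tokenize(known_good_text))


def unknown_word_rate(ocr_text, lexicon):
    """Return (fraction of tokens not in the lexicon, {unknown word: count}).

    Each distinct word is looked up only once, however often it occurs.
    """
    counts = Counter(tokenize(ocr_text))
    total = sum(counts.values())
    if total == 0:
        return 0.0, {}
    unknown = {word: n for word, n in counts.items() if word not in lexicon}
    return sum(unknown.values()) / total, unknown
```

The rate is only an upper bound on the error rate: correct words that happen to be missing from the known-good text (names, rare terms) are counted as unknown too.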
A different approach is to use corpus statistics. Text with errors will have more singleton words than will correct text. Compile the word frequencies for known good text, organize by rank, and find the slope of the log-log line. Text with errors should have a longer tail and thus a flatter slope.

Or you could just count the number of distinct words that only occur once or twice in the corpus. You could probably use the frequency information from a lexicon to do that.

http://en.wikipedia.org/wiki/Zipf's_law

wunder
==
Walter Underwood
[email protected]

On Mar 22, 2011, at 11:41 AM, Danny Sokolsky wrote:

Hi Greg,

Checking each word in an OCR'ed element seems reasonable to me. I guess it depends how many words you have, but I would guess this would perform pretty decently up to a point (given that you do need to check every word). Did you try this and it did not work well?

Another idea is you can create a word lexicon, then check each word in the word lexicon against a dictionary. This will make it so you don't have to check duplicate words more than once. Then, you can take all of the words that are not in the dictionary and that can give you an upper bound of possibly misspelled words. I have not tried this, but it seems like it should work.

-Danny

-----Original Message-----
From: [email protected] On Behalf Of Murray, Gregory
Sent: Tuesday, March 22, 2011 7:50 AM
To: General MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Spell checking an entire document

The spell:is-correct() function checks a single word against a given dictionary. It's clearly intended for "Did you mean?" kinds of features. My use case is different: I want to run a spell check across an entire element, namely an element containing the uncorrected OCR text from an entire digitized book, so I can get a rough idea of the error rate. Can anyone suggest a good approach?

Do I need to tokenize all the words in the element and then loop over them, checking each word one by one with spell:is-correct? Or is there a better way?

Thanks,
Greg

Gregory Murray
Digital Library Application Developer
Princeton Theological Seminary Library
[email protected]

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
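For the corpus-statistics suggestion in this thread (the slope of the log-log rank/frequency line, and the count of words that occur only once or twice), here is a minimal Python sketch. The function names, the tokenizing regex, and the plain least-squares fit are assumptions made for illustration; whether the slope really shifts the way described would need checking against real OCR data.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens (assumed tokenizer, same caveats as any regex)."""
    return Counter(re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower()))


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Words are ranked by descending frequency, rank 1 being the most common.
    An ideal Zipf distribution (frequency proportional to 1/rank) gives -1.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def rare_word_fraction(counts, max_count=2):
    """Fraction of distinct words occurring at most max_count times."""
    if not counts:
        return 0.0
    rare = sum(1 for n in counts.values() if n <= max_count)
    return rare / len(counts)
```

One way to use it: compute both numbers for a known-good corpus and for the OCR text, then compare them. The absolute values depend heavily on corpus size, so the comparison is more meaningful between texts of similar length.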
