Re: [MarkLogic Dev General] Spell checking an entire document

Danny Sokolsky Tue, 22 Mar 2011 11:41:24 -0700

Hi Greg,

Checking each word in an OCR'ed element seems reasonable to me.  How well it 
performs depends on how many words you have, but I would guess it performs 
decently up to a point (given that you do need to check every word).  Did you 
try this and find that it did not work well?
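
To make the per-word approach concrete, here is a minimal sketch, written in 
Python rather than XQuery so it can be run anywhere.  Everything in it is an 
assumption for illustration: DICTIONARY, tokenize, and error_rate are 
hypothetical names, the set lookup stands in for a spell:is-correct call, and 
the regular expression is a much cruder tokenizer than the server would use.

```python
import re

# Hypothetical stand-in dictionary. In MarkLogic, the membership test below
# would instead be a spell:is-correct call against a real dictionary.
DICTIONARY = {"the", "quick", "brown", "fox"}


def tokenize(text):
    """Split text into lowercase word tokens (ASCII letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def error_rate(text, dictionary):
    """Return the fraction of word tokens that are not in the dictionary."""
    words = tokenize(text)
    if not words:
        return 0.0
    # Every token is checked, so a repeated word is looked up repeatedly.
    misses = sum(1 for word in words if word not in dictionary)
    return misses / len(words)
```

For example, error_rate("The quick brovvn fox", DICTIONARY) flags one of the 
four tokens.  The cost grows with the total number of tokens in the element.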


Another idea is to create a word lexicon, then check each word in the lexicon 
against a dictionary.  Because a lexicon holds each distinct word only once, 
you never check the same word more than once.  The words that are not in the 
dictionary then give you an upper bound on the number of possibly misspelled 
words (an upper bound because correctly spelled words that are missing from 
the dictionary, such as proper nouns, are flagged too).  I have not tried 
this, but it seems like it should work.
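
The distinct-word idea can be sketched as follows, again in Python rather than 
XQuery and under loud assumptions: a Counter stands in for the word lexicon, a 
plain set stands in for the spelling dictionary, and word_counts, 
suspect_words, and suspect_rate are hypothetical names, not MarkLogic 
functions.

```python
import re
from collections import Counter


def word_counts(text):
    """Count occurrences of each distinct lowercase word (lexicon stand-in)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def suspect_words(text, dictionary):
    """Map each distinct word missing from the dictionary to its frequency."""
    counts = word_counts(text)
    # Each distinct word is checked once, however often it occurs.
    return {word: n for word, n in counts.items() if word not in dictionary}


def suspect_rate(text, dictionary):
    """Return the fraction of all word tokens that are suspect words."""
    total = sum(word_counts(text).values())
    if total == 0:
        return 0.0
    return sum(suspect_words(text, dictionary).values()) / total
```

Keeping the frequency of each suspect word lets you turn the list of distinct 
suspects back into a per-token rate without rechecking any word.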

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Murray, Gregory
Sent: Tuesday, March 22, 2011 7:50 AM
To: General MarkLogic Developer Discussion
Subject: [MarkLogic Dev General] Spell checking an entire document

The spell:is-correct() function checks a single word against a given 
dictionary. It's clearly intended for "Did you mean?" kinds of features. My use 
case is different: I want to run a spell check across an entire element, namely 
an element containing the uncorrected OCR text from an entire digitized book, 
so I can get a rough idea of the error rate.

Can anyone suggest a good approach? Do I need to tokenize all the words in the 
element and then loop over them, checking each word one by one with 
spell:is-correct? Or is there a better way?

Thanks,
Greg


Gregory Murray
Digital Library Application Developer
Princeton Theological Seminary Library
[email protected]

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
