Use of Levenshtein distance to find similar words

Margi Patel Sun, 16 Mar 2014 11:41:09 -0700

Hello Professor Mattmann,

I have completed the basic requirements of the Tika assignment (without the
OCR quality check) and now I want to go for the extra edit part. I plan to
use the Levenshtein distance implementation in Apache's
commons-lang3-3.1.jar.


I tried the following:
---------------------------
After extracting all of the text from each PDF file, I need to compute the
Levenshtein distance between each keyword in my set of 11 keywords and the
extracted text.
Since the extracted text is a very long string, I split it on the newline
character ("\n"). For each line, I compute the edit distance against each
keyword, keeping the threshold very low.
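
For illustration, the per-line comparison described above could look roughly
like the sketch below. The class and method names are hypothetical, and a
plain dynamic-programming edit distance stands in for the commons-lang3 call
(StringUtils.getLevenshteinDistance) so that the snippet compiles without
the jar. It is a sketch under these assumptions, not the assignment code.

```java
import java.util.ArrayList;
import java.util.List;

public class LineMatcher {

    // Two-row dynamic-programming Levenshtein distance
    // (stand-in for the commons-lang3 implementation).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the lines whose distance to the keyword is within the threshold.
    static List<String> matchingLines(String text, String keyword, int threshold) {
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (levenshtein(line.trim(), keyword) <= threshold) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

Because a whole line is compared against a single keyword, a line that
contains the keyword among other words still scores a large distance and is
rejected by a low threshold.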

However, this does not seem to be the correct approach, since the extracted
text contains a good amount of junk characters due to OCR noise and errors.
I need to do some pre-processing on the extracted text first.
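
One possible form of such pre-processing, sketched under assumptions that
are not part of the assignment: the keywords are plain ASCII, case does not
matter, and tokens shorter than three characters are treated as noise. The
class name, method name, and length cutoff are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OcrPreprocessor {

    // Lower-cases the text, replaces every run of characters other than
    // a-z and 0-9 with a single space, and returns the remaining tokens.
    static List<String> tokens(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^a-z0-9]+", " ")
                             .trim();
        List<String> out = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return out;
        }
        for (String token : cleaned.split(" ")) {
            // Assumption: very short tokens are more likely noise than keywords.
            if (token.length() >= 3) {
                out.add(token);
            }
        }
        return out;
    }
}
```

Each resulting token could then be compared with each keyword individually,
rather than comparing a keyword with a whole line.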

Pointers toward the right approach would help greatly.

Thanks!
-Margi
