Hello Professor Mattmann,

I have completed the basic requirements of the Tika assignment (without the
OCR quality check) and now I want to attempt the extra credit part. I plan to
use the Levenshtein distance implemented in Apache's commons-lang3-3.1.jar.

I tried the following:
---------------------------
After extracting all of the text from each PDF file, I need to find the
Levenshtein distance between each keyword in my set of 11 keywords and the
extracted text.
Since the extracted text is a very long string, I split it on the newline
character ("\n"). For each line, I compute the edit distance against each
keyword, keeping the threshold very low.
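
A minimal sketch of the line-by-line attempt described above, under some
assumptions: the class and method names are hypothetical, and the
hand-written levenshtein() is only a stand-in so the example runs without
the jar (in the real code, StringUtils.getLevenshteinDistance from
commons-lang3 would take its place).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names are illustrative, not part of Tika or commons-lang3.
class LineKeywordMatcher {

    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    // Stand-in for StringUtils.getLevenshteinDistance so the sketch is self-contained.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Splits the extracted text on "\n" and keeps every line whose distance
    // to the keyword is at most the threshold.
    static List<String> matchingLines(String text, String keyword, int threshold) {
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (levenshtein(line, keyword) <= threshold) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

Because the whole line is compared with a single keyword, a line that
contains the keyword along with other words still has a distance at least
as large as the number of extra characters, so a low threshold only accepts
lines that consist of little more than the keyword itself.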

However, this does not seem to be the right approach, because the extracted
text contains a good deal of junk characters caused by OCR noise and errors.
I need to do some pre-processing on the extracted text first.
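
One possible pre-processing step, sketched here purely as an assumption
rather than as the assignment's expected method: lowercase the text, replace
every run of characters outside a-z and 0-9 with a space, and compare the
keyword against individual tokens instead of whole lines. All names are
hypothetical, the character filter assumes English text, and levenshtein()
is again a stand-in for the commons-lang3 call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; names are illustrative, not part of Tika or commons-lang3.
class OcrTokenMatcher {

    // Lowercases the text, turns every run of non-alphanumeric characters
    // into a single space, and returns the remaining tokens.
    static List<String> tokens(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
        List<String> out = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return out; // nothing but junk characters
        }
        for (String token : cleaned.split(" ")) {
            out.add(token);
        }
        return out;
    }

    // Two-row dynamic-programming Levenshtein distance (stand-in for
    // StringUtils.getLevenshteinDistance).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if any cleaned token is within the threshold of the keyword.
    static boolean containsApprox(String text, String keyword, int threshold) {
        String target = keyword.toLowerCase(Locale.ROOT);
        for (String token : tokens(text)) {
            if (levenshtein(token, target) <= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Comparing token by token keeps the distance on the scale of a single word,
so a low threshold stays meaningful. A limitation of this sketch is that
junk characters landing in the middle of a word split it into two tokens,
which it would then miss.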

Any pointers toward the right direction or approach would be a great help.

Thanks!
-Margi


