Hello Professor Mattmann,
I have completed the basic requirements of the Tika assignment (without the
OCR quality check) and now I want to go for the extra edit part. I plan to use
the Levenshtein distance implementation in Apache's commons-lang3-3.1.jar.
I tried the following:
---------------------------
After I extract all of the text from each PDF file, I need to compute the
Levenshtein distance between the extracted text and each of the keywords in
my set of 11 keywords.
Since the extracted text is a very long string, I split it on the newline
character ("\n"). For each line, I compute the edit distance to each keyword,
keeping the threshold very low.
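The per-line comparison described above might look roughly like the sketch
below. It is only an illustration: the class and method names are
hypothetical, and the distance is hand-rolled so that the snippet runs without
any dependency (in the real code, the commons-lang3 StringUtils Levenshtein
method would take its place).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or commons-lang3.
class LineKeywordMatcher {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the lines whose distance to the keyword is at most the threshold.
    // Lines are trimmed and lowercased before comparison (an assumption).
    static List<String> matchingLines(String text, String keyword, int threshold) {
        List<String> hits = new ArrayList<>();
        String key = keyword.toLowerCase();
        for (String line : text.split("\n")) {
            if (levenshtein(line.trim().toLowerCase(), key) <= threshold) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

One thing this makes visible: the distance between two strings is never
smaller than the difference in their lengths, so with a low threshold a line
can only match when it is about as short as the keyword itself.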
However, this does not seem to be the correct approach, since the extracted
text contains a lot of junk characters caused by OCR noise and errors.
I think I need to do some pre-processing on the extracted text first.
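One possible shape for such a pre-processing step is sketched below. This is
a guess at a reasonable clean-up, not a prescribed method: it assumes that
lowercasing, replacing anything that is not a letter or digit with a space,
and dropping very short tokens is acceptable for the keywords in question.
The class name, method names and the minLength parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika or commons-lang3.
class OcrTextCleaner {

    // Lowercases the text and replaces every run of characters that is
    // neither a letter nor a decimal digit with a single space.
    static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT)
                  .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                  .trim();
    }

    // Splits the normalized text into tokens and drops tokens shorter than
    // minLength, which are assumed here to be mostly OCR debris.
    static List<String> tokens(String raw, int minLength) {
        List<String> out = new ArrayList<>();
        for (String token : normalize(raw).split(" ")) {
            if (token.length() >= minLength) {
                out.add(token);
            }
        }
        return out;
    }
}
```

The resulting tokens, rather than whole lines, could then be compared against
each keyword, so that the strings being compared are of similar length.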
Any pointers toward the right approach would be a great help.
Thanks!
-Margi