Karen Coyle wrote:
I know that 98% is impressive, but I always like to remember that with an average of 2000 characters per page that means 40 potential errors per book page. Just to give us some perspective on the level of cleanup that will be needed for books being digitized today.
The "good" news from the perspective of searching is that a reasonable percentage of those errors will affect terms that are either rarely used in searching or are repeated correctly in the vicinity. The bad news: phrase search is compromised. Screen readers for the visually impaired are compromised. Relevance that depends on term clustering is compromised.
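
For anyone who wants to play with the numbers, here is a rough back-of-the-envelope sketch in Python. It reproduces the 40-errors-per-page figure from the 98% and 2000-character numbers above, and then estimates how often a search phrase survives OCR untouched. The function names and the sample phrase are made up for illustration, and the phrase estimate assumes that every character (spaces included) is misread independently with the same probability, which real OCR errors certainly do not obey.

```python
def expected_errors_per_page(char_accuracy, chars_per_page):
    """Expected number of misread characters on one page."""
    return (1.0 - char_accuracy) * chars_per_page


def phrase_intact_probability(phrase, char_accuracy):
    """Chance that every character of the phrase was read correctly.

    Assumption (not from the thread): character errors are independent
    and equally likely everywhere, so the chance is accuracy ** length.
    """
    return char_accuracy ** len(phrase)


if __name__ == "__main__":
    # 98% accuracy on a 2000-character page, as quoted above
    print(round(expected_errors_per_page(0.98, 2000)))

    # hypothetical two-word search phrase
    print(round(phrase_intact_probability("digital library", 0.98), 2))
```

Under that simplified model, the longer the phrase, the worse the odds of an exact match, which is why phrase search suffers more than single-term search.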

If we had to correct it all: a) it would never get done, and b) it would be better than some of the originals, which are rife with typographic errors.

Walter
 who still regrets the Swedish Chef OCR of most microfilm newspaper projects
