On the other hand, just for the FUN of it, can you write some software to find and fix (or simply flag) the most common errors?

When I had to terminate my publisher, I was s'posed to receive a copy of their customer database. They deleted all delimiters (spaces, commas, periods, and other punctuation), and then printed it out on greenbar with a font that used the same character for zero and letter 'O'; and the same character for one, lower case 'l', and upper case 'I'. Surprisingly, they did NOT use a bad ribbon and printhead!

An acquaintance OCR'ed it. They were able to get what they thought was "80 to 90%" from the originals, but not from a xerox copy that visually seemed to be just as good.

I spent a little time writing some simple code to parse and fix most of it. Mostly simple context, such as a zero between two letters is likely an 'O', or an 'O' between two numerals is likely a zero. Similarly with one, lower case 'L' and upper case 'I'. Some OCR software now pays attention to context. Five consecutive numerals following two capital letters is likely to be a zip code, and end of the record. USUALLY. Comparison of those digits with the two letters in a zipcode database provided partial confirmation.
. . . and so forth . . .
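
For anyone who wants to play along, here is a minimal Python sketch of that kind of context rule. It is not the original code; the function names, the choice of 'I' versus 'l' by the case of the neighbors, and the sample strings are all assumptions made for illustration.

```python
import re

# Two capital letters followed by five numerals: probably state + zip,
# and probably the end of a record. USUALLY.
ZIP_END = re.compile(r"[A-Z]{2}\d{5}")

def fix_by_context(text):
    """Swap look-alike characters based on their immediate neighbors."""
    chars = list(text)
    for i in range(1, len(text) - 1):
        # Neighbors are read from the ORIGINAL text, so one fix never
        # influences the next one.
        prev, cur, nxt = text[i - 1], text[i], text[i + 1]
        if prev.isalpha() and nxt.isalpha():
            if cur == "0":
                chars[i] = "O"
            elif cur == "1":
                # Assumption: between two capitals it is 'I', otherwise 'l'.
                chars[i] = "I" if prev.isupper() and nxt.isupper() else "l"
        elif prev.isdigit() and nxt.isdigit():
            if cur == "O":
                chars[i] = "0"
            elif cur in "lI":
                chars[i] = "1"
    return "".join(chars)

def split_records(blob):
    """Cut a delimiter-free blob after each state + zip pattern."""
    records, start = [], 0
    for match in ZIP_END.finditer(blob):
        records.append(blob[start:match.end()])
        start = match.end()
    if start < len(blob):
        records.append(blob[start:])  # leftover with no zip found
    return records
```

Running fix_by_context() before split_records() lets a zip like "627O4" get repaired before the record boundary is searched for. Two misread characters side by side (e.g. "2OO5") slip through, since neither one has clean neighbors on both sides; those still need a human or a second pass.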

Not a practical use of time, but a fun exercise in parsing.


Another time, the .SRT file that I found for "Company Man" used upper case 'I' instead of lower case 'L'! (AND had a three minute offset for the start time) Did not take very long to fix.
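
A rough Python sketch of that sort of subtitle fix, assuming the usual SRT timestamp layout (HH:MM:SS,mmm --> HH:MM:SS,mmm). The function names are made up for illustration, and the sign of the offset depends on which way the file was off, so it is left as a parameter.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")
# An upper case 'I' (or a run of them) right after a lower case letter
# is almost certainly a misplaced lower case 'l'.
MISPLACED_I = re.compile(r"(?<=[a-z])I+")

def shift_timestamp(match, offset_ms):
    """Return one timestamp moved by offset_ms, clamped at zero."""
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    total = max(total, 0)
    s, ms = divmod(total, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def fix_srt(text, offset_ms):
    """Shift every timing line and repair 'I'-for-'l' in the dialogue."""
    out = []
    for line in text.splitlines():
        if "-->" in line:
            line = TIMESTAMP.sub(lambda mt: shift_timestamp(mt, offset_ms), line)
        else:
            line = MISPLACED_I.sub(lambda mt: "l" * len(mt.group()), line)
        out.append(line)
    return "\n".join(out)
```

For a file that runs three minutes late, fix_srt(text, -180_000) pulls everything back. A word that STARTS with the wrong letter ("Iike") is left alone, because it cannot be told apart from a legitimate "It" without a dictionary.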

--
Grumpy Ol' Fred                 ci...@xenosoft.com


On Tue, 1 Jan 2019, dwight wrote:

Fred is right, OCR is only worth it if the document is in perfect condition. I 
just finished getting an old 4004 listing working. I made only two mistakes on 
the 4K of code that were not the fault of the poorness of the listing. Twice I 
put LDM instead of LD. LDM was the most commonly used.
There were still some 15 or so other errors due to the printing. It looked to be 
done on an ASR33 with poor registration of the print drum. Cs and 0s were often 
missing the right 1/3. Expecting an OCR to do much would have been a folly. 
Even though some 85% to 90% could be read properly. It took me about 3 weeks of 
evenings to make heads or tails of the code. I've finally got it running 
correctly.
If it had been done with an OCR, in many cases it would have simply put a C 
instead of a 0. I'd have had to go through the listing, checking each C to make 
sure it was right. It is easier in many cases to have analysed what I could see 
and make a judgement, based on what I could see and the general context as I 
was typing it in.
Dwight

________________________________
From: cctalk <cctalk-boun...@classiccmp.org> on behalf of Fred Cisin via cctalk 
<cctalk@classiccmp.org>
Sent: Monday, December 31, 2018 9:46 AM
To: General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OCR old software listing

On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:
I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
from the Multipage .tif file.  While the .tif's look descent, and
RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100
x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic
2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with
descent results.  I'd expect an OCR of 85 to 90 % correct conversion to
ASCII text.

Software listings need more accuracy than that.
How many wrong characters does it take for a program not to work?
"desCent" isn't good enough.

85 to 90 % correct is a character wrong in every 6 to 10 characters.
How many errors is that PER LINE?
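
To put a number on it, here is a back-of-the-envelope sketch in Python. The 80-column line length and the idea that errors are independent of each other are assumptions for illustration; real listing lines are often shorter, and real OCR errors tend to cluster.

```python
def errors_per_line(accuracy, line_length=80):
    """Expected number of wrong characters in one line."""
    return (1.0 - accuracy) * line_length

def chance_line_is_clean(accuracy, line_length=80):
    """Probability that a whole line has no wrong character,
    assuming each character is misread independently."""
    return accuracy ** line_length

for acc in (0.85, 0.90):
    print(f"{acc:.0%} correct: about {errors_per_line(acc):.0f} "
          f"wrong characters per 80-column line")
```

Under those assumptions, 85 to 90 % correct works out to roughly 8 to 12 wrong characters on every full line, and the odds of any given line coming through untouched are far below one in a thousand.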

"But, you can start with that, and just fix the errors, without retyping
the rest."  Doing it that way is a desCent into madness.
BTDT.  wore out the T-shirts.


A competent typist can retype the whole thing faster than fixing an error
in every six to ten characters.
Only if there is less than one error for every several hundred characters
does "patching it" save time for a competent typist.
In general, for a competent typist, the fastest way to reposition the
cursor to the next error in the line is to simply hit the keys of the
intervening letters.
It is NOT to move the cursor with the mouse, then put your hand back on
the keys to type a character.
Using cursor motion keys is no faster for a competent typist than hitting
the keys of the letters to skip over.


TIP: display the OCR'ed text that is to be corrected in a font that
exaggerates the difference between zero and the letter 'O', and between
one and lower case 'l'.  There are some programs that will attempt to
select those based on context.

--
Grumpy Ol' Fred                  ci...@xenosoft.com
