Peter S. May wrote:
> After this, the first part of the line is repeated, except it is as if
> it were filtered through the command:
>
> tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
>
> That is, for every "REGNADKCIN" that appears on the left side, there
> is a "H46D03A28D" on the right side.
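(For anyone who wants to play with the quoted substitution without a shell handy, the same mapping can be sketched in Python; the two alphabets below are just the `tr` ranges written out via the string module.)

```python
# Reproduce: tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
import string

src = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
dst = string.digits + string.ascii_uppercase + "+/=" + string.ascii_lowercase
table = str.maketrans(src, dst)

print("REGNADKCIN".translate(table))  # -> H46D03A28D
```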
That's a clever way of dramatically increasing the "uniqueness" of each character to reduce the ambiguity of the OCR. It would be useful for both error detection and error correction, and it would be even more effective if it could be integrated into the OCR engine itself. Although Gallager (LDPC) or Turbo codes would give much better error correction for a given amount of storage space, your method would be far easier to implement.

I'm leaning strongly against base64: too many of its characters are easily confused. Base32 would be nearly as dense (5 bits per character instead of 6) and would allow many of the troublesome characters to be left out. A simple conversion chart for the base32 characters could take up just one line at the bottom of the page, and the conversion to base32 and back would be very easy. Selecting the 32 unambiguous characters for the symbol set would require some care, and perhaps some testing to find out which symbols OCR programs get wrong most often.

> ...this wouldn't be the first time this sort of thing were done.

The only similar thing I've found is the Centinel Data Archiving Project:

http://www.cedarcreek.umn.edu/tools/t1003.html

The PDF file is a much clearer explanation than the other two. Centinel seems to be just an error-detecting code at the beginning of each line. That might be good enough, but I'm starting to think that some error correction would be highly desirable; even a little error correction could be a huge advantage over error detection alone.

> For some reason the first example that jumps to mind is 8-to-10 coding
> as used in Serial ATA. I'm no electrical engineer, but by some
> intuition the encoding of an 8-bit word into an exactly equivalent
> 10-bit word with superior signal characteristics makes sense to me.

I think most error correction codes mix the code bits in with the data bits. I'd like to keep the two in separate blocks, to make it easy for humans to separate and decode the data.
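As a sketch of the base32 idea: encode normally, then remap the standard alphabet onto a hand-picked symbol set. The SAFE alphabet below is only an illustration (I've borrowed Crockford's Base32 set, which drops I, L, O, and U); the real set should be chosen after testing which glyphs OCR engines actually confuse.

```python
# Base32 with a substituted "unambiguous" alphabet -- illustrative only.
import base64

RFC4648 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"  # standard base32 symbols
SAFE    = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford's Base32 symbols

enc_map = str.maketrans(RFC4648, SAFE)
dec_map = str.maketrans(SAFE, RFC4648)

def encode(data: bytes) -> str:
    """Encode to base32, then swap in the OCR-friendly symbol set."""
    return base64.b32encode(data).decode("ascii").translate(enc_map)

def decode(text: str) -> bytes:
    """Swap back to the standard symbols, then decode."""
    return base64.b32decode(text.translate(dec_map))

line = encode(b"paper backup")
assert decode(line) == b"paper backup"
```

The padding character "=" is untouched by both maps, so round-tripping works unchanged; a one-line chart of the 32 symbols at the bottom of the page is all a human decoder would need.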
Unfortunately, separating the error correction bits probably makes the code less robust. If we do want to intermix the error correction code, we could include a note at the bottom of the page saying "the third, sixth, ninth, etc. columns and rows are error correction data".

We also don't need the feature of hard drives and some signaling methods that ensures a good mixture of ones and zeros to keep the signal oscillating. On paper we can have all zeros or all ones if we want, with no signal-detection problems.

I was thinking of just using a normal typewriter-size font, but then I realized that a font half the size would not only improve data density, it would leave room for extra error correction. A small font with more error correction would probably be more reliable than a large font with less.

_______________________________________________
Gnupg-users mailing list
Gnupg-users@gnupg.org
http://lists.gnupg.org/mailman/listinfo/gnupg-users