Re: Proofreadable base64 (was Re: Printing Keys and using OCR.)
Casey Jones wrote:
> That's a clever way of dramatically increasing the "uniqueness" of each
> character to reduce the ambiguity of the OCR. It would be useful for
> both error detection and error correction. If it could be integrated
> into the OCR engine itself, it would be even more effective. Although
> Gallager or Turbo Codes would give much better error correction for a
> given storage space, your method would be way easier to implement.
>
> I'm leaning strongly against base64. There are just too many characters
> that can be easily confused. Base32 would be nearly as dense (5 bits
> instead of 6, per char) and would allow many tough characters to be left
> out. A simple conversion chart for base32 chars could take up just one
> line at the bottom of the page. The conversion to base32 and back would
> be very easy. Selecting the unambiguous 32 characters to use as the
> symbol set would require some care. Maybe some testing to find out which
> symbols the OCR programs get wrong most often.

Information density isn't the goal here. My general strategy, to lay out my context, is to encrypt my big .tar nightlies and offsite them--the survivability of the media the big stuff is on is effectively someone else's problem. (Not perfect, but good enough, and if you keep everything redundant, there's no real issue.) But you can't reasonably offsite the private key in the same way...otherwise, how do you open everything when the time comes? Via the system I've concocted, secring.gpg can be printed in under 300 lines. I peg that at around 4 one-sided pages of recoverable text--a small price to pay to maintain control of a key.

Actually, the draw of this idea as far as I'm concerned is that it's highly transparent: I'm very interested in ideas like PDF417 and QR, but there's a lot of support software involved that might not be so readily available--or compilable--in a pinch. Base64, on the other hand, fits in my head with very little effort. This means that, even in the outright absence of software that will actually handle base64, I could MacGyver something up without too much trouble in nearly any programming language that makes sense; a minimal decode loop is sketched below. (I'm generally YAPH, but I've been messing with awk a lot lately, considering that it's ubiquitous on any platform with an X in its name. But b64 is simple enough to do in C, or even VB if you must, or perhaps INTERCAL/brainf*ck/... if you enjoy an insane challenge.)

It must be noted that there's often a much easier way, though--base64 can be jimmied into a .eml-format file by using a mail client to create an e-mail with a dummy attachment, then changing the contents with a text editor and re-opening. (This trick has actually gotten me through some jams before!) In this way, it helps that base64 also happens to be extremely ubiquitous; there's almost certainly an implementation already on your machine. Getting base64 data into a machine isn't trivial, but it can again be argued that you have most or all of what you need at any workstation (unless you're blind, but even then it's not out of the question).

Barcode and data-matrix standards may wax and wane, but we can hopefully agree that OCR isn't leaving anytime soon. Besides, even if by some freak accident OCR were to drop off the face of the Earth, there are still human eyes, human minds, and possibly even administrative assistants willing to take dictation. ;-)
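Speaking of MacGyvering: the decode side really is just one alphabet string plus a 6-bits-in, 8-bits-out accumulator. A minimal sketch in C--purely an illustration, with no padding checks or error handling--might look like:

    #include <stdio.h>
    #include <string.h>

    static const char B64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    /* Decode base64 text into out[]; returns the number of bytes written.
       Padding ('='), whitespace, and anything else outside the alphabet
       is simply skipped. */
    static size_t b64_decode(const char *in, unsigned char *out)
    {
        unsigned long acc = 0;
        int bits = 0;
        size_t n = 0;

        for (; *in; in++) {
            const char *p = strchr(B64, *in);
            if (!p)
                continue;
            acc = (acc << 6) | (unsigned long)(p - B64);
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                out[n++] = (unsigned char)((acc >> bits) & 0xFF);
            }
        }
        return n;
    }

    int main(void)
    {
        unsigned char buf[64];
        size_t n = b64_decode("aGVsbG8sIHdvcmxk", buf);  /* "hello, world" */
        fwrite(buf, 1, n, stdout);
        putchar('\n');
        return 0;
    }

The encode direction is the same loop with the shift counts swapped: take bytes in 8 bits at a time and emit indexes 6 bits at a time.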
The translated digraph base64 in the third column would probably be easy enough to figure out even without the translation key via some simple "cryptanalysis" (I'm not suggesting the tr step is a cipher, but it does act like one); if the message is clear enough to be human readable, it itself would provide a more or less complete ciphertext-to-plaintext mapping.

I haven't done a great deal of research into how valuable the mapping I chose (tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z') actually is, but it's not an entirely random choice. In particular, it makes sure that A-Z and a-z aren't adjacent, so that, for example, S and s don't map to an equivalently similar upper/lower-case pair. It probably merits more investigation, but I want to implement the original thing first and do some live testing to verify that there's even any problem to correct.

Probably the only complicated part is the CRC-24; you might have to be just slightly hardcore to memorize the XORing polynomial involved (though the rest isn't that hard; I'm just not the "digits of pi" type). But that's mostly a tool for auto-correction anyway; you could get a long way with just the first and third columns.

By the way, last night I decided to try to implement CRC-24 in awk. It seems to have worked. It's not terribly efficient; I tried to stick to POSIX rules for portability, and POSIX awk has no XOR operator. Implementing XOR using substr is a rather humorous farce, I must say...

So, long and short, stay tuned. I'm close to a first implementation, and test messages will be posted.
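In the meantime, for anyone who'd rather not puzzle out the substr gymnastics: the C version is essentially the sample routine from RFC 2440 (the OpenPGP spec), and the awk version does the same shift-and-conditional-XOR, just without a native XOR operator. A sketch:

    #include <stdio.h>
    #include <string.h>

    #define CRC24_INIT 0xB704CEUL
    #define CRC24_POLY 0x1864CFBUL

    /* CRC-24 as used by OpenPGP ASCII armor (RFC 2440, section 6.1). */
    static unsigned long crc24(const unsigned char *buf, size_t len)
    {
        unsigned long crc = CRC24_INIT;
        while (len--) {
            crc ^= (unsigned long)(*buf++) << 16;
            for (int i = 0; i < 8; i++) {
                crc <<= 1;
                if (crc & 0x1000000UL)
                    crc ^= CRC24_POLY;
            }
        }
        return crc & 0xFFFFFFUL;
    }

    int main(void)
    {
        const char *line = "REGNADKCIN";
        printf("%06lX\n", crc24((const unsigned char *)line, strlen(line)));
        return 0;
    }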
Re: Proofreadable base64 (was Re: Printing Keys and using OCR.)
Peter S. May wrote:
> After this, the first part of the line is repeated, except it is as if
> it were filtered through the command:
>
> tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
>
> That is, for every "REGNADKCIN" that appears on the left side, there is
> a "H46D03A28D" on the right side.

That's a clever way of dramatically increasing the "uniqueness" of each character to reduce the ambiguity of the OCR. It would be useful for both error detection and error correction. If it could be integrated into the OCR engine itself, it would be even more effective. Although Gallager or Turbo Codes would give much better error correction for a given storage space, your method would be way easier to implement.

I'm leaning strongly against base64. There are just too many characters that can be easily confused. Base32 would be nearly as dense (5 bits instead of 6, per char) and would allow many tough characters to be left out. A simple conversion chart for base32 chars could take up just one line at the bottom of the page. The conversion to base32 and back would be very easy. Selecting the unambiguous 32 characters to use as the symbol set would require some care. Maybe some testing to find out which symbols the OCR programs get wrong most often.

> ...this wouldn't be the first time this sort of thing were done.

The only thing I've found similar is the Centinel Data Archiving Project:

http://www.cedarcreek.umn.edu/tools/t1003.html

The pdf file is a much clearer explanation than the other two. Centinel seems to be just an error-detecting code at the beginning of each line. This might be good enough, but I'm starting to think that some error correction would be highly desirable. Even a little error correction could be a huge advantage over just error detection.

> For some reason the first example that jumps to mind is 8-to-10 coding
> as used in Serial ATA. I'm no electrical engineer, but by some
> intuition the encoding of an 8-bit word into an exactly equivalent
> 10-bit word with superior signal characteristics makes sense to me.

I think most error correction codes mix the code bits with the data bits. I'd like to keep the data in separate blocks to make it easy for humans to separate and decode it. Unfortunately, separating the error correction bits probably makes the code less robust. If we want to intermix the error correction code, maybe we could include a note at the bottom that says "the third, sixth, ninth, etc. columns and rows are error correction data".

We also don't need the feature of hard drives and some signaling methods that ensures a good mixture of ones and zeros to keep the signal oscillating. We can have all zeros or all ones on paper if we want, with no signal detection problems.

I was thinking about just using a normal typewriter-size font. But then I realized that if we use a font half the size, it would not only improve data density, but we could also include extra error correction. A small font with more error correction would probably be more reliable than a large font with less error correction.
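To put a number on "very easy": a 5-bits-per-symbol encoder is only a few lines of C. The 32-character alphabet below is just a stand-in (digits 2-9 plus the letters minus I and O); the real symbol set would come out of exactly the kind of OCR testing mentioned above.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical 32-character symbol set; chosen only for illustration. */
    static const char B32[] = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";

    /* Emit 5-bit groups, most significant bits first (no padding handling). */
    static void b32_encode(const unsigned char *in, size_t len, FILE *out)
    {
        unsigned long acc = 0;
        int bits = 0;

        for (size_t i = 0; i < len; i++) {
            acc = (acc << 8) | in[i];
            bits += 8;
            while (bits >= 5) {
                bits -= 5;
                fputc(B32[(acc >> bits) & 0x1F], out);
            }
        }
        if (bits > 0)                           /* flush the last partial group */
            fputc(B32[(acc << (5 - bits)) & 0x1F], out);
    }

    int main(void)
    {
        const char *msg = "hello";
        b32_encode((const unsigned char *)msg, strlen(msg), stdout);
        putchar('\n');
        return 0;
    }

Decoding is the mirror image: look each symbol up in the table and peel bytes off the accumulator 8 bits at a time.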
Proofreadable base64 (was Re: Printing Keys and using OCR.)
Not meaning to kick a dead thread, but this whole conversation has gotten me thinking about how to produce an effective variant of base64 for paper storage.

Base64 is an interesting solution because it fully encodes raw data into what is effectively printable characters. Yet it was obviously not designed, at least initially, for data on paper, because of possible ambiguities in the glyphs it does use. Correcting this wouldn't be the first time this sort of thing were done. For some reason the first example that jumps to mind is 8-to-10 coding as used in Serial ATA. I'm no electrical engineer, but by some intuition the encoding of an 8-bit word into an exactly equivalent 10-bit word with superior signal characteristics makes sense to me.

That said, the recipe for base64 is already well-known--each character represents its 6-bit index in the string "A-Za-z0-9+/". I really don't think anyone wants to do too much messing with this elegant formula.

I've come up with something which I haven't yet tried to implement but which I think would be interesting to try. Let's call it "proofreadable base64". It's not terribly efficient, but we're going for recoverability more than efficiency. It goes something like this:

We can assume that each line of our medium is capable of relaying 76 relatively legible characters. The first 32 are data in normal base64. Then, there is a space and a CRC-24 as specified in OpenPGP. Then, there are two spaces. After this, the first part of the line is repeated, except it is as if it were filtered through the command:

tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'

That is, for every "REGNADKCIN" that appears on the left side, there is a "H46D03A28D" on the right side. The output should be printed using a legible, fixed-width font in order to preserve column alignment.

For our 137.5% increase in size, we've gotten a great deal of correctability. Firstly, every base64 character has effectively become a less ambiguous digraph in this encoding. It's probably easy for OCR to confuse 0, O, o, and Q in base64, but the pairs 0/n, O/E, o/b, and Q/G are far less ambiguous. Secondly, an equivalently disambiguated CRC-24 on each line can tell a program which lines need to be reexamined in the first place. Combined with the first property, this could go a long way in helping the computer correct its own errors. For example, if the CRC came up incorrect and an o/n pair appeared in the input, it would definitely try converting the error to a 0/n pair. Finally, in the event that this relatively simple checking mechanism is forgotten, we can cover up the last three columns of the paper, scan it, and try to read it in as plain base64. (That said, we could really also prepend the source of a checking program to the printed output. :-)

What does everyone think?

Thanks
PSM
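For what it's worth, the tr step needs no external tool at recovery time either: spelling out the two ranges as strings gives a lookup table that's easy to reconstruct from memory. A rough C rendering of that tr command--just one possible sketch, not part of the proposal proper:

    #include <stdio.h>
    #include <string.h>

    /* The same mapping as: tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z' */
    static const char FROM[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+/=";
    static const char TO[] =
        "0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "+/="
        "abcdefghijklmnopqrstuvwxyz";

    /* Translate one base64 string into its "third column" form. */
    static void translate(const char *in, char *out)
    {
        for (; *in; in++, out++) {
            const char *p = strchr(FROM, *in);
            *out = p ? TO[p - FROM] : *in;   /* leave spaces etc. alone */
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[80];
        translate("REGNADKCIN", buf);
        printf("%s\n", buf);   /* prints H46D03A28D */
        return 0;
    }

Since the 65-character mapping is a bijection, swapping FROM and TO runs the translation the other way, which is what a reader would do if only the third column survived legibly.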