Not meaning to kick a dead thread, but this whole conversation has gotten me thinking about how to produce an effective variant of base64 for paper storage. Base64 is an interesting starting point because it encodes arbitrary binary data entirely in printable characters. It was obviously not designed for data on paper, though, at least initially, because several of the glyphs it uses are easy to confuse with one another.
Correcting this wouldn't be the first time this sort of thing has been done. For some reason the first example that jumps to mind is 8b/10b encoding as used in Serial ATA. I'm no electrical engineer, but by some intuition the encoding of an 8-bit word into an exactly equivalent 10-bit word with superior signal characteristics makes sense to me.

That said, the recipe for base64 is already well known: each character represents its 6-bit index in the alphabet A-Z, a-z, 0-9, +, /. I really don't think anyone wants to do too much messing with this elegant formula.

I've come up with something which I haven't yet tried to implement but which I think would be interesting to try. Let's call it "proofreadable base64". It's not terribly efficient, but we're going for recoverability more than efficiency. It goes something like this:

We can assume that each line of our medium is capable of relaying 76 relatively legible characters. The first 32 are data in normal base64 (that is, 24 bytes). Then there is a space and a CRC-24 as specified in OpenPGP, which takes 4 base64 characters. Then there are two spaces. After this, the first part of the line (data, space, CRC) is repeated, except it is as if it were filtered through the command:

    tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'

That is, for every "REGNADKCIN" that appears on the left side, there is a "H46D03A28D" on the right side. The total is 32 + 1 + 4 + 2 + 32 + 1 + 4 = 76 columns. The output should be printed using a legible, fixed-width font in order to preserve column alignment.

For our 137.5% increase in size over plain base64 (76 characters per line instead of 32), we've gotten a great deal of correctability. Firstly, every base64 character has effectively become a less ambiguous digraph in this encoding. It's probably easy for OCR to confuse 0, O, o, and Q in base64, but the pairs 0/n, O/E, o/b, Q/G are far less ambiguous. Secondly, an equivalently disambiguated CRC-24 on each line can tell a program which lines need to be reexamined in the first place. Combined with the first property, this could go a long way in helping the computer correct its own errors.
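To pin the layout down, here is a minimal Python sketch of how one output line could be built. It is an illustration only, not a finished tool: the names crc24, MIRROR and encode_line are made up here, and it assumes the CRC is taken over the 24 raw bytes of the line and printed as 4 base64 characters without the leading "=" that OpenPGP armor uses.

```python
import base64

# Mirror alphabet, equivalent to: tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
ALT = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ+/=abcdefghijklmnopqrstuvwxyz"
MIRROR = str.maketrans(B64, ALT)


def crc24(data: bytes) -> int:
    """CRC-24 with the OpenPGP parameters (init 0xB704CE, poly 0x1864CFB)."""
    crc = 0xB704CE
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= 0x1864CFB
    return crc & 0xFFFFFF


def encode_line(chunk: bytes) -> str:
    """Encode up to 24 bytes as: data, space, CRC, two spaces, mirrored copy."""
    data = base64.b64encode(chunk).decode("ascii")
    # Assumption: the CRC covers the raw bytes of this line only.
    check = base64.b64encode(crc24(chunk).to_bytes(3, "big")).decode("ascii")
    left = f"{data} {check}"
    # Spaces are not in the table, so column alignment is preserved.
    return f"{left}  {left.translate(MIRROR)}"


if __name__ == "__main__":
    payload = b"proofreadable base64 on paper, 24 bytes per printed line"
    for i in range(0, len(payload), 24):
        print(encode_line(payload[i:i + 24]))
```

A full 24-byte chunk gives exactly 76 columns; a shorter final chunk gives a shorter line whose "=" padding is mirrored like any other character.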
If a line's CRC came up incorrect and an o/n pair appeared in the input, the program would know the pair is impossible ("o" should be mirrored by "b", and "n" is the mirror of "0"), so it would definitely try converting it to a 0/n pair. Finally, in the event that this relatively simple checking mechanism is forgotten, we can cover up the last three columns of the paper (the CRC, the mirrored data and the mirrored CRC), scan it, and try to read it in as plain base64. (That said, we could really also prepend the source of a checking program to the printed output. :-)

What does everyone think?

Thanks
PSM
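P.S. A rough sketch of the self-correction idea, under the same assumptions as the layout described above (CRC-24 over the raw bytes of the line, spaces surviving the scan). The function name repair_line is hypothetical. For every column where the left reading and the un-mirrored right reading disagree, it tries both and keeps the first combination whose CRC matches, so an o/n pair is tried as both "o" and "0".

```python
import base64
from itertools import product

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
ALT = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ+/=abcdefghijklmnopqrstuvwxyz"
MIRROR = str.maketrans(B64, ALT)
UNMIRROR = str.maketrans(ALT, B64)


def crc24(data: bytes) -> int:
    """CRC-24 with the OpenPGP parameters (init 0xB704CE, poly 0x1864CFB)."""
    crc = 0xB704CE
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= 0x1864CFB
    return crc & 0xFFFFFF


def repair_line(line: str):
    """Return the decoded bytes of a scanned line, or None if nothing fits."""
    left, _, right = line.partition("  ")
    right = right.translate(UNMIRROR)
    if len(left) != len(right):
        return None
    # Per column: one candidate if both sides agree, otherwise two.
    # Note this is 2**k candidates for k disagreeing columns.
    options = [sorted({a, b}) for a, b in zip(left, right)]
    for candidate in product(*options):
        data, _, check = "".join(candidate).rpartition(" ")
        try:
            raw = base64.b64decode(data, validate=True)
            crc = base64.b64decode(check, validate=True)
        except ValueError:
            continue  # not valid base64, so this reading cannot be right
        if crc24(raw).to_bytes(3, "big") == crc:
            return raw
    return None
```

If both halves of a pair are misread in a consistent way, no candidate passes the CRC and the line is simply flagged (None) for a human to look at.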