Re: Proofreadable base64 (was Re: Printing Keys and using OCR.)

2007-05-29 Thread Peter S. May
Casey Jones wrote:
> That's a clever way of dramatically increasing the "uniqueness" of each 
> character to reduce the ambiguity of the OCR. It would be useful for 
> both error detection and error correction. If it could be integrated 
> into the OCR engine itself, it would be even more effective. Although 
> Gallager or Turbo Codes would give much better error correction for a 
> given storage space, your method would be way easier to implement.
> 
> I'm leaning strongly against base64. There are just too many characters 
> that can be easily confused. Base32 would be nearly as dense (5 bits 
> instead of 6, per char) and would allow many tough characters to be left 
> out. A simple conversion chart for base32 chars could take up just one 
> line at the bottom of the page. The conversion to base32 and back would 
> be very easy. Selecting the unambiguous 32 characters to use as the 
> symbol set would require some care. Maybe some testing to find out which 
> symbols the OCR programs get wrong most often.

Information density isn't the goal here.  My general strategy, to lay
out my context, is to encrypt my big .tar nightlies and offsite
them--the survivability of the media the big stuff is on is effectively
someone else's problem.  (Not perfect, but good enough, and if you keep
everything redundant, there's no real issue.)  But you can't reasonably
offsite the private key in the same way...otherwise, how do you open
everything when the time comes?  Via the system I've concocted,
secring.gpg can be printed in under 300 lines.  I peg that at around 4
one-sided pages of recoverable text--a small price to pay to maintain
control of a key.

Actually, the draw of this idea as far as I'm concerned is that it's
highly transparent:  I'm very interested in ideas like PDF417 and QR,
but there's a lot of support software involved that might not be so
readily available--or compilable--in a pinch.  Base64, on the other
hand, fits in my head with very little effort.  This means that, even in
the outright absence of software that will actually handle base64, I
could MacGyver something up without too much trouble in nearly any
programming language that makes sense.  (I'm generally YAPH, but I've
been messing with awk a lot lately, since it's ubiquitous on any
platform with an X in its name.  And b64 is simple enough to do in C, or
even VB if you must, or perhaps INTERCAL/brainf*ck/... if you enjoy an
insane challenge.)

It must be noted that there's often a much easier way, though--base64
can be jimmied into a .eml-format file by using a mail client to create
an e-mail with a dummy attachment, then changing the contents with a
text editor and re-opening.  (This trick has actually gotten me through
some jams before!)  It also helps that base64 is ubiquitous; there's
almost certainly an implementation already on your machine.

Getting base64 data into a machine isn't trivial, but it can again be
argued that you have most or all of what you need at any workstation
(unless you're blind, but even then it's not out of the question).
Barcodes and data matrix standards may wax and wane, but we can
hopefully agree that OCR isn't leaving anytime soon.  Besides, even if
by some freak accident OCR were to drop off the face of the Earth, there
are still human eyes, human minds, and possibly even administrative
assistants willing to take dictation. ;-)

The translated digraph base64 in the third column would probably be
easy enough to figure out even without the translation key via some
simple "cryptanalysis" (I'm not suggesting the tr step is a cipher, but
it does act like one); if the printout is clear enough to be human
readable, it would itself provide a more or less complete
ciphertext-to-plaintext mapping; just pair up the first and third
columns character by character.

I haven't done a great deal of research into how valuable the mapping I
chose (tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z') actually is, but it's not an
entirely random choice.  In particular, it keeps the A-Z and a-z ranges
from lining up, so that, for example, S and s don't map to another,
equally confusable upper/lower-case pair.  It probably merits more
investigation, but I want to implement the original thing first and do
some live testing to verify that there's even any problem to correct.
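
If you want to sanity-check that property, it only takes a few lines.
Here is a minimal sketch in POSIX awk (an illustration only, not part
of the scheme itself) that expands the two tr alphabets into a lookup
table and flags any upper/lower pair whose images are themselves an
upper/lower pair:

#!/usr/bin/awk -f
# Sketch: expand the alphabets of  tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
# into a table, then check that no pair like S/s is sent to another
# pair like X/x, which would keep the original case ambiguity alive
# in the digraphs.
BEGIN {
    src = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
    dst = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ+/=abcdefghijklmnopqrstuvwxyz"
    for (i = 1; i <= length(src); i++)
        map[substr(src, i, 1)] = substr(dst, i, 1)

    bad = 0
    for (i = 1; i <= 26; i++) {
        u = substr(src, i, 1)               # uppercase letter
        l = tolower(u)                      # its lowercase twin
        if (toupper(map[u]) == toupper(map[l])) {
            printf "still ambiguous: %s->%s vs %s->%s\n", u, map[u], l, map[l]
            bad = 1
        }
    }
    print (bad ? "case-pair collision found" : "no case pair maps onto another case pair")
}

As far as I can tell, the shifted ranges above come back clean.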

Probably the only complicated part is the CRC-24; you might have to be
slightly hardcore just to memorize the XORing polynomial involved
(though the rest isn't that hard; I'm just not the "digits of pi" type).
But that's mostly a tool for auto-correction anyway; you could get a
long way with just the first and third columns.

By the way, last night I decided to try to implement CRC-24 in awk.  It
seems to have worked.  It's not terribly efficient; I tried to stick to
POSIX rules for portability, and POSIX awk has no XOR operator.
Implementing XOR using substr is a rather humorous farce, I must say...
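
To give the flavor, a minimal sketch of the idea (not the exact script;
the helper names here are made up for the example) might look like the
following.  It runs the OpenPGP CRC-24 (polynomial 0x1864CFB, initial
value 0xB704CE) over whatever printable bytes arrive on stdin, keeping
the register as a 24-character string of 0s and 1s; in the real format
the CRC would presumably be taken over the 24 decoded data bytes of
each line, the way OpenPGP armor runs it over the raw data.

#!/usr/bin/awk -f
# Sketch: OpenPGP CRC-24 in POSIX awk.  There is no ^ operator, so the
# CRC register lives in a 24-character string of "0"/"1" and XOR is done
# with substr(), one bit at a time.  Newlines in the input are ignored.

function xorbits(a, b,    i, n, out) {   # XOR of two equal-length bit strings
    n = length(a); out = ""
    for (i = 1; i <= n; i++)
        out = out (substr(a, i, 1) == substr(b, i, 1) ? "0" : "1")
    return out
}

function tobits(n, width,    out) {      # integer -> fixed-width bit string
    out = ""
    while (width-- > 0) { out = (n % 2) out; n = int(n / 2) }
    return out
}

function tonum(bits,    i, n) {          # bit string -> integer
    for (i = 1; i <= length(bits); i++)
        n = n * 2 + substr(bits, i, 1)
    return n
}

BEGIN {
    for (i = 32; i < 127; i++)           # printable ASCII -> byte value
        ord[sprintf("%c", i)] = i
    crc  = tobits(11994318, 24)          # 0xB704CE, the OpenPGP initial value
    poly = tobits(25578747, 25)          # 0x1864CFB, 25 bits with the leading 1
}

{
    for (j = 1; j <= length($0); j++) {
        byte = tobits(ord[substr($0, j, 1)], 8) "0000000000000000"
        crc = xorbits(crc, byte)         # crc ^= byte << 16
        for (k = 0; k < 8; k++) {
            crc = crc "0"                # crc <<= 1 (25 bits for a moment)
            if (substr(crc, 1, 1) == "1")
                crc = xorbits(crc, poly) # reduce by the polynomial
            crc = substr(crc, 2)         # and keep the low 24 bits
        }
    }
}

END { printf "%06X\n", tonum(crc) }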

So, long and short, stay tuned.  I'm close to a first implementation and
test messages will be pa

Re: Proofreadable base64 (was Re: Printing Keys and using OCR.)

2007-05-29 Thread Casey Jones
Peter S. May wrote:
 > After this, the first part of the
 > line is repeated, except it is as if it were filtered
 > through the command:
 >
 > tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'
 >
 > That is, for every "REGNADKCIN" that appears
 > on the left side, there is
 > a "H46D03A28D" on the right side.

That's a clever way of dramatically increasing the "uniqueness" of each 
character to reduce the ambiguity of the OCR. It would be useful for 
both error detection and error correction. If it could be integrated 
into the OCR engine itself, it would be even more effective. Although 
Gallager or Turbo Codes would give much better error correction for a 
given storage space, your method would be way easier to implement.

I'm leaning strongly against base64. There are just too many characters 
that can be easily confused. Base32 would be nearly as dense (5 bits 
instead of 6, per char) and would allow many tough characters to be left 
out. A simple conversion chart for base32 chars could take up just one 
line at the bottom of the page. The conversion to base32 and back would 
be very easy. Selecting the unambiguous 32 characters to use as the 
symbol set would require some care. Maybe some testing to find out which 
symbols the OCR programs get wrong most often.

> ...this wouldn't be the first time this sort of thing has been done.
The only similar thing I've found is the Centinel Data Archiving Project.
http://www.cedarcreek.umn.edu/tools/t1003.html
The PDF file is a much clearer explanation than the other two documents.
Centinel seems to be just an error-detecting code at the beginning of
each line. This might be good enough, but I'm starting to think that
some error correction would be highly desirable. Even a little error
correction could be a huge advantage over just error detection.

>  For some reason the first example that jumps to mind is 8-to-10 coding
> as used in Serial ATA.  I'm no electrical engineer, but by some
> intuition the encoding of an 8-bit word into an exactly equivalent
> 10-bit word with superior signal characteristics makes sense to me.

I think most error correction codes mix the code bits with the data
bits. I'd like to keep the data in separate blocks to make it easy for
humans to separate and decode it. Unfortunately, separating out the
error correction bits probably makes the code less robust. If we want to
intermix the error correction code, maybe we could include a note at the
bottom that says "the third, sixth, ninth, etc. columns and rows are
error correction data".  We also don't need the feature of hard drives
and some signaling methods that ensures a good mixture of ones and zeros
to keep the signal oscillating. We can have all zeros or all ones on
paper if we want, with no signal detection problems.

I was thinking about just using a normal typewriter-size font. But then
I realized that if we use a font half the size, it would not only
improve data density but also leave room for extra error correction. A
small font with more error correction would probably be more reliable 
than a large font with less error correction.



Proofreadable base64 (was Re: Printing Keys and using OCR.)

2007-05-28 Thread Peter S. May
Not meaning to kick a dead thread, but this whole conversation has
gotten me thinking about how to produce an effective variant of base64
for paper storage.  Base64 is an interesting solution because it
encodes raw data entirely into printable characters.  Yet it obviously
was not designed with paper in mind, at least not initially, given the
possible ambiguities among the glyphs it uses.

Correcting this wouldn't be the first time this sort of thing has been
done.  For some reason the first example that jumps to mind is 8-to-10 coding
as used in Serial ATA.  I'm no electrical engineer, but by some
intuition the encoding of an 8-bit word into an exactly equivalent
10-bit word with superior signal characteristics makes sense to me.

That said, the recipe for base64 is already well-known--each character
represents its 6-bit index in the string "A-Za-z0-9+/".  I really don't
think anyone wants to do too much messing with this elegant formula.

I've come up with something which I haven't yet tried to implement but
which I think would be interesting to try.  Let's call it "proofreadable
base64".  It's not terribly efficient, but we're going for
recoverability more than efficiency.

It goes something like this:  We can assume that each line of our medium
is capable of relaying 76 relatively legible characters.  The first 32
are data in normal base64.  Then, there is a space and a CRC-24 as
specified in OpenPGP.  Then, there are two spaces.  After this, the
first part of the line is repeated, except it is as if it were filtered
through the command:

tr 'A-Za-z0-9+/=' '0-9A-Z+/=a-z'

That is, for every "REGNADKCIN" that appears on the left side, there is
a "H46D03A28D" on the right side.

The output should be printed using a legible, fixed-width font in order
to preserve column alignment.

For our 137.5% increase in size, we've gotten a great deal of
correctability.  Firstly, every base64 character has effectively become
a less ambiguous digraph in this encoding.  It's probably easy for OCR
to confuse 0, O, o, and Q in base64, but the pairs 0/n, O/E, o/b, Q/G
are far less ambiguous.  Secondly, an equivalently disambiguated CRC-24
on each line can tell a program which lines need to be reexamined in the
first place.  Combined with the first property, this could go a long way
in helping the computer correct its own errors.  For example, if the
CRC came up incorrect and an inconsistent o/n pair appeared in the input
(o should pair with b, whereas 0 pairs with n), the obvious first guess
would be to read it as 0/n.
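
A first pass at that kind of proofreading could be quite dumb: rebuild
the expected right-hand column from the left one, compare the two
character by character, and report every cell where they disagree along
with what each side implies the other should have been.  A rough sketch
in awk (illustrative only; the column handling and messages are
invented for the example):

#!/usr/bin/awk -f
# Sketch: expects the plain base64 in the first whitespace-separated
# field of each line and the translated copy in the last field (the CRC
# field, when present, is simply ignored here).  Wherever the columns
# disagree, print the position and what each side implies, so a human
# or a smarter program can decide which reading to trust.
BEGIN {
    src = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
    dst = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ+/=abcdefghijklmnopqrstuvwxyz"
    for (i = 1; i <= length(src); i++) {
        fwd[substr(src, i, 1)] = substr(dst, i, 1)   # plain -> translated
        rev[substr(dst, i, 1)] = substr(src, i, 1)   # translated -> plain
    }
}
{
    left = $1; right = $NF
    if (length(left) != length(right)) {
        printf "line %d: columns differ in length (%d vs %d)\n", NR, length(left), length(right)
        next
    }
    for (i = 1; i <= length(left); i++) {
        l = substr(left, i, 1); r = substr(right, i, 1)
        if (fwd[l] != r)
            printf "line %d, col %d: %s/%s disagree (left implies %s, right implies %s)\n", NR, i, l, r, fwd[l], rev[r]
    }
}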

Finally, in the event that this relatively simple checking mechanism is
forgotten, we can cover up the last two columns of the paper, scan it,
and try to read it in as plain base64.  (That said, we could really also
prepend the source of a checking program to the printed output. :-)

What does everyone think?

Thanks
PSM


