On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> It appears that any string starting with the letter "a" will create
> output that begins with 001S00 and the seventh character always
> appears to be 0 or 1:
>
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T0000001S001T"
> "ac" -> "001S001U0000001S001U"
> "ad" -> "001S001V0000001S001V"
> "ae" -> "001S001W0000001S001W"
> "af" -> "001S001X0000001S001X"
> "a%" -> "001S000W0000001S000W"
> "a0" -> "001S000b0000001S000b"
> "a " -> "001S000R0000001S000R"

...

> [rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
> "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
>
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string.  It would be interesting to see results from
> other platforms people might care about also.

If you knew the Mac's output character set with some certainty, you
could compress it rather efficiently by using a lookup table to decode
it back into raw (non-printable) bytes, much like base64 decoding (the
set appears to be a subset of the base64 alphabet, but it's hard to be
sure).


I.e., with tab mapping each output character to a 6-bit value:
x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc)

This can be made rather efficient, but the first step is pinning down
the output character set with some certainty.
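
To make the idea concrete, here is a rough sketch in Python. It is not
tested against a real Mac; it ASSUMES the blob only ever contains ASCII
digits and letters (62 symbols), which is just a guess from the sample
output above. ALPHABET, TAB and pack_sort_key are names made up for this
illustration. Unlike the formula above, it puts the first character in
the most significant bits and reserves code 0 for padding, so that a
bytewise comparison of the packed keys should order them the same way as
the original blobs.

```python
import string

# ASSUMPTION: strxfrm() output uses only these 62 characters. They are
# listed in ASCII order so that the 6-bit codes compare the same way
# the original characters do.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Code 0 is reserved for padding, so real symbols map to 1..62.
TAB = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def pack_sort_key(x: str) -> bytes:
    """Pack each group of 4 characters (6 bits each) into 3 bytes."""
    out = bytearray()
    for i in range(0, len(x), 4):
        group = x[i:i + 4]
        acc = 0
        for j in range(4):
            # Short final group: pad with 0, which sorts before any symbol.
            # An unexpected character raises KeyError, i.e. the assumed
            # character set was wrong.
            code = TAB[group[j]] if j < len(group) else 0
            acc = (acc << 6) | code
        out += acc.to_bytes(3, "big")
    return bytes(out)


if __name__ == "__main__":
    blob = "001S001T0000001S001T"  # the "ab" line from the output above
    packed = pack_sort_key(blob)
    print(len(blob), "chars ->", len(packed), "bytes")
```

With this packing the 20-character blob for "ab" shrinks to 15 bytes,
and the first 8 bytes of the packed key would cover the first 10
characters of the blob instead of the first 8.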

