On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas <robertmh...@gmail.com> wrote: > > It appears that any string starting with the letter "a" will create > output that begins with 001S00 and the seventh character always > appears to be 0 or 1: > > [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a " > "ab" -> "001S001T0000001S001T" > "ac" -> "001S001U0000001S001U" > "ad" -> "001S001V0000001S001V" > "ae" -> "001S001W0000001S001W" > "af" -> "001S001X0000001S001X" > "a%" -> "001S000W0000001S000W" > "a0" -> "001S000b0000001S000b" > "a " -> "001S000R0000001S000R"
... > [rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA > "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes) > "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes) > > Still, it's fair to say that on this Linux system, the first 8 bytes > capture a significant portion of the entropy of the first 8 bytes of > the string, whereas on MacOS X you only get entropy from the first 2 > bytes of the string. It would be interesting to see results from > other platforms people might care about also. If you knew mac's output character set with some certainty, you could compress it rather efficiently by applying a tabulated decode back into non-printable bytes. Say, like base64 decoding (the set appears to be a subset of base64, but it's hard to be sure). Ie, x = strxfrm(s) xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc) This can be made rather efficient. But the first step is defining with some certainty the output character set. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers