"Daniel Verite" <[email protected]> writes:
>> - there is a byte-array representation of text columns, which appears to
>> be independent of database encoding
>
> Not sure what you're refering to. Both the on-disk and in-memory
> representations of text/varchar are encoding-dependent.
Huh, I thought I'd come to the opposite conclusion from testing the
other day (md5('ΓΏ') giving the same results for both a utf-8 database
and a latin1 database), but I must have been holding something wrong,
because trying again does confirm this. Sorry about that!
>> The obvious (to a naive user, like I was) approach, casting to bytea,
>> has exceptionally surprising behaviour: for many text strings, it does
>> exactly what the naive user might hope for, giving back the UTF-8
>> representation. But multiple distinct text strings, like '\033' and
>> '\x1b', convert to the same byte string! And text strings containing a
>> backslash that doesn't fit the bytea hex format or the bytea escape
>> format will fail to convert completely!
>
> Yes. It seems a common mistake to forget or ignore that
> backslashes are special in the input text representation of bytea.
> It might be not obvious from reading the doc at [1]
> but we just need to quote backslashes by doubling them.
>
> AFAIK a working solution for the OP would be:
> sha256(replace(colname, '\', '\\')::bytea)
That's so simple that it feels like it should have been
obvious... Thanks! Nevertheless, it feels like a weird and potentially
computationally expensive contortion for getting from a text value to
its bytea representation.
Given their encoding-dependence, shouldn't md5() (and ::bytea, for that
matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's
> guaranteed to return the same results given the same arguments forever
only valid within the context of a single database?
md5() and ::bytea aside, it seems like it would be good to have both
- a function for accessing the bytea representation of a text value
without any extra steps, and
- an IMMUTABLE function for getting a UTF-8 (specifically UTF-8,
because it's the only encoding with a character set encompassing all
other supported character sets) bytea representation of a text value
regardless of the database encoding?
Cheers
Linus