Hi, I'm the one who asked :) Ron Johnson wrote: > On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski > <depesz(at)depesz(dot)com> > wrote: > > > Hi, > > So, we have various sha* functions. > > > > And I recently got asked about using them as a based for unique index > > over long texts. > > Normally one would do it with md5(text), but the person asking wanted to > > use sha(). > > I think I'd push back, asking them if they really need > cryptographically-secure hashing (which they most probably don't).
I indeed don't need cryptographically-secure hashing in this scenario, and I will be using md5. Nevertheless, I think using md5 in any context that isn't "to verify digests produced in the past" is a poor decision, and postgres should make using other hashes just as easy! The specific context of hashing aside, it seems weird to me that: - there is a byte-array representation of text columns, which appears to be independent of database encoding - there doesn't seem to be _any_ sane way to access this. The obvious (to a naive user, like I was) approach, casting to bytea, has exceptionally surprising behaviour: for many text strings, it does exactly what the naive user might hope for, giving back the UTF-8 representation. But multiple distinct text strings, like '\033' and '\x1b', convert to the same byte string! And text strings containing a backslash that doesn't fit the bytea hex format or the bytea escape format will fail to convert completely! The fact that convert_to() is only stable and not immutable makes sense to me given the effect that configuration can have on its behaviour, but given that there does appear to be a trivially-accessible UTF-8 representation (as used by md5()) I think there should be an immutable function that provides access to it? Is there a good reason not to? I'd be willing to send a patch for it myself. Linus
