On Thu, Nov 03, 2005 at 12:28:02PM -0500, [EMAIL PROTECTED] wrote:
> It's unfortunate that the length is encoded multiple times. In UTF-8,
> for instance, each character has its length encoded in the most
> significant bits. Complicated to extract, however, the data is encoded
> twice. 1 in the header, and 1 in the combination between the column
> attribute, and the per character lengths.
>
> For "other databases", the column could be encoded as 2 byte characters
> or 4 byte characters, allowing it to be fixed. I find myself doubting
> that ASCII characters could be encoded more efficiently in such formats,
> than the inclusion of a length header and per character length encoding,
> but for multibyte characters, the race is probably even. :-)

That's called UTF-16 (strictly UCS-2, if every character is to be exactly
two bytes) and is currently not supported by PostgreSQL at all. That may
change, since the locale library ICU requires UTF-16 for everything.

The question is, if someone declares a field CHAR(20), do they really
mean to fix 40 bytes of storage for each and every row? I doubt it;
that's even more wasteful of space than a varlena header. Which puts
you right back to variable-length fields.

> I dunno... no opinion on the matter here, but I did want to point out
> that the field can be fixed length without a header. Those proposing such
> a change, however, should accept that this may result in an overall
> expense.

The only time this may be useful is for *very* short fields, on the
order of 4 characters or less. Otherwise the cost of fixed-width storage
swamps the saving from dropping the varlena header...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
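
The arithmetic behind the "4 characters or less" estimate can be sketched
roughly as follows. This is only a back-of-the-envelope illustration, not
PostgreSQL code: the 4-byte varlena header, the 2 bytes per character for
the fixed-width case, and all function names are assumptions made for the
example.

```python
# Back-of-the-envelope storage comparison (illustrative only).
# Assumptions: a 4-byte varlena header, and a fixed-width encoding
# that spends exactly 2 bytes on every character.
VARLENA_HEADER = 4


def fixed_width_bytes(n_chars, bytes_per_char=2):
    """Bytes used by a headerless, fixed-width CHAR(n_chars) value."""
    return n_chars * bytes_per_char


def varlena_utf8_bytes(text, header=VARLENA_HEADER):
    """Bytes used by a length header followed by the UTF-8 encoded text."""
    return header + len(text.encode("utf-8"))


def break_even_ascii_length(bytes_per_char=2, header=VARLENA_HEADER):
    """ASCII length n at which both layouts cost the same.

    Solves n * bytes_per_char == header + n for n.
    """
    return header / (bytes_per_char - 1)


if __name__ == "__main__":
    for n in (2, 4, 20):
        ascii_text = "a" * n
        print(n, fixed_width_bytes(n), varlena_utf8_bytes(ascii_text))
    # Characters that need 2 bytes each in UTF-8, e.g. "é":
    print(20, fixed_width_bytes(20), varlena_utf8_bytes("é" * 20))
```

Under these assumptions the two layouts tie at 4 ASCII characters; a
CHAR(20) of ASCII costs 40 bytes fixed-width against 24 with a header,
while 20 two-byte UTF-8 characters cost 40 against 44, which is the
"race is probably even" case for multibyte text.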