On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:
On 03/08/16 17:47, Kevin Grittner wrote:
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa
<a...@8kdata.com> wrote:
What would it take to support it?
Would it be of any value to support "Modified UTF-8"?
https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
That's nice, but I don't think so.
The problem is that you cannot predict how people will send you
data, for example when importing from other databases. I guess it may
work if Postgres implemented such a UTF-8 variant, and the drivers did
too, but that would still require an encoding conversion (i.e., parsing
every string) to replace the 0x00 bytes, which seems like a serious
performance hit.
It could be worse than nothing, though!
Thanks,
Álvaro
It may indeed work.
According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 byte
sequence. In short, the NUL character (U+0000) is represented, as an
overlong encoding, by the two-byte, one-character sequence 0xC0 0x80.
These two bytes are invalid in standard UTF-8, so they should not appear
in an otherwise valid UTF-8 string. Yet they are accepted by Postgres
(as if Postgres supported Modified UTF-8 intentionally). The character
in psql does not render as a NUL but as this symbol: "삀".
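One thing worth keeping apart here is the byte sequence 0xC0 0x80 and
the code point U+C080, since they are not the same thing. A minimal
sketch in Python (my choice for illustration only; the helper name
is_valid_utf8 is made up) shows how a strict UTF-8 decoder treats each:

```python
# Overlong two-byte encoding of U+0000, as used by Modified UTF-8.
overlong_nul = b"\xc0\x80"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The code point U+C080 is a Hangul syllable; in standard UTF-8 it takes
# three bytes and has nothing to do with the 0xC0 0x80 byte pair.
hangul = chr(0xC080)

print(is_valid_utf8(overlong_nul))   # False
print(hangul.encode("utf-8").hex())  # ec8280
```

So whether a server accepts "Modified UTF-8" depends on which of the two
actually goes over the wire.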
Given that this works, the process would look like this:
- Parse all input data looking for bytes with hex value 0x00. If they
appear in the string, they are the NUL byte.
- Replace that byte with the two bytes 0xC0 0x80.
- Reverse the operation when reading.
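The steps above could be sketched roughly like this, on the driver or
application side. This is only an illustration in Python, not anything
Postgres or a driver does today; the names to_modified_utf8 and
from_modified_utf8 are made up, and it only handles the NUL part of
Modified UTF-8 (the Java variant also encodes supplementary characters
differently, which is ignored here):

```python
NUL = b"\x00"
# Overlong two-byte form of U+0000 used by Modified UTF-8.
MODIFIED_NUL = b"\xc0\x80"


def to_modified_utf8(text: str) -> bytes:
    """Encode as UTF-8, then replace every 0x00 byte with 0xC0 0x80.

    In valid UTF-8 a 0x00 byte can only be U+0000 itself, so a plain
    byte-level replace is safe.
    """
    return text.encode("utf-8").replace(NUL, MODIFIED_NUL)


def from_modified_utf8(data: bytes) -> str:
    """Reverse the operation when reading: 0xC0 0x80 becomes 0x00 again.

    The byte 0xC0 never occurs in valid standard UTF-8, so the pair
    cannot be confused with part of another character.
    """
    return data.replace(MODIFIED_NUL, NUL).decode("utf-8")


stored = to_modified_utf8("a\x00b")
print(stored)                      # b'a\xc0\x80b'
print(from_modified_utf8(stored))  # 'a\x00b' (printed with the raw NUL)
```

Each string gets scanned once in each direction, and the encoded form
grows by one byte per NUL, which is where the cost comes from.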
This is OK, but of course it comes with a performance hit (searching
for 0x00 and then growing the byte[] or whatever data structure to
account for the extra byte). A little bit of a PITA, but I guess better
than fixing it all :)
Álvaro
--
Álvaro Hernández Tortosa
-----------
8Kdata
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers