Re: [Firebird-devel] String escapes for codepoints

Kjell Rilbe Wed, 25 Sep 2019 01:12:27 -0700

On 2019-09-24 at 17:03, Dimitry Sibiryakov wrote:
> 24.09.2019 16:27, Kjell Rilbe wrote:
>> The built-in function ASCII_CHAR(n) seems to only accept integers 0..255
>> and not have any character set support whatsoever.
>
>   ASCII (which this function has in its name) defines only 128 characters.


Yes, obviously.


>> As a workaround, do string literals include some escape syntax to insert
>> an arbitrary code point, similar to for example in C#?
>> For example:
>> '|\u0066|' = 'f'
>>
>> Or are such escape mechanisms in the plans?
>
>   Yes. Read README.hex_literals.txt in docs.
>

As far as I can see, that document concerns 1) the ability to specify 
integer values using hex notation, and 2) the ability to specify an 
arbitrary sequence of bytes as a string of character set OCTETS.

While the latter would allow you to "manually" encode a Unicode 
character in, for example, UTF-8, it's not very practical. It would be a 
lot more useful to be able to specify the character's codepoint 
inside a string literal, and have that codepoint automatically encoded 
into the string using that string's character set and encoding.

For example, the capital letter Ö with Unicode codepoint U+00D6 would be 
written as, for example, '\u00d6' inside a UTF-8 string literal, and 
encoded as the sequence 0xC3 0x96. If '\u00d6' were written in a 
WIN1252 string literal, it would be encoded as a single 0xD6. If it were 
written inside an ISO8859_7 string literal, where that code point doesn't 
exist, it should throw a (transliteration) error.
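
To make the expected byte sequences concrete, here is a minimal Python 
sketch (not Firebird code). It assumes that Python's "utf-8", "cp1252" 
and "iso8859_7" codecs correspond to the Firebird character sets UTF8, 
WIN1252 and ISO8859_7. The helper name encode_or_none is made up for 
this illustration.

```python
# U+00D6 is LATIN CAPITAL LETTER O WITH DIAERESIS.
char = chr(0x00D6)


def encode_or_none(ch: str, codec: str):
    """Return the encoded bytes, or None if the codec lacks the character."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        # This is the case that should surface as a transliteration error.
        return None


utf8_bytes = encode_or_none(char, "utf-8")       # two bytes: C3 96
win1252_bytes = encode_or_none(char, "cp1252")   # one byte: D6
greek_bytes = encode_or_none(char, "iso8859_7")  # None: not in the Greek set

for name, data in [("UTF8", utf8_bytes),
                   ("WIN1252", win1252_bytes),
                   ("ISO8859_7", greek_bytes)]:
    print(name, data.hex(" ").upper() if data is not None else "no mapping")
```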

The suggested <binary string literal> could be used to write characters 
using the literal's encoding directly. E.g. for a UTF-8 literal, the 
character Ö could be written as '\xC3\x96', and in a WIN1252 literal it 
could be written as '\xD6'.

Since these kinds of escapes would be a breaking change to how string 
literals are parsed, a solution would have to be found to determine 
whether a specific string literal is to be parsed with these kinds of 
escapes or not. Perhaps a prefix that could be combined with any 
character set prefix?
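
As a rough illustration of such opt-in parsing, the Python sketch below 
treats backslashes as ordinary characters unless an "escapes" flag is 
set. The flag stands in for the hypothetical prefix. The function name 
literal_bytes and the exact escape forms (\uXXXX for a codepoint, \xHH 
for a raw byte) are assumptions for this example, not existing Firebird 
syntax.

```python
import re

# Assumed escape forms: \uXXXX (Unicode codepoint) and \xHH (raw byte).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x([0-9A-Fa-f]{2})")


def literal_bytes(text: str, codec: str, escapes: bool = False) -> bytes:
    """Return the bytes a string literal would occupy in the given codec."""
    if not escapes:
        # Without the opt-in, a backslash is just a backslash.
        return text.encode(codec)
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Plain text between escapes is encoded in the literal's codec.
        out += text[pos:match.start()].encode(codec)
        if match.group(1) is not None:
            # \uXXXX: encode the codepoint in the literal's character set.
            out += chr(int(match.group(1), 16)).encode(codec)
        else:
            # \xHH: copy the byte through unchanged.
            out.append(int(match.group(2), 16))
        pos = match.end()
    out += text[pos:].encode(codec)
    return bytes(out)
```

With the flag off, r'\u00d6' stays six literal characters, so existing 
statements keep their meaning. With it on, the same text becomes the 
bytes C3 96 in UTF-8 or the single byte D6 in WIN1252.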

Another approach, that might suffice, would be to add a function that 
would take the codepoint and a character set and return that codepoint 
encoded in that character set. For example:

UNICODE_CHAR(0xD6 as UTF8) would return a string in UTF8 character set 
containing bytes 0xC3 0x96.
UNICODE_CHAR(0xD6 as WIN1252) would return a string in WIN1252 character 
set containing byte 0xD6.
UNICODE_CHAR(0xD6 as ISO8859_7) would throw a transliteration error.
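
The semantics I have in mind for such a function can be sketched in 
Python. This is only a model of the proposal: UNICODE_CHAR does not 
exist in Firebird, and the CODECS table and TransliterationError class 
below are assumptions made for the example.

```python
# Assumed mapping from Firebird character set names to Python codecs.
CODECS = {
    "UTF8": "utf-8",
    "WIN1252": "cp1252",
    "ISO8859_7": "iso8859_7",
}


class TransliterationError(ValueError):
    """Raised when a codepoint has no representation in the target set."""


def unicode_char(codepoint: int, charset: str) -> bytes:
    """Model of the proposed UNICODE_CHAR(<codepoint> as <charset>)."""
    codec = CODECS[charset.upper()]
    try:
        return chr(codepoint).encode(codec)
    except UnicodeEncodeError as exc:
        raise TransliterationError(
            f"U+{codepoint:04X} cannot be represented in {charset}"
        ) from exc
```

Calling unicode_char(0xD6, "UTF8") gives the two bytes C3 96, 
unicode_char(0xD6, "WIN1252") gives the single byte D6, and 
unicode_char(0xD6, "ISO8859_7") raises TransliterationError.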

cast(x'C396' as varchar(10) character set UTF8) would return a UTF8 
string 'Ö', so the suggested <binary string literal> does solve the case 
where you want to write the byte sequence for the specific 
character set that you're using. But it doesn't help if you want to 
specify the Unicode codepoint.
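
The limitation is easy to see by decoding raw bytes in Python, again 
assuming that the "utf-8", "cp1252" and "iso8859_7" codecs match the 
corresponding Firebird character sets. The bytes only mean "Ö" when 
they are read in the character set they were written for.

```python
# Bytes written by hand for one specific character set.
utf8_raw = bytes.fromhex("C396")
win1252_raw = bytes.fromhex("D6")

# Read in the intended character set, both yield U+00D6.
from_utf8 = utf8_raw.decode("utf-8")
from_win1252 = win1252_raw.decode("cp1252")

# Read in another character set, the same byte is a different character:
# 0xD6 in ISO 8859-7 is GREEK CAPITAL LETTER PHI (U+03A6).
from_greek = win1252_raw.decode("iso8859_7")

for label, text in [("C3 96 as UTF8", from_utf8),
                    ("D6 as WIN1252", from_win1252),
                    ("D6 as ISO8859_7", from_greek)]:
    print(f"{label}: U+{ord(text):04X}")
```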

Regards,
Kjell

Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel