On 14-08-2020 14:21, Dimitry Sibiryakov wrote:
14.08.2020 13:56, Mark Rotteveel wrote:
b) Otherwise, there shall be no <separator> between the <introducer> and the <character set specification>, and the set of characters contained in the <character string literal> shall be wholly contained in the character set specified by the <character set specification>.

a) If the <character string literal> specifies a <character set specification>, then the character set specified by that <character set specification>.

  As I read it, "character string literal" refers to the result of the whole construction "introducer + character representation" and thus describes the final form of the string.
   At the same time <character representation> is described as

<character representation>    ::=   <nonquote character> | <quote symbol>
<nonquote character>    ::=   !! See the Syntax Rules.

   What are syntax rules for "nonquote character"?

The relevant syntax rule is:

"""
15) A <nonquote character> is one of:
a) Any character of the source language character set other than a <quote>.
b) Any character other than a <quote> in the character set identified by the <character set specification> or implied by “N”.
"""

Which also confirms that the current Firebird behaviour is correct.

In support, rule 14 says:

"""
14) Each <character representation> is a character of the source language character set. The value of a <character string literal>, viewed as a string in the source language character set, shall be equivalent to a character string of the implicit or explicit character set of the <character string literal> or <national character string literal>.
"""

Which I read to mean that if you want to represent the string 'ж' in WIN1251 while using connection character set WIN1252, you need to use _win1251 'æ' ('ж' is 0xE6 in WIN1251, 'æ' is 0xE6 in WIN1252).

So, in a way this behaves as a cast WIN1252 -> OCTETS -> WIN1251.
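
To make that byte-level reading concrete, here is a rough Python sketch. It only models the reinterpretation of bytes, not Firebird's parser. The function name reinterpret is made up for illustration, and Python's cp1252 and cp1251 codecs are assumed to match Firebird's WIN1252 and WIN1251.

```python
# Hypothetical model of the "cast" WIN1252 -> OCTETS -> WIN1251.
# Assumption: Python's cp1252/cp1251 codecs stand in for WIN1252/WIN1251.

def reinterpret(text: str, connection_charset: str, introducer_charset: str) -> str:
    """Encode text in the connection charset, then read the same bytes
    in the charset named by the introducer."""
    raw = text.encode(connection_charset)   # WIN1252 -> OCTETS
    return raw.decode(introducer_charset)   # OCTETS -> WIN1251


if __name__ == "__main__":
    # The client types 'æ'; the byte on the wire is 0xE6,
    # which WIN1251 reads as 'ж'.
    print("æ".encode("cp1252").hex())
    print(reinterpret("æ", "cp1252", "cp1251"))
```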

If you're using connection character set UTF8, it is impossible to do this because a lone 0xE6 byte is not a valid UTF-8 byte sequence.
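
The UTF-8 point can be checked with a few lines of Python. This is only an illustration of the encoding itself (the helper name is_valid_utf8 is invented here), not of anything Firebird does internally.

```python
# Illustration only: a single 0xE6 byte cannot stand alone in UTF-8,
# because 0xE6 is the lead byte of a three-byte sequence.

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8(b"\xe6"))          # the byte needed for WIN1251 'ж'
    print("æ".encode("utf-8").hex())       # what a UTF8 client sends for 'æ'
```

A UTF8 client that types 'æ' sends the two bytes 0xC3 0xA6, so there is no character it can type that produces the single byte 0xE6.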

As I read it, the current behaviour of Firebird is correct; it is just damn awkward to achieve.

  Taking into account the uncertainty of the standard's text, cast-like usage may also fit. In (the most widespread, AFAIU) cases like "_win1251 0x'e0e1e2'" there is actually no difference between them, because the final result is a string in character set OCTETS being cast to WIN1251. The difference can only be seen with "_win1251 'абв'", whose result is barely predictable.

Things like _win1251 x'....' are not defined by the SQL standard for binary string literals; that is a non-standard extension of Firebird.
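
For comparison, a small Python sketch of the two cases being contrasted. The helper hex_literal_as is hypothetical and only models "take these bytes and label them with a character set"; cp1251 is assumed to match WIN1251.

```python
# Hypothetical model: a hex literal fixes the bytes up front, so the
# connection charset plays no role. A quoted literal's bytes depend on
# how the client encoded the text.

def hex_literal_as(hex_digits: str, charset: str) -> str:
    """Interpret the given hex digits as bytes in charset."""
    return bytes.fromhex(hex_digits).decode(charset)


if __name__ == "__main__":
    print(hex_literal_as("e0e1e2", "cp1251"))
    # The same text 'абв' arrives as different bytes per connection charset.
    print("абв".encode("cp1251").hex())
    print("абв".encode("utf-8").hex())
```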

  As an application developer I would prefer <character representation> to be always treated as having connection character set.

That is simple to achieve: **don't use introducers**. String literals without a character set specification are by definition in the connection character set. I guess this is not what you actually meant.

However, if you want to use introducers as a cast, then Firebird should implement Unicode literals. Unicode literals define strings in the connection character set (unless an introducer is used), with the added benefit of allowing you to use Unicode escapes.

So, for example, when connecting with WIN1252, the string U&'ABC' has character set WIN1252, and Unicode escapes can only define characters that exist in WIN1252 (using their Unicode code points). So U&'AB\20ac' ('AB€') is valid, but U&'AB\0436' is not, as that character does not exist in WIN1252. However, _win1251 U&'AB\0436' is valid ('ABж'), and with connection character set WIN1251 (or UTF8), the literal U&'AB\0436' is also valid.
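
The validity rule for those examples can be modelled in Python as follows. This is a sketch under assumptions, not a description of any existing Firebird feature: the names unicode_literal and is_valid are invented, only the four-hex-digit escape form is handled, and Python's cp1252/cp1251 codecs stand in for WIN1252/WIN1251.

```python
import re

# Matches a backslash followed by exactly four hex digits (the only
# escape form this sketch handles).
_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{4})")


def unicode_literal(body: str, charset: str) -> bytes:
    """Resolve four-hex-digit escapes in body, then encode in charset.

    Raises UnicodeEncodeError if a character does not exist in charset.
    """
    text = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
    return text.encode(charset)


def is_valid(body: str, charset: str) -> bool:
    """Return True if every character of the literal exists in charset."""
    try:
        unicode_literal(body, charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(is_valid(r"AB\20ac", "cp1252"))   # euro sign, connection WIN1252
    print(is_valid(r"AB\0436", "cp1252"))   # Cyrillic zhe, connection WIN1252
    print(is_valid(r"AB\0436", "cp1251"))   # with _win1251 or connection WIN1251
    print(is_valid(r"AB\0436", "utf-8"))    # connection UTF8
```

Here the charset argument plays the role of either the connection character set or the introducer, whichever applies to the literal.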

Mark
--
Mark Rotteveel


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
