On 14-08-2020 14:21, Dimitry Sibiryakov wrote:
14.08.2020 13:56, Mark Rotteveel wrote:
b) Otherwise, there shall be no <separator> between the <introducer>
and the <character set specification>, and the set of characters
contained in the <character string literal> shall be wholly contained
in the character set specified by the <character set specification>.
a) If the <character string literal> specifies a <character set
specification>, then the character set specified by that <character
set specification>.
As I read it, "character string literal" refers to the result of the
whole construction "introducer + character representation" and thus
describes the final form of the string.
At the same time <character representation> is described as
<character representation> ::= <nonquote character> | <quote symbol>
<nonquote character> ::= !! See the Syntax Rules.
What are the syntax rules for "nonquote character"?
The relevant syntax rule is:
"""
15) A <nonquote character> is one of:
a) Any character of the source language character set other than a <quote>.
b) Any character other than a <quote> in the character set identified by
the <character set specification> or implied by “N”.
"""
Which also confirms that the current Firebird behaviour is correct.
In support, rule 14 says:
"""
14) Each <character representation> is a character of the source
language character set. The value of a <character string literal>,
viewed as a string in the source language character set, shall be
equivalent to a character string of the implicit or explicit character
set of the <character string literal> or <national character string
literal>.
"""
Which I read to mean that if you want to represent the string 'ж' in
WIN1251 while using connection character set WIN1252, you need to
use _win1251 'æ' ('ж' is 0xE6 in WIN1251, 'æ' is 0xE6 in WIN1252).
So, in a way, this behaves as a cast WIN1252 -> OCTETS -> WIN1251.
If you're using connection character set UTF8, it is impossible to do
this, because a lone 0xE6 byte is not a valid UTF-8 byte sequence.
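
The byte-level reasoning above can be checked outside Firebird. What
follows is a minimal Python sketch, assuming Python's cp1251, cp1252 and
utf-8 codecs are acceptable stand-ins for Firebird's WIN1251, WIN1252 and
UTF8. The helper name reinterpret is hypothetical; it only models the
cast described above and is not how Firebird implements it.

```python
def reinterpret(text: str, connection_charset: str, introducer_charset: str) -> str:
    """Model the 'cast' CONNECTION -> OCTETS -> INTRODUCER (hypothetical helper)."""
    octets = text.encode(connection_charset)   # bytes as the client sends them
    return octets.decode(introducer_charset)   # same bytes, relabelled

# With connection character set WIN1252, _win1251 'æ' yields 'ж'.
win1252_case = reinterpret("æ", "cp1252", "cp1251")

# A lone 0xE6 byte cannot be decoded as UTF-8, so the same trick is
# not available when the connection character set is UTF8.
try:
    b"\xe6".decode("utf-8")
    lone_e6_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_e6_is_valid_utf8 = False
```

Both characters map to the single byte 0xE6 in their respective
single-byte character sets, which is why the relabelling works there and
nowhere else.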
As I read it, the current behaviour of Firebird is correct; it is just
damn awkward to achieve.
Taking into account the uncertainty of the standard's text, cast-like
usage may also fit. In cases like "_win1251 0x'e0e1e2'" (the most
widespread, AFAIU) there is actually no difference between them, because
the final result is a string in character set OCTETS being cast to
WIN1251. The difference can only be seen with "_win1251 'абв'", whose
result is barely predictable.
Things like _win1251 x'....' are not defined by the SQL standard for
binary string literals; that is a non-standard extension of Firebird.
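
To illustrate why the hex form is predictable while the plain quoted
form is not, here is a small Python sketch. It assumes Python's cp1251
and utf-8 codecs as stand-ins for WIN1251 and UTF8, and the function
introducer_result is a hypothetical model of the behaviour discussed
here, not Firebird code.

```python
# Hex form: the bytes are spelled out explicitly, so the connection
# character set plays no role in the result.
from_hex = bytes.fromhex("e0e1e2").decode("cp1251")

def introducer_result(text: str, connection_charset: str) -> str:
    """Hypothetical model of _win1251 '<text>' under a given connection charset."""
    octets = text.encode(connection_charset)
    # errors="replace" keeps the sketch from raising on undefined bytes.
    return octets.decode("cp1251", errors="replace")

same_charset = introducer_result("абв", "cp1251")  # bytes already WIN1251
via_utf8 = introducer_result("абв", "utf-8")       # UTF-8 bytes misread as WIN1251
```

Under this model, the quoted form only gives 'абв' when the connection
character set happens to produce the same bytes as WIN1251; with UTF8 the
six UTF-8 bytes are read as six unrelated WIN1251 characters.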
As an application developer I would prefer <character representation>
to be always treated as having connection character set.
That is simple to achieve: **don't use introducers**. String literals
without a character set specification are by definition in the
connection character set. I guess this is not what you actually meant.
However, if you want to use introducers as a cast, then Firebird should
implement Unicode literals. Unicode literals define strings in the
connection character set (unless an introducer is used), with the added
benefit of allowing you to use Unicode escapes.
So, for example, when connecting with WIN1252, the string U&'ABC' has
character set WIN1252, and Unicode escapes can only define characters
that exist in WIN1252 (using their Unicode code points). So
U&'AB\20ac' ('AB€') is valid, but U&'AB\0436' is not, as that character
does not exist in WIN1252. However, _win1251 U&'AB\0436' is valid
('ABж'), and with connection character set WIN1251 (or UTF8) the plain
literal U&'AB\0436' is also valid.
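
The validity rule for these examples can be sketched in Python. This is
only an illustration under stated assumptions: the function name
resolve_unicode_literal is hypothetical, Python's cp1252, cp1251 and
utf-8 codecs stand in for WIN1252, WIN1251 and UTF8, and only the \XXXX,
\+XXXXXX and \\ escape forms with the default escape character are
handled.

```python
import re

# \\ (escaped backslash), \+XXXXXX (6 hex digits) or \XXXX (4 hex digits)
_ESCAPE = re.compile(r"\\(?:(\\)|\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4}))")

def resolve_unicode_literal(body: str, charset: str) -> bytes:
    """Resolve escapes in the body of U&'...' and encode to the target charset.

    Hypothetical helper: raises UnicodeEncodeError when a character does
    not exist in the target character set, i.e. the literal is invalid.
    """
    def replace(match: re.Match) -> str:
        if match.group(1):
            return "\\"
        return chr(int(match.group(2) or match.group(3), 16))

    return _ESCAPE.sub(replace, body).encode(charset)

euro_in_win1252 = resolve_unicode_literal(r"AB\20ac", "cp1252")  # 'AB€'
zhe_in_win1251 = resolve_unicode_literal(r"AB\0436", "cp1251")   # 'ABж'
zhe_in_utf8 = resolve_unicode_literal(r"AB\0436", "utf-8")

try:
    resolve_unicode_literal(r"AB\0436", "cp1252")
    zhe_valid_in_win1252 = True
except UnicodeEncodeError:
    zhe_valid_in_win1252 = False  # 'ж' does not exist in WIN1252
```

Here the charset argument plays the role of either the connection
character set or the one named by an introducer, whichever applies to
the literal.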
Mark
--
Mark Rotteveel
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel