Re: [Firebird-devel] Charset introducers

Mark Rotteveel Fri, 14 Aug 2020 06:37:48 -0700

On 14-08-2020 14:21, Dimitry Sibiryakov wrote:

14.08.2020 13:56, Mark Rotteveel wrote:
b) Otherwise, there shall be no <separator> between the <introducer> and the <character set specification>, and the set of characters contained in the <character string literal> shall be wholly contained in the character set specified by the <character set specification>.
a) If the <character string literal> specifies a <character set specification>, then the character set specified by that <character set specification>.
As I read it, "character string literal" refers to the result of the whole construction "introducer + character representation" and thus describes the final form of the string.
   At the same time <character representation> is described as

<character representation>    ::=   <nonquote character> | <quote symbol>
<nonquote character>    ::=   !! See the Syntax Rules.

   What are syntax rules for "nonquote character"?


The relevant syntax rule is:

"""
15) A <nonquote character> is one of:
a) Any character of the source language character set other than a <quote>.

b) Any character other than a <quote> in the character set identified by the <character set specification> or implied by “N”.

"""

Which also confirms that the current Firebird behaviour is correct.

In support, rule 14 says:

"""

14) Each <character representation> is a character of the source language character set. The value of a <character string literal>, viewed as a string in the source language character set, shall be equivalent to a character string of the implicit or explicit character set of the <character string literal> or <national character string literal>.

"""

Which I read to mean that if you want to represent the string 'ж' in WIN1251 while using connection character set WIN1252, then you need to use _win1251 'æ' ('ж' is 0xE6 in WIN1251, 'æ' is 0xE6 in WIN1252).


So, in a way this behaves as a cast WIN1252 -> OCTETS -> WIN1251.
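To make the byte-level view concrete, here is a small Python sketch of that cast chain. This is not Firebird code: Python's cp1252 and cp1251 codecs stand in for WIN1252 and WIN1251, and the function name apply_introducer is made up for illustration.

```python
def apply_introducer(literal_text: str, connection_charset: str,
                     introducer_charset: str) -> str:
    """Hypothetical model of an introducer acting as a cast via OCTETS."""
    # The client encodes the literal in the connection character set.
    raw = literal_text.encode(connection_charset)
    # The introducer relabels those same bytes as the target character set.
    return raw.decode(introducer_charset)


# Connection charset WIN1252, literal _win1251 'æ'
result = apply_introducer("æ", "cp1252", "cp1251")
print(result)  # ж
```

The bytes never change here; only the character set label attached to them does.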

If you're using connection character set UTF8, it is impossible to do this because a lone 0xE6 is not a valid UTF-8 byte sequence.
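The UTF8 case can be checked the same way. The sketch below again uses Python codecs as stand-ins for the Firebird character sets, and is_valid_utf8 is a helper invented for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The single byte that WIN1251 uses for 'ж' is not valid UTF-8 by itself.
print(is_valid_utf8(b"\xe6"))  # False

# Typing 'æ' on a UTF8 connection sends two bytes, not 0xE6.
ae_utf8 = "æ".encode("utf-8")
print(ae_utf8.hex())  # c3a6

# Relabelling those two bytes as WIN1251 does not produce 'ж'.
reinterpreted = ae_utf8.decode("cp1251")
print(reinterpreted == "ж")  # False
```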

As I read it, the current behaviour of Firebird is correct; it is just damn awkward to achieve.
Taking into account the uncertainty of the standard's text, cast-like usage may also fit. In (the most widespread, AFAIU) cases like "_win1251 0x'e0e1e2'" there is actually no difference between them, because the final result is a string in character set OCTETS being cast to WIN1251. The difference can only be seen with "_win1251 'абв'", whose result is barely predictable.
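A rough Python illustration of that difference, assuming the introducer simply relabels the bytes it receives (Python's cp1251 and utf-8 codecs stand in for WIN1251 and UTF8; the variable names are mine):

```python
# Hex form: the bytes are spelled out, so the connection charset is irrelevant.
hex_bytes = bytes.fromhex("e0e1e2")
from_hex = hex_bytes.decode("cp1251")
print(from_hex)  # абв

# Text form: the bytes depend on how the client encoded 'абв'.
results = {}
for connection_charset in ("cp1251", "utf-8"):
    raw = "абв".encode(connection_charset)
    # Relabel whatever bytes arrived as WIN1251.
    results[connection_charset] = raw.decode("cp1251")

print(results["cp1251"] == "абв")  # True
print(results["utf-8"] == "абв")   # False
```

With a WIN1252 connection the text form cannot even be sent, since 'абв' has no WIN1252 encoding.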

Things like _win1251 x'....' are not defined by the SQL standard for binary string literals; that is a non-standard extension of Firebird.

As an application developer I would prefer <character representation> to be always treated as having connection character set.

That is simple to achieve: **don't use introducers**. String literals without a character set specification are by definition in the connection character set. I guess this is not what you actually meant.

However, if you want to use introducers as a cast, then Firebird should implement Unicode literals. Unicode literals define strings in the connection character set (unless an introducer is used), with the added benefit of allowing you to use Unicode escapes.

So for example, when connecting with WIN1252, the string U&'ABC' has character set WIN1252, and when using Unicode escapes, you can only define characters from WIN1252 (using their Unicode codepoints). So, U&'AB\20ac' ('AB€') is valid, but U&'AB\0436' is not, as that character does not exist in WIN1252. However, _win1251 U&'AB\0436' is valid ('ABж'), and with connection character set WIN1251 (or UTF8), the literal U&'AB\0436' is also valid.
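The validity rule in those examples amounts to "every codepoint must be encodable in the target character set". A minimal Python sketch of that check follows. It models the proposed semantics only, not anything Firebird implements, and unicode_literal_is_valid is a hypothetical name. The target is the introducer's character set if one is given, otherwise the connection character set.

```python
def unicode_literal_is_valid(text: str, target_charset: str) -> bool:
    """Hypothetical check: can every character be encoded in the target charset?"""
    try:
        text.encode(target_charset)
        return True
    except UnicodeEncodeError:
        return False


# U&'AB\20ac' on a WIN1252 connection
print(unicode_literal_is_valid("AB\u20ac", "cp1252"))  # True
# U&'AB\0436' on a WIN1252 connection
print(unicode_literal_is_valid("AB\u0436", "cp1252"))  # False
# _win1251 U&'AB\0436', or U&'AB\0436' on a WIN1251 connection
print(unicode_literal_is_valid("AB\u0436", "cp1251"))  # True
# U&'AB\0436' on a UTF8 connection
print(unicode_literal_is_valid("AB\u0436", "utf-8"))   # True
```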


Mark
--
Mark Rotteveel


Firebird-Devel mailing list, web interface at 
https://lists.sourceforge.net/lists/listinfo/firebird-devel
