On Wed, Dec 2, 2020 at 11:37, Pavel Stehule <pavel.steh...@gmail.com> wrote:

>
>
> On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:
>
>> On 2020-11-30 22:15, Pavel Stehule wrote:
>> > I would like some supporting documentation on this. So far we only have
>> > one stackoverflow question, and then this implementation, and they are
>> > not even the same format. My worry is that if there is not precise
>> > specification, then people are going to want to add things in the
>> > future, and there will be no way to analyze such requests in a
>> > principled way.
>> >
>> >
>> > I checked this and it is "prefix backslash-u hex" used by Java,
>> > JavaScript or RTF -
>> > https://billposer.org/Software/ListOfRepresentations.html
>>
>> Heh. The fact that there is a table of two dozen possible
>> representations kind of proves my point that we should be deliberate in
>> picking one.
>>
>> I do see Oracle unistr() on that list, which appears to be very similar
>> to what you are trying to do here. Maybe look into aligning with that.
>>
>
> unistr is a primitive form of proposed function. But it can be used as a
> base. The format is compatible with our "4.1.2.3. String Constants with
> Unicode Escapes".
>
> What do you think about the following proposal?
>
> 1. unistr(text) .. compatible with Postgres unicode escapes - it is
> enhanced against Oracle, because Oracle's unistr doesn't support 6 digits
> unicodes.
>
> 2. there can be optional parameter "prefix" with default "\". But with
> "\u" it can be compatible with Java or Python.
>
> What do you think about it?
>

I thought about it a little more, and the prefix specification does not make much sense (all the more so if we implement this functionality as a function named "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. It now supports the Oracle convention, the Java and Python conventions (\UXXXXXXXX for Python), and \+XXXXXX. These formats were already supported. The compatibility with Oracle is nice.

postgres=# select 'Arabic       : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' ) || '
Chinese      : ' || unistr( '\4E2D\6587' ) || '
English      : ' || unistr( 'English' ) || '
French       : ' || unistr( 'Fran\00E7ais' ) || '
German       : ' || unistr( 'Deutsch' ) || '
Greek        : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
Hebrew       : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' ) || '
Japanese     : ' || unistr( '\65E5\672C\8A9E' ) || '
Korean       : ' || unistr( '\D55C\AD6D\C5B4' ) || '
Portuguese   : ' || unistr( 'Portugu\00EAs' ) || '
Russian      : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' ) || '
Spanish      : ' || unistr( 'Espa\00F1ol' ) || '
Thai         : ' || unistr( '\0E44\0E17\0E22' ) as unicode_test_string;
┌──────────────────────────┐
│   unicode_test_string    │
╞══════════════════════════╡
│ Arabic       : العربية  ↵│
│ Chinese      : 中文     ↵│
│ English      : English  ↵│
│ French       : Français ↵│
│ German       : Deutsch  ↵│
│ Greek        : Ελληνικά ↵│
│ Hebrew       : עברית    ↵│
│ Japanese     : 日本語   ↵│
│ Korean       : 한국어   ↵│
│ Portuguese   : Português↵│
│ Russian      : Русский  ↵│
│ Spanish      : Español  ↵│
│ Thai         : ไทย       │
└──────────────────────────┘
(1 row)

postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│     unistr      │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)

New patch attached

Regards

Pavel

> Pavel
>

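The escape forms exercised in the session above can be summarized in a small standalone C sketch. This is only an illustration, not code from the attached patch: `all_xdigits` and `escape_len` are made-up helper names, and the sketch only recognizes the forms and reports their length; it does not convert digits or handle surrogate pairs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if the n bytes starting at s are all hexadecimal digits. */
int
all_xdigits(const char *s, size_t n)
{
	size_t		i;

	for (i = 0; i < n; i++)
		if (!isxdigit((unsigned char) s[i]))
			return 0;
	return 1;
}

/*
 * Hypothetical helper (not part of the patch): s points at a backslash and
 * len is the number of bytes available.  Returns the total length of the
 * escape, backslash included, or 0 if none of the forms matches.
 */
size_t
escape_len(const char *s, size_t len)
{
	assert(s != NULL);

	if (len < 2 || s[0] != '\\')
		return 0;
	if (s[1] == '\\')
		return 2;				/* \\ : literal backslash */
	if (len >= 5 && all_xdigits(s + 1, 4))
		return 5;				/* \XXXX : Oracle style */
	if (len >= 6 && s[1] == 'u' && all_xdigits(s + 2, 4))
		return 6;				/* \uXXXX : Java, Python */
	if (len >= 8 && s[1] == '+' && all_xdigits(s + 2, 6))
		return 8;				/* \+XXXXXX : six-digit form */
	if (len >= 10 && s[1] == 'U' && all_xdigits(s + 2, 8))
		return 10;				/* \UXXXXXXXX : Python */
	return 0;
}
```

With these definitions, `escape_len("\\u011B", 6)` yields 6 and `escape_len("\\xyz", 4)` yields 0. The order of the checks mirrors the order of the branches in the patch below.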
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index df29af6371..6ad8136523 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3553,6 +3553,34 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>unistr</primary>
+        </indexterm>
+        <function>unistr</function> ( <parameter>string</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Evaluate escaped unicode chars (4 or 6 digits) without prefix or
+        with prefix <literal>u</literal> (4 digits) or with prefix
+        <literal>U</literal> (8 digits) to chars or with prefix
+        <literal>+</literal> (6 digits).
+       </para>
+       <para>
+        <literal>unistr('\0441\043B\043E\043D')</literal>
+        <returnvalue>слон</returnvalue>
+       </para>
+       <para>
+        <literal>unistr('d\0061t\+000061')</literal>
+        <returnvalue>data</returnvalue>
+       </para>
+       <para>
+        <literal>unistr('d\u0061t\U00000061')</literal>
+        <returnvalue>data</returnvalue>
+       </para></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..cbddb61396 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -278,30 +278,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;					/* not reached */
-}
-
-/* is Unicode code point acceptable? */
-static void
-check_unicode_value(pg_wchar c)
-{
-	if (!is_valid_unicode_codepoint(c))
-		ereport(ERROR,
-				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("invalid Unicode escape value")));
-}
-
 /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
 static bool
 check_uescapechar(unsigned char escape)
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index d07cbafcee..b39dde12bd 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -125,3 +125,27 @@ scanner_isspace(char ch)
 		return true;
 	return false;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;					/* not reached */
+}
+
+/* is Unicode code point acceptable? */
+void
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value")));
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index ff9bf238f3..e16f0875d6 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -6290,3 +6290,202 @@ unicode_is_normalized(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(result);
 }
+
+/*
+ * First four chars should be hexnum digits
+ */
+static bool
+isxdigit_four(const char *instr)
+{
+	return isxdigit((unsigned char) instr[0]) &&
+		isxdigit((unsigned char) instr[1]) &&
+		isxdigit((unsigned char) instr[2]) &&
+		isxdigit((unsigned char) instr[3]);
+}
+
+/*
+ * Translate string with hexadecimal digits to number
+ */
+static long int
+hexval_four(const char *instr)
+{
+	return (hexval(instr[0]) << 12) +
+		(hexval(instr[1]) << 8) +
+		(hexval(instr[2]) << 4) +
+		hexval(instr[3]);
+}
+
+/*
+ * Replaces unicode escape sequences by unicode chars
+ */
+Datum
+unistr(PG_FUNCTION_ARGS)
+{
+	StringInfoData str;
+	text	   *input_text;
+	text	   *result;
+	pg_wchar	pair_first = 0;
+	char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
+	char	   *instr;
+	int			len;
+
+	/* when input string is NULL, then result is NULL too */
+	if (PG_ARGISNULL(0))
+		PG_RETURN_NULL();
+
+	input_text = PG_GETARG_TEXT_PP(0);
+	instr = VARDATA_ANY(input_text);
+	len = VARSIZE_ANY_EXHDR(input_text);
+
+	initStringInfo(&str);
+
+	while (len > 0)
+	{
+		if (instr[0] == '\\')
+		{
+			if (len >= 2 &&
+				instr[1] == '\\')
+			{
+				if (pair_first)
+					goto invalid_pair;
+				appendStringInfoChar(&str, '\\');
+				instr += 2;
+				len -= 2;
+			}
+			else if ((len >= 5 && isxdigit_four(&instr[1])) ||
+					 (len >= 6 && instr[1] == 'u' && isxdigit_four(&instr[2])))
+			{
+				pg_wchar	unicode;
+				int			offset = instr[1] == 'u' ? 2 : 1;
+
+				unicode = hexval_four(instr + offset);
+
+				check_unicode_value(unicode);
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					pg_unicode_to_server(unicode, (unsigned char *) cbuf);
+					appendStringInfoString(&str, cbuf);
+				}
+
+				instr += 4 + offset;
+				len -= 4 + offset;
+			}
+			else if (len >= 8 &&
+					 instr[1] == '+' &&
+					 isxdigit_four(&instr[2]) &&
+					 isxdigit((unsigned char) instr[6]) &&
+					 isxdigit((unsigned char) instr[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval_four(&instr[2]) << 8) +
+					(hexval(instr[6]) << 4) +
+					hexval(instr[7]);
+
+				check_unicode_value(unicode);
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					pg_unicode_to_server(unicode, (unsigned char *) cbuf);
+					appendStringInfoString(&str, cbuf);
+				}
+
+				instr += 8;
+				len -= 8;
+			}
+			else if (len >= 10 &&
+					 instr[1] == 'U' &&
+					 isxdigit_four(&instr[2]) &&
+					 isxdigit_four(&instr[6]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval_four(&instr[2]) << 16) + hexval_four(&instr[6]);
+
+				check_unicode_value(unicode);
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					pg_unicode_to_server(unicode, (unsigned char *) cbuf);
+					appendStringInfoString(&str, cbuf);
+				}
+
+				instr += 10;
+				len -= 10;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape"),
+						 errhint("Unicode escapes must be \\XXXX, \\+XXXXXX, \\uXXXX or \\UXXXXXXXX.")));
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			appendStringInfoChar(&str, *instr++);
+			len--;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	result = cstring_to_text_with_len(str.data, str.len);
+	pfree(str.data);
+
+	PG_RETURN_TEXT_P(result);
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair")));
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc2202b843..92149b9cc2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11010,4 +11010,7 @@
   proname => 'is_normalized', prorettype => 'bool',
   proargtypes => 'text text', prosrc => 'unicode_is_normalized' },
 
+{ oid => '9822', descr => 'unescape Unicode chars in strings',
+  proname => 'unistr', prorettype => 'text', proargtypes => 'text',
+  proisstrict => 't', prosrc => 'unistr' }
 ]
diff --git a/src/include/parser/scansup.h b/src/include/parser/scansup.h
index 5bc426660d..1481c1da01 100644
--- a/src/include/parser/scansup.h
+++ b/src/include/parser/scansup.h
@@ -14,6 +14,8 @@
 #ifndef SCANSUP_H
 #define SCANSUP_H
 
+#include "mb/pg_wchar.h"
+
 extern char *downcase_truncate_identifier(const char *ident, int len,
 										  bool warn);
 
@@ -24,4 +26,8 @@ extern void truncate_identifier(char *ident, int len, bool warn);
 
 extern bool scanner_isspace(char ch);
 
+extern unsigned int hexval(unsigned char c);
+
+extern void check_unicode_value(pg_wchar c);
+
 #endif							/* SCANSUP_H */
diff --git a/src/test/regress/expected/unicode.out b/src/test/regress/expected/unicode.out
index 2a1e903696..778ef6e696 100644
--- a/src/test/regress/expected/unicode.out
+++ b/src/test/regress/expected/unicode.out
@@ -79,3 +79,30 @@ ORDER BY num;
 
 SELECT is_normalized('abc', 'def');  -- run-time error
 ERROR:  invalid normalization form: def
+SELECT unistr('\0441\043B\043E\043D');
+ unistr 
+--------
+ слон
+(1 row)
+
+SELECT unistr('d\u0061t\U00000061');
+ unistr 
+--------
+ data
+(1 row)
+
+-- run-time error
+SELECT unistr('wrong: \db99');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \db99\0061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \+00db99\+000061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \+2FFFFF');
+ERROR:  invalid Unicode escape value
+SELECT unistr('wrong: \udb99\u0061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \U0000db99\U00000061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \U002FFFFF');
+ERROR:  invalid Unicode escape value
diff --git a/src/test/regress/sql/unicode.sql b/src/test/regress/sql/unicode.sql
index ccfc6fa77a..546e85f8cd 100644
--- a/src/test/regress/sql/unicode.sql
+++ b/src/test/regress/sql/unicode.sql
@@ -30,3 +30,15 @@ FROM
 ORDER BY num;
 
 SELECT is_normalized('abc', 'def');  -- run-time error
+
+SELECT unistr('\0441\043B\043E\043D');
+SELECT unistr('d\u0061t\U00000061');
+
+-- run-time error
+SELECT unistr('wrong: \db99');
+SELECT unistr('wrong: \db99\0061');
+SELECT unistr('wrong: \+00db99\+000061');
+SELECT unistr('wrong: \+2FFFFF');
+SELECT unistr('wrong: \udb99\u0061');
+SELECT unistr('wrong: \U0000db99\U00000061');
+SELECT unistr('wrong: \U002FFFFF');
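
The run-time errors in the regression tests follow from standard UTF-16 arithmetic: a first (high) surrogate must be followed by a second (low) surrogate, and no code point may exceed U+10FFFF. A minimal standalone sketch of that arithmetic is shown here. The function names are illustrative only and are not the PostgreSQL helpers used in the patch, and `in_unicode_range` checks just the upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* First (high) surrogates occupy U+D800..U+DBFF. */
int
is_high_surrogate(uint32_t c)
{
	return c >= 0xD800 && c <= 0xDBFF;
}

/* Second (low) surrogates occupy U+DC00..U+DFFF. */
int
is_low_surrogate(uint32_t c)
{
	return c >= 0xDC00 && c <= 0xDFFF;
}

/* Unicode code points end at U+10FFFF (upper bound only in this sketch). */
int
in_unicode_range(uint32_t c)
{
	return c <= 0x10FFFF;
}

/* Combine a valid surrogate pair into one code point above U+FFFF. */
uint32_t
combine_surrogates(uint32_t hi, uint32_t lo)
{
	assert(is_high_surrogate(hi) && is_low_surrogate(lo));
	return ((hi & 0x3FF) << 10) + 0x10000 + (lo & 0x3FF);
}
```

Under these definitions, 0xDB99 is a high surrogate and 0x0061 is not a low surrogate, which is why `'\db99\0061'` is rejected as an invalid pair, while 0x2FFFFF is out of range, matching the "invalid Unicode escape value" cases. A well-formed pair such as 0xD83D followed by 0xDE00 combines to 0x1F600.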