Re: proposal: unescape_text function

Pavel Stehule Wed, 02 Dec 2020 10:31:39 -0800

On Wed, Dec 2, 2020 at 11:37, Pavel Stehule <pavel.steh...@gmail.com>
wrote:


>
>
> On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut <
> peter.eisentr...@enterprisedb.com> wrote:
>
>> On 2020-11-30 22:15, Pavel Stehule wrote:
>> >     I would like some supporting documentation on this.  So far we only
>> >     have one stackoverflow question, and then this implementation, and
>> >     they are not even the same format.  My worry is that if there is not
>> >     precise specification, then people are going to want to add things in
>> >     the future, and there will be no way to analyze such requests in a
>> >     principled way.
>> >
>> >
>> > I checked this and it is "prefix backslash-u hex" used by Java,
>> > JavaScript  or RTF -
>> > https://billposer.org/Software/ListOfRepresentations.html
>>
>> Heh.  The fact that there is a table of two dozen possible
>> representations kind of proves my point that we should be deliberate in
>> picking one.
>>
>> I do see Oracle unistr() on that list, which appears to be very similar
>> to what you are trying to do here.  Maybe look into aligning with that.
>>
>
> unistr is a more primitive form of the proposed function, but it can be
> used as a base. Its format is compatible with our section 4.1.2.3,
> "String Constants with Unicode Escapes".
>
> What do you think about the following proposal?
>
> 1. unistr(text): compatible with Postgres Unicode escapes. It is an
> enhancement over Oracle's unistr, which doesn't support 6-digit code
> points.
>
> 2. There can be an optional parameter "prefix" with default "\". With
> "\u" it can be compatible with Java or Python.
>
> What do you think about it?
>

I thought about it a little more, and the prefix specification does not
make much sense (all the more so if we implement this functionality as a
function named "unistr"). I removed the optional argument and renamed the
function to "unistr". The functionality is the same. It now supports the
Oracle convention, the Java and Python conventions (\UXXXXXXXX for Python),
and \+XXXXXX. These formats were already supported. The compatibility with
Oracle is nice.

postgres=# select
 'Arabic     : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' )      ||
'
  Chinese    : ' || unistr( '\4E2D\6587' )                               ||
'
  English    : ' || unistr( 'English' )                                  ||
'
  French     : ' || unistr( 'Fran\00E7ais' )                             ||
'
  German     : ' || unistr( 'Deutsch' )                                  ||
'
  Greek      : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) ||
'
  Hebrew     : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' )                ||
'
  Japanese   : ' || unistr( '\65E5\672C\8A9E' )                          ||
'
  Korean     : ' || unistr( '\D55C\AD6D\C5B4' )                          ||
'
  Portuguese : ' || unistr( 'Portugu\00EAs' )                            ||
'
  Russian    : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' )      ||
'
  Spanish    : ' || unistr( 'Espa\00F1ol' )                              ||
'
  Thai       : ' || unistr( '\0E44\0E17\0E22' )
  as unicode_test_string;
┌──────────────────────────┐
│   unicode_test_string    │
╞══════════════════════════╡
│ Arabic     : العربية    ↵│
│   Chinese    : 中文     ↵│
│   English    : English  ↵│
│   French     : Français ↵│
│   German     : Deutsch  ↵│
│   Greek      : Ελληνικά ↵│
│   Hebrew     : עברית    ↵│
│   Japanese   : 日本語   ↵│
│   Korean     : 한국어   ↵│
│   Portuguese : Português↵│
│   Russian    : Русский  ↵│
│   Spanish    : Español  ↵│
│   Thai       : ไทย       │
└──────────────────────────┘
(1 row)


postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─────────────────┐
│     unistr      │
╞═════════════════╡
│ Odpovědná osoba │
└─────────────────┘
(1 row)
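
For readers who want the four escape forms spelled out, here is a minimal
standalone sketch of how a single escape could be recognized. It is not the
code from the attached patch: decode_escape and its helpers are hypothetical
names used only for illustration, and the sketch assumes a NUL-terminated
input. It also skips the code point validity check and the surrogate-pair
handling that the patch performs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static unsigned int
hex_digit_value(unsigned char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return c - 'A' + 10;
}

/*
 * Read exactly ndigits hex digits from s into *cp.
 * Returns 1 on success, 0 if fewer than ndigits hex digits are present.
 */
static int
read_hex(const char *s, size_t avail, int ndigits, uint32_t *cp)
{
	uint32_t	v = 0;

	if (avail < (size_t) ndigits)
		return 0;
	for (int i = 0; i < ndigits; i++)
	{
		if (!isxdigit((unsigned char) s[i]))
			return 0;
		v = (v << 4) | hex_digit_value((unsigned char) s[i]);
	}
	*cp = v;
	return 1;
}

/*
 * Hypothetical helper: decode one escape that starts at s[0] == '\\'.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if s does not start with a recognized escape.
 * An escaped backslash ("\\\\" in C source) yields the backslash itself.
 */
size_t
decode_escape(const char *s, uint32_t *cp)
{
	size_t		len;

	assert(s != NULL && cp != NULL);
	len = strlen(s);
	if (len < 2 || s[0] != '\\')
		return 0;
	if (s[1] == '\\')
	{
		*cp = '\\';
		return 2;
	}
	if (s[1] == 'u')			/* \uXXXX */
		return read_hex(s + 2, len - 2, 4, cp) ? 6 : 0;
	if (s[1] == 'U')			/* \UXXXXXXXX */
		return read_hex(s + 2, len - 2, 8, cp) ? 10 : 0;
	if (s[1] == '+')			/* \+XXXXXX */
		return read_hex(s + 2, len - 2, 6, cp) ? 8 : 0;
	return read_hex(s + 1, len - 1, 4, cp) ? 5 : 0;	/* \XXXX */
}
```

A caller would walk the input, copy ordinary bytes through unchanged, and
call decode_escape whenever it sees a backslash. A return value of 0
corresponds to the "invalid Unicode escape" error case.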

New patch attached

Regards

Pavel






> Pavel
>

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index df29af6371..6ad8136523 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3553,6 +3553,34 @@ repeat('Pg', 4) <returnvalue>PgPgPgPg</returnvalue>
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>unistr</primary>
+        </indexterm>
+        <function>unistr</function> ( <parameter>string</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Evaluate escaped Unicode characters in the argument. Escapes may be
+        written as 4 hexadecimal digits without a prefix, with prefix
+        <literal>u</literal> (4 digits), with prefix <literal>U</literal>
+        (8 digits), or with prefix <literal>+</literal> (6 digits).
+       </para>
+       <para>
+        <literal>unistr('\0441\043B\043E\043D')</literal>
+        <returnvalue>слон</returnvalue>
+       </para>
+       <para>
+        <literal>unistr('d\0061t\+000061')</literal>
+        <returnvalue>data</returnvalue>
+       </para>
+       <para>
+        <literal>unistr('d\u0061t\U00000061')</literal>
+        <returnvalue>data</returnvalue>
+       </para></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..cbddb61396 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -278,30 +278,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;					/* not reached */
-}
-
-/* is Unicode code point acceptable? */
-static void
-check_unicode_value(pg_wchar c)
-{
-	if (!is_valid_unicode_codepoint(c))
-		ereport(ERROR,
-				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("invalid Unicode escape value")));
-}
-
 /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
 static bool
 check_uescapechar(unsigned char escape)
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index d07cbafcee..b39dde12bd 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -125,3 +125,27 @@ scanner_isspace(char ch)
 		return true;
 	return false;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;					/* not reached */
+}
+
+/* is Unicode code point acceptable? */
+void
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+				(errcode(ERRCODE_SYNTAX_ERROR),
+				 errmsg("invalid Unicode escape value")));
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index ff9bf238f3..e16f0875d6 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -6290,3 +6290,202 @@ unicode_is_normalized(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(result);
 }
+
+/*
+ * First four chars should be hexnum digits
+ */
+static bool
+isxdigit_four(const char *instr)
+{
+	return isxdigit((unsigned char)  instr[0]) &&
+			isxdigit((unsigned char) instr[1]) &&
+			isxdigit((unsigned char) instr[2]) &&
+			isxdigit((unsigned char) instr[3]);
+}
+
+/*
+ * Translate string with hexadecimal digits to number
+ */
+static long int
+hexval_four(const char *instr)
+{
+	return (hexval(instr[0]) << 12) +
+			(hexval(instr[1]) << 8) +
+			(hexval(instr[2]) << 4) +
+			 hexval(instr[3]);
+}
+
+/*
+ * Replaces unicode escape sequences by unicode chars
+ */
+Datum
+unistr(PG_FUNCTION_ARGS)
+{
+	StringInfoData		str;
+	text	   *input_text;
+	text	   *result;
+	pg_wchar	pair_first = 0;
+	char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
+	char	   *instr;
+	int			len;
+
+	/* when input string is NULL, then result is NULL too */
+	if (PG_ARGISNULL(0))
+		PG_RETURN_NULL();
+
+	input_text = PG_GETARG_TEXT_PP(0);
+	instr = VARDATA_ANY(input_text);
+	len = VARSIZE_ANY_EXHDR(input_text);
+
+	initStringInfo(&str);
+
+	while (len > 0)
+	{
+		if (instr[0] == '\\')
+		{
+			if (len >= 2 &&
+				instr[1] == '\\')
+			{
+				if (pair_first)
+					goto invalid_pair;
+				appendStringInfoChar(&str, '\\');
+				instr += 2;
+				len -= 2;
+			}
+			else if ((len >= 5 && isxdigit_four(&instr[1])) ||
+					 (len >= 6 && instr[1] == 'u' && isxdigit_four(&instr[2])))
+			{
+				pg_wchar	unicode;
+				int			offset = instr[1] == 'u' ? 2 : 1;
+
+				unicode = hexval_four(instr + offset);
+
+				check_unicode_value(unicode);
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					pg_unicode_to_server(unicode, (unsigned char *) cbuf);
+					appendStringInfoString(&str, cbuf);
+				}
+
+				instr += 4 + offset;
+				len -= 4 + offset;
+			}
+			else if (len >= 8 &&
+					 instr[1] == '+' &&
+					 isxdigit_four(&instr[2]) &&
+					 isxdigit((unsigned char) instr[6]) &&
+					 isxdigit((unsigned char) instr[7]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval_four(&instr[2]) << 8) +
+								(hexval(instr[6]) << 4) +
+								 hexval(instr[7]);
+
+				check_unicode_value(unicode);
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					pg_unicode_to_server(unicode, (unsigned char *) cbuf);
+					appendStringInfoString(&str, cbuf);
+				}
+
+				instr += 8;
+				len -= 8;
+			}
+			else if (len >= 10 &&
+					 instr[1] == 'U' &&
+					 isxdigit_four(&instr[2]) &&
+					 isxdigit_four(&instr[6]))
+			{
+				pg_wchar	unicode;
+
+				unicode = (hexval_four(&instr[2]) << 16) + hexval_four(&instr[6]);
+
+				check_unicode_value(unicode);
+
+				if (pair_first)
+				{
+					if (is_utf16_surrogate_second(unicode))
+					{
+						unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+						pair_first = 0;
+					}
+					else
+						goto invalid_pair;
+				}
+				else if (is_utf16_surrogate_second(unicode))
+					goto invalid_pair;
+
+				if (is_utf16_surrogate_first(unicode))
+					pair_first = unicode;
+				else
+				{
+					pg_unicode_to_server(unicode, (unsigned char *) cbuf);
+					appendStringInfoString(&str, cbuf);
+				}
+
+				instr += 10;
+				len -= 10;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("invalid Unicode escape"),
+						 errhint("Unicode escapes must be \\XXXX, \\+XXXXXX, \\uXXXX or \\UXXXXXXXX.")));
+		}
+		else
+		{
+			if (pair_first)
+				goto invalid_pair;
+
+			appendStringInfoChar(&str, *instr++);
+			len--;
+		}
+	}
+
+	/* unfinished surrogate pair? */
+	if (pair_first)
+		goto invalid_pair;
+
+	result = cstring_to_text_with_len(str.data, str.len);
+	pfree(str.data);
+
+	PG_RETURN_TEXT_P(result);
+
+invalid_pair:
+	ereport(ERROR,
+			(errcode(ERRCODE_SYNTAX_ERROR),
+			 errmsg("invalid Unicode surrogate pair")));
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc2202b843..92149b9cc2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11010,4 +11010,7 @@
   proname => 'is_normalized', prorettype => 'bool', proargtypes => 'text text',
   prosrc => 'unicode_is_normalized' },
 
+{ oid => '9822', descr => 'unescape Unicode chars in strings',
+  proname => 'unistr', prorettype => 'text', proargtypes => 'text',
+  proisstrict => 't', prosrc => 'unistr' }
 ]
diff --git a/src/include/parser/scansup.h b/src/include/parser/scansup.h
index 5bc426660d..1481c1da01 100644
--- a/src/include/parser/scansup.h
+++ b/src/include/parser/scansup.h
@@ -14,6 +14,8 @@
 #ifndef SCANSUP_H
 #define SCANSUP_H
 
+#include "mb/pg_wchar.h"
+
 extern char *downcase_truncate_identifier(const char *ident, int len,
 										  bool warn);
 
@@ -24,4 +26,8 @@ extern void truncate_identifier(char *ident, int len, bool warn);
 
 extern bool scanner_isspace(char ch);
 
+extern unsigned int hexval(unsigned char c);
+
+extern void check_unicode_value(pg_wchar c);
+
 #endif							/* SCANSUP_H */
diff --git a/src/test/regress/expected/unicode.out b/src/test/regress/expected/unicode.out
index 2a1e903696..778ef6e696 100644
--- a/src/test/regress/expected/unicode.out
+++ b/src/test/regress/expected/unicode.out
@@ -79,3 +79,30 @@ ORDER BY num;
 
 SELECT is_normalized('abc', 'def');  -- run-time error
 ERROR:  invalid normalization form: def
+SELECT unistr('\0441\043B\043E\043D');
+ unistr 
+--------
+ слон
+(1 row)
+
+SELECT unistr('d\u0061t\U00000061');
+ unistr 
+--------
+ data
+(1 row)
+
+-- run-time error
+SELECT unistr('wrong: \db99');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \db99\0061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \+00db99\+000061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \+2FFFFF');
+ERROR:  invalid Unicode escape value
+SELECT unistr('wrong: \udb99\u0061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \U0000db99\U00000061');
+ERROR:  invalid Unicode surrogate pair
+SELECT unistr('wrong: \U002FFFFF');
+ERROR:  invalid Unicode escape value
diff --git a/src/test/regress/sql/unicode.sql b/src/test/regress/sql/unicode.sql
index ccfc6fa77a..546e85f8cd 100644
--- a/src/test/regress/sql/unicode.sql
+++ b/src/test/regress/sql/unicode.sql
@@ -30,3 +30,15 @@ FROM
 ORDER BY num;
 
 SELECT is_normalized('abc', 'def');  -- run-time error
+
+SELECT unistr('\0441\043B\043E\043D');
+SELECT unistr('d\u0061t\U00000061');
+
+-- run-time error
+SELECT unistr('wrong: \db99');
+SELECT unistr('wrong: \db99\0061');
+SELECT unistr('wrong: \+00db99\+000061');
+SELECT unistr('wrong: \+2FFFFF');
+SELECT unistr('wrong: \udb99\u0061');
+SELECT unistr('wrong: \U0000db99\U00000061');
+SELECT unistr('wrong: \U002FFFFF');
