Re: proposal: unescape_text function

2021-03-29 Thread Pavel Stehule
po 29. 3. 2021 v 12:19 odesílatel Peter Eisentraut <
peter.eisentr...@enterprisedb.com> napsal:

> On 25.03.21 10:44, Peter Eisentraut wrote:
> >
> > On 10.03.21 14:52, David Steele wrote:
> >>> I thought about it a little bit more, and  the prefix specification
> >>> has not too much sense (more if we implement this functionality as
> >>> function "unistr"). I removed the optional argument and renamed the
> >>> function to "unistr". The functionality is the same. Now it supports
> >>> Oracle convention, Java and Python (for Python U) and
> >>> \+XX. These formats was already supported.The compatibility witth
> >>> Oracle is nice.
> >>
> >> Peter, it looks like Pavel has aligned this function with unistr() as
> >> you suggested. Thoughts?
> >
> > I haven't read through the patch in detail yet, but I support the
> > proposed details of the functionality.
>
> Committed.
>
> I made two major changes:  I moved the tests from unicode.sql to
> strings.sql.  The first file is for tests that only work in UTF8
> encoding, which is not the case here.  Also, I wasn't comfortable with
> exposing little utility functions from the parser in an ad hoc way.  So
> I made local copies, which also allows us to make more
> locally-appropriate error messages.  I think there is some potential for
> refactoring here (see also src/common/hex.c), but that's perhaps better
> done separately and more comprehensively.
>

Thank you very much

Pavel


Re: proposal: unescape_text function

2021-03-29 Thread Peter Eisentraut

On 25.03.21 10:44, Peter Eisentraut wrote:


On 10.03.21 14:52, David Steele wrote:
I thought about it a little bit more, and  the prefix specification 
has not too much sense (more if we implement this functionality as 
function "unistr"). I removed the optional argument and renamed the 
function to "unistr". The functionality is the same. Now it supports 
Oracle convention, Java and Python (for Python U) and 
\+XX. These formats was already supported.The compatibility witth 
Oracle is nice.


Peter, it looks like Pavel has aligned this function with unistr() as 
you suggested. Thoughts?


I haven't read through the patch in detail yet, but I support the 
proposed details of the functionality.


Committed.

I made two major changes:  I moved the tests from unicode.sql to 
strings.sql.  The first file is for tests that only work in UTF8 
encoding, which is not the case here.  Also, I wasn't comfortable with 
exposing little utility functions from the parser in an ad hoc way.  So 
I made local copies, which also allows us to make more 
locally-appropriate error messages.  I think there is some potential for 
refactoring here (see also src/common/hex.c), but that's perhaps better 
done separately and more comprehensively.





Re: proposal: unescape_text function

2021-03-25 Thread Peter Eisentraut



On 10.03.21 14:52, David Steele wrote:
I thought about it a little bit more, and  the prefix specification 
has not too much sense (more if we implement this functionality as 
function "unistr"). I removed the optional argument and renamed the 
function to "unistr". The functionality is the same. Now it supports 
Oracle convention, Java and Python (for Python U) and 
\+XX. These formats was already supported.The compatibility witth 
Oracle is nice.


Peter, it looks like Pavel has aligned this function with unistr() as 
you suggested. Thoughts?


I haven't read through the patch in detail yet, but I support the 
proposed details of the functionality.





Re: proposal: unescape_text function

2021-03-10 Thread David Steele

On 12/2/20 1:30 PM, Pavel Stehule wrote:
st 2. 12. 2020 v 11:37 odesílatel Pavel Stehule st 2. 12. 2020 v 9:23 odesílatel Peter Eisentraut


Heh.  The fact that there is a table of two dozen possible
representations kind of proves my point that we should be
deliberate in
picking one.

I do see Oracle unistr() on that list, which appears to be very
similar
to what you are trying to do here.  Maybe look into aligning
with that.

unistr is a primitive form of proposed function.  But it can be used
as a base. The format is compatible with our  "4.1.2.3. String
Constants with Unicode Escapes".

What do you think about the following proposal?

1. unistr(text) .. compatible with Postgres unicode escapes - it is
enhanced against Oracle, because Oracle's unistr doesn't support 6
digits unicodes.

2. there can be optional parameter "prefix" with default "\". But
with "\u" it can be compatible with Java or Python.

What do you think about it?

I thought about it a little bit more, and  the prefix specification has 
not too much sense (more if we implement this functionality as function 
"unistr"). I removed the optional argument and renamed the function to 
"unistr". The functionality is the same. Now it supports Oracle 
convention, Java and Python (for Python U) and \+XX. These 
formats was already supported.The compatibility witth Oracle is nice.


Peter, it looks like Pavel has aligned this function with unistr() as 
you suggested. Thoughts?


Regards,
--
-David
da...@pgmasters.net




Re: proposal: unescape_text function

2020-12-02 Thread Bruce Momjian
On Wed, Dec  2, 2020 at 07:30:39PM +0100, Pavel Stehule wrote:
> postgres=# select
>  'Arabic     : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' )      || '
>   Chinese    : ' || unistr( '\4E2D\6587' )                               || '
>   English    : ' || unistr( 'English' )                                  || '
>   French     : ' || unistr( 'Fran\00E7ais' )                             || '
>   German     : ' || unistr( 'Deutsch' )                                  || '
>   Greek      : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) || '
>   Hebrew     : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' )                || '
>   Japanese   : ' || unistr( '\65E5\672C\8A9E' )                          || '
>   Korean     : ' || unistr( '\D55C\AD6D\C5B4' )                          || '
>   Portuguese : ' || unistr( 'Portugu\00EAs' )                            || '
>   Russian    : ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' )      || '
>   Spanish    : ' || unistr( 'Espa\00F1ol' )                              || '
>   Thai       : ' || unistr( '\0E44\0E17\0E22' )
>   as unicode_test_string;
> ┌──┐
> │   unicode_test_string    │
> ╞══╡
> │ Arabic     : العربية    ↵│
> │   Chinese    : 中文     ↵│
> │   English    : English  ↵│
> │   French     : Français ↵│
> │   German     : Deutsch  ↵│
> │   Greek      : Ελληνικά ↵│
> │   Hebrew     : עברית    ↵│
> │   Japanese   : 日本語   ↵│
> │   Korean     : 한국어   ↵│
> │   Portuguese : Português↵│
> │   Russian    : Русский  ↵│
> │   Spanish    : Español  ↵│
> │   Thai       : ไทย       │
> └──┘

Offlist, this table output is super-cool!

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: proposal: unescape_text function

2020-12-02 Thread Pavel Stehule
st 2. 12. 2020 v 11:37 odesílatel Pavel Stehule 
napsal:

>
>
> st 2. 12. 2020 v 9:23 odesílatel Peter Eisentraut <
> peter.eisentr...@enterprisedb.com> napsal:
>
>> On 2020-11-30 22:15, Pavel Stehule wrote:
>> > I would like some supporting documentation on this.  So far we only
>> > have
>> > one stackoverflow question, and then this implementation, and they
>> are
>> > not even the same format.  My worry is that if there is not precise
>> > specification, then people are going to want to add things in the
>> > future, and there will be no way to analyze such requests in a
>> > principled way.
>> >
>> >
>> > I checked this and it is "prefix backslash-u hex" used by Java,
>> > JavaScript  or RTF -
>> > https://billposer.org/Software/ListOfRepresentations.html
>>
>> Heh.  The fact that there is a table of two dozen possible
>> representations kind of proves my point that we should be deliberate in
>> picking one.
>>
>> I do see Oracle unistr() on that list, which appears to be very similar
>> to what you are trying to do here.  Maybe look into aligning with that.
>>
>
> unistr is a primitive form of proposed function.  But it can be used as a
> base. The format is compatible with our  "4.1.2.3. String Constants with
> Unicode Escapes".
>
> What do you think about the following proposal?
>
> 1. unistr(text) .. compatible with Postgres unicode escapes - it is
> enhanced against Oracle, because Oracle's unistr doesn't support 6 digits
> unicodes.
>
> 2. there can be optional parameter "prefix" with default "\". But with
> "\u" it can be compatible with Java or Python.
>
> What do you think about it?
>

I thought about it a little bit more, and  the prefix specification has not
too much sense (more if we implement this functionality as function
"unistr"). I removed the optional argument and renamed the function to
"unistr". The functionality is the same. Now it supports Oracle convention,
Java and Python (for Python U) and \+XX. These formats was
already supported. The compatibility witth Oracle is nice.

postgres=# select
 'Arabic : ' || unistr( '\0627\0644\0639\0631\0628\064A\0629' )  ||
'
  Chinese: ' || unistr( '\4E2D\6587' )   ||
'
  English: ' || unistr( 'English' )  ||
'
  French : ' || unistr( 'Fran\00E7ais' ) ||
'
  German : ' || unistr( 'Deutsch' )  ||
'
  Greek  : ' || unistr( '\0395\03BB\03BB\03B7\03BD\03B9\03BA\03AC' ) ||
'
  Hebrew : ' || unistr( '\05E2\05D1\05E8\05D9\05EA' )||
'
  Japanese   : ' || unistr( '\65E5\672C\8A9E' )  ||
'
  Korean : ' || unistr( '\D55C\AD6D\C5B4' )  ||
'
  Portuguese : ' || unistr( 'Portugu\00EAs' )||
'
  Russian: ' || unistr( '\0420\0443\0441\0441\043A\0438\0439' )  ||
'
  Spanish: ' || unistr( 'Espa\00F1ol' )  ||
'
  Thai   : ' || unistr( '\0E44\0E17\0E22' )
  as unicode_test_string;
┌──┐
│   unicode_test_string│
╞══╡
│ Arabic : العربية↵│
│   Chinese: 中文 ↵│
│   English: English  ↵│
│   French : Français ↵│
│   German : Deutsch  ↵│
│   Greek  : Ελληνικά ↵│
│   Hebrew : עברית↵│
│   Japanese   : 日本語   ↵│
│   Korean : 한국어   ↵│
│   Portuguese : Português↵│
│   Russian: Русский  ↵│
│   Spanish: Español  ↵│
│   Thai   : ไทย   │
└──┘
(1 row)


postgres=# SELECT UNISTR('Odpov\u011Bdn\u00E1 osoba');
┌─┐
│ unistr  │
╞═╡
│ Odpovědná osoba │
└─┘
(1 row)

New patch attached

Regards

Pavel






> Pavel
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index df29af6371..6ad8136523 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3553,6 +3553,34 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ unistr
+
+unistr ( string text )
+text
+   
+   
+Evaluate escaped unicode chars (4 or 6 digits) without prefix or
+ with prefix u (4 digits) or with prefix
+U (8 digits) to chars or with prefix
++ (6 digits).
+   
+   
+unistr('\0441\043B\043E\043D')
+слон
+   
+   
+unistr('d\0061t\+61')
+data
+   
+   
+unistr('d\u0061t\U0061')
+data
+   
+  
+
  
 

diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..cbddb61396 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -278,30 +278,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)

Re: proposal: unescape_text function

2020-12-02 Thread Chapman Flack
On 12/02/20 09:55, Chapman Flack wrote:
> In Perl, there is a useful extension to regexp substitution where
> you specify the replacement not as a string or even a string with &
> and \1 \2 ... magic, but as essentially a lambda that is passed the
> match and returns a computed replacement. That makes conversions of
> the sort discussed here generally trivial to implement.

Python, I should have added, allows that also. Java too, since release 9.

Regards,
-Chap




Re: proposal: unescape_text function

2020-12-02 Thread Chapman Flack
On 12/02/20 05:37, Pavel Stehule wrote:
> 2. there can be optional parameter "prefix" with default "\". But with "\u"
> it can be compatible with Java or Python.

Java's unicode escape form is one of those early ones that lack
a six-digit form, and where any character outside of the basic multilingual
plane has to be represented by two four-digit escapes in a row, encoding
the two surrogates that would make up the character's representation
in UTF-16.

Obviously that's an existing form that's out there, so it's not a bad
thing to have some kind of support for it, but it's not a great
representation to encourage people to use.

Python, by contrast, has both \u and \U where you would use
the latter to represent a non-BMP character directly. So the Java and
Python schemes should be considered distinct.

In Perl, there is a useful extension to regexp substitution where
you specify the replacement not as a string or even a string with &
and \1 \2 ... magic, but as essentially a lambda that is passed the
match and returns a computed replacement. That makes conversions of
the sort discussed here generally trivial to implement. Would it be
worth considering to add something of general utility like that, and
then there could be a small library of pure SQL functions (or a wiki
page or GitHub gist) covering a bunch of the two dozen representations
on that page linked above?

Regards,
-Chap




Re: proposal: unescape_text function

2020-12-02 Thread Andrew Dunstan


On 12/2/20 12:48 AM, Pavel Stehule wrote:
>
>
> st 2. 12. 2020 v 0:05 odesílatel Andrew Dunstan  > napsal:
>
>
> On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> > On 2020-11-29 18:36, Pavel Stehule wrote:
> >>
> >>     I don't really get the point of this function.  There is
> AFAICT no
> >>     function to produce this escaped format, and it's not a
> recognized
> >>     interchange format.  So under what circumstances would one
> need to
> >>     use this?
> >>
> >>
> >> Some corporate data can be in CSV format with escaped unicode
> >> characters. Without this function it is not possible to decode
> these
> >> files without external application.
> >
> > I would like some supporting documentation on this.  So far we only
> > have one stackoverflow question, and then this implementation, and
> > they are not even the same format.  My worry is that if there is not
> > precise specification, then people are going to want to add
> things in
> > the future, and there will be no way to analyze such requests in a
> > principled way.
> >
> >
> >
>
>
> Also, should this be an extension? I'm dubious about including such
> marginal uses in the core code unless there's a really good case
> for it.
>
>
>
[...]
> 3. there are new disadvantages of extensions in current DBaaS times.
> Until the extension is not directly accepted by a cloud provider, then
> the extension is not available for users. The acceptance of extensions
> is not too agile - so moving this code to extension doesn't solve this
> problem. Without DBaaS the implementation of this feature as the
> extensions can be good enough.
>
>

That argument can apply to any extension someone wants to use. If your
DBaaS provider doesn't support some extension you need to lobby them or
find another that does support it, rather than try to put it in core
code. Some extensions, such as untrusted PLs,  will naturally almost
never be supported by DBaaS providers because they are inherently
unsafe. That's not the case here.


cheers


andrew






Re: proposal: unescape_text function

2020-12-02 Thread Pavel Stehule
st 2. 12. 2020 v 9:23 odesílatel Peter Eisentraut <
peter.eisentr...@enterprisedb.com> napsal:

> On 2020-11-30 22:15, Pavel Stehule wrote:
> > I would like some supporting documentation on this.  So far we only
> > have
> > one stackoverflow question, and then this implementation, and they
> are
> > not even the same format.  My worry is that if there is not precise
> > specification, then people are going to want to add things in the
> > future, and there will be no way to analyze such requests in a
> > principled way.
> >
> >
> > I checked this and it is "prefix backslash-u hex" used by Java,
> > JavaScript  or RTF -
> > https://billposer.org/Software/ListOfRepresentations.html
>
> Heh.  The fact that there is a table of two dozen possible
> representations kind of proves my point that we should be deliberate in
> picking one.
>
> I do see Oracle unistr() on that list, which appears to be very similar
> to what you are trying to do here.  Maybe look into aligning with that.
>

unistr is a primitive form of proposed function.  But it can be used as a
base. The format is compatible with our  "4.1.2.3. String Constants with
Unicode Escapes".

What do you think about the following proposal?

1. unistr(text) .. compatible with Postgres unicode escapes - it is
enhanced against Oracle, because Oracle's unistr doesn't support 6 digits
unicodes.

2. there can be optional parameter "prefix" with default "\". But with "\u"
it can be compatible with Java or Python.

What do you think about it?

Pavel


Re: proposal: unescape_text function

2020-12-02 Thread Peter Eisentraut

On 2020-11-30 22:15, Pavel Stehule wrote:

I would like some supporting documentation on this.  So far we only
have
one stackoverflow question, and then this implementation, and they are
not even the same format.  My worry is that if there is not precise
specification, then people are going to want to add things in the
future, and there will be no way to analyze such requests in a
principled way.


I checked this and it is "prefix backslash-u hex" used by Java, 
JavaScript  or RTF - 
https://billposer.org/Software/ListOfRepresentations.html


Heh.  The fact that there is a table of two dozen possible 
representations kind of proves my point that we should be deliberate in 
picking one.


I do see Oracle unistr() on that list, which appears to be very similar 
to what you are trying to do here.  Maybe look into aligning with that.





Re: proposal: unescape_text function

2020-12-01 Thread Pavel Stehule
st 2. 12. 2020 v 0:05 odesílatel Andrew Dunstan 
napsal:

>
> On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> > On 2020-11-29 18:36, Pavel Stehule wrote:
> >>
> >> I don't really get the point of this function.  There is AFAICT no
> >> function to produce this escaped format, and it's not a recognized
> >> interchange format.  So under what circumstances would one need to
> >> use this?
> >>
> >>
> >> Some corporate data can be in CSV format with escaped unicode
> >> characters. Without this function it is not possible to decode these
> >> files without external application.
> >
> > I would like some supporting documentation on this.  So far we only
> > have one stackoverflow question, and then this implementation, and
> > they are not even the same format.  My worry is that if there is not
> > precise specification, then people are going to want to add things in
> > the future, and there will be no way to analyze such requests in a
> > principled way.
> >
> >
> >
>
>
> Also, should this be an extension? I'm dubious about including such
> marginal uses in the core code unless there's a really good case for it.
>

I am not sure, and  I am inclined so it should be core functionality.

1. Although this use case is marginal, this is related to most used
encodings - ascii and unicode. 8 bit encodings enhanced about escaped
multibyte chars will be used for a very long time. Unfortunately - this
will be worse, because Postgres will be used more in the corporate
environment, where there is a bigger press to conserve very legacy
technologies without correct multibyte support. The core problem so this
issue is out of concept bytea -> text or text -> bytea transformations
supported by Postgres. This is text -> text transformation (for almost all
encoding based on ascii), that is not supported by Postgres now.

2. Postgres already has this functionality - but unfortunately there is a
limit just only literal constants.

create or replace function uunescape(text)
returns text as $$
declare r text;
begin
  -- don't use this code!!!
  execute 'select e''' || $1 ||  into r;
  return r;
end;
$$ language plpgsql immutable;

But one way how anybody can use it is SQL injection vulnerable and slow. So
some simple buildin solution can be protection against some future security
issues. Personally I am happy with just this limited function that will be
safe (although the design based on introducing new encoding and conversions
can be more complete and accurate). I agree so this case is marginal, but
it is a fully valid use case, and supporting unicode escaped codes just by
parser is a needless limit.

3. there are new disadvantages of extensions in current DBaaS times. Until
the extension is not directly accepted by a cloud provider, then the
extension is not available for users. The acceptance of extensions is not
too agile - so moving this code to extension doesn't solve this problem.
Without DBaaS the implementation of this feature as the extensions can be
good enough.

Regards

Pavel




>
> cheers
>
>
> andrew
>
>


Re: proposal: unescape_text function

2020-12-01 Thread Andrew Dunstan


On 11/30/20 8:14 AM, Peter Eisentraut wrote:
> On 2020-11-29 18:36, Pavel Stehule wrote:
>>
>>     I don't really get the point of this function.  There is AFAICT no
>>     function to produce this escaped format, and it's not a recognized
>>     interchange format.  So under what circumstances would one need to
>>     use this?
>>
>>
>> Some corporate data can be in CSV format with escaped unicode
>> characters. Without this function it is not possible to decode these
>> files without external application.
>
> I would like some supporting documentation on this.  So far we only
> have one stackoverflow question, and then this implementation, and
> they are not even the same format.  My worry is that if there is not
> precise specification, then people are going to want to add things in
> the future, and there will be no way to analyze such requests in a
> principled way.
>
>
>


Also, should this be an extension? I'm dubious about including such
marginal uses in the core code unless there's a really good case for it.


cheers


andrew





Re: proposal: unescape_text function

2020-12-01 Thread Pavel Stehule
út 1. 12. 2020 v 20:20 odesílatel Chapman Flack 
napsal:

> >> po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule <
> pavel.steh...@gmail.com>
> >> napsal:
> >>> I checked this and it is "prefix backslash-u hex" used by Java,
> >>> JavaScript  or RTF -
> >>> https://billposer.org/Software/ListOfRepresentations.html
>
> If I look on that page, it appears that RTF is using a similar-looking
> escape but in decimal rather than hex.
>
> It would be important to define what is done with non-BMP characters?
> Will there be another escape for a six- or eight-hexdigit format for
> the codepoint, or will it be represented by two four-hexdigit escapes
> for consecutive UTF-16 surrogates?
>

the detection of decimal or hexadecimal codes can be a hard problem -
string "12" is valid in both systems, but the numbers are different. So
there should be external specification as an argument.

Regards

Pavel



> Regards,
> -Chap
>


Re: proposal: unescape_text function

2020-12-01 Thread Chapman Flack
>> po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule 
>> napsal:
>>> I checked this and it is "prefix backslash-u hex" used by Java,
>>> JavaScript  or RTF -
>>> https://billposer.org/Software/ListOfRepresentations.html

If I look on that page, it appears that RTF is using a similar-looking
escape but in decimal rather than hex.

It would be important to define what is done with non-BMP characters?
Will there be another escape for a six- or eight-hexdigit format for
the codepoint, or will it be represented by two four-hexdigit escapes
for consecutive UTF-16 surrogates?

Regards,
-Chap




Re: proposal: unescape_text function

2020-11-30 Thread Pavel Stehule
po 30. 11. 2020 v 22:56 odesílatel Pavel Stehule 
napsal:

>
>
> po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule 
> napsal:
>
>>
>>
>> po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut <
>> peter.eisentr...@enterprisedb.com> napsal:
>>
>>> On 2020-11-29 18:36, Pavel Stehule wrote:
>>> >
>>> > I don't really get the point of this function.  There is AFAICT no
>>> > function to produce this escaped format, and it's not a recognized
>>> > interchange format.  So under what circumstances would one need to
>>> > use this?
>>> >
>>> >
>>> > Some corporate data can be in CSV format with escaped unicode
>>> > characters. Without this function it is not possible to decode these
>>> > files without external application.
>>>
>>> I would like some supporting documentation on this.  So far we only have
>>> one stackoverflow question, and then this implementation, and they are
>>> not even the same format.  My worry is that if there is not precise
>>> specification, then people are going to want to add things in the
>>> future, and there will be no way to analyze such requests in a
>>> principled way.
>>>
>>>
>> I checked this and it is "prefix backslash-u hex" used by Java,
>> JavaScript  or RTF -
>> https://billposer.org/Software/ListOfRepresentations.html
>>
>> In some languages (Python), there is decoder "unicode-escape". Java has
>> a method escapeJava, for conversion from unicode to ascii. I can imagine so
>> these data are from Java systems exported to 8bit strings - so this
>> implementation can be accepted as  referential. This format is used by
>> https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
>> tool too.
>>
>> Postgres can decode this format too, and the patch is based on Postgres
>> implementation. I just implemented a different interface.
>>
>> Currently decode function does only text->bytea transformation. Maybe a
>> more generic function "decode_text" and "encode_text" for similar cases can
>> be better (here we need text->text transformation). But it looks like
>> overengineering now.
>>
>> Maybe we introduce new encoding "ascii" and we can implement new
>> conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most
>> clean solution. What do you think about it?
>>
>
> a better name of new encoding can be "unicode-escape" than "ascii". We use
> "to_ascii" function for different use case.
>
> set client_encoding to unicode-escape;
> copy tab from xxx;
> ...
>
> but it doesn't help when only a few columns from the table are in
> unicode-escape format.
>
>
probably the most complete solution can be from two steps:

1. introducing new encoding - "ascii_unicode_escape" with related
conversions
2. introducing two new functions - text_escape and text_unescape with two
parameters - source text and conversion name

select text_convert_to('Тимати', 'ascii_unicode_escape')
\u0422\u0438\u043c\u0430\u0442\u0438 .. result is text

select text_convert_from('\u0422\u0438\u043c\u0430\u0442\u0438',
'ascii_unicode_escape')
┌──┐
│ ?column? │
╞══╡
│ Тимати   │
└──┘
(1 row)


>
>
>> Regards
>>
>> Pavel
>>
>>
>>


Re: proposal: unescape_text function

2020-11-30 Thread Pavel Stehule
po 30. 11. 2020 v 22:15 odesílatel Pavel Stehule 
napsal:

>
>
> po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut <
> peter.eisentr...@enterprisedb.com> napsal:
>
>> On 2020-11-29 18:36, Pavel Stehule wrote:
>> >
>> > I don't really get the point of this function.  There is AFAICT no
>> > function to produce this escaped format, and it's not a recognized
>> > interchange format.  So under what circumstances would one need to
>> > use this?
>> >
>> >
>> > Some corporate data can be in CSV format with escaped unicode
>> > characters. Without this function it is not possible to decode these
>> > files without external application.
>>
>> I would like some supporting documentation on this.  So far we only have
>> one stackoverflow question, and then this implementation, and they are
>> not even the same format.  My worry is that if there is not precise
>> specification, then people are going to want to add things in the
>> future, and there will be no way to analyze such requests in a
>> principled way.
>>
>>
> I checked this and it is "prefix backslash-u hex" used by Java,
> JavaScript  or RTF -
> https://billposer.org/Software/ListOfRepresentations.html
>
> In some languages (Python), there is decoder "unicode-escape". Java has a
> method escapeJava, for conversion from unicode to ascii. I can imagine so
> these data are from Java systems exported to 8bit strings - so this
> implementation can be accepted as  referential. This format is used by
> https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
> tool too.
>
> Postgres can decode this format too, and the patch is based on Postgres
> implementation. I just implemented a different interface.
>
> Currently decode function does only text->bytea transformation. Maybe a
> more generic function "decode_text" and "encode_text" for similar cases can
> be better (here we need text->text transformation). But it looks like
> overengineering now.
>
> Maybe we introduce new encoding "ascii" and we can implement new
> conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most
> clean solution. What do you think about it?
>

a better name of new encoding can be "unicode-escape" than "ascii". We use
"to_ascii" function for different use case.

set client_encoding to unicode-escape;
copy tab from xxx;
...

but it doesn't help when only a few columns from the table are in
unicode-escape format.




> Regards
>
> Pavel
>
>
>


Re: proposal: unescape_text function

2020-11-30 Thread Pavel Stehule
po 30. 11. 2020 v 14:14 odesílatel Peter Eisentraut <
peter.eisentr...@enterprisedb.com> napsal:

> On 2020-11-29 18:36, Pavel Stehule wrote:
> >
> > I don't really get the point of this function.  There is AFAICT no
> > function to produce this escaped format, and it's not a recognized
> > interchange format.  So under what circumstances would one need to
> > use this?
> >
> >
> > Some corporate data can be in CSV format with escaped unicode
> > characters. Without this function it is not possible to decode these
> > files without external application.
>
> I would like some supporting documentation on this.  So far we only have
> one stackoverflow question, and then this implementation, and they are
> not even the same format.  My worry is that if there is not precise
> specification, then people are going to want to add things in the
> future, and there will be no way to analyze such requests in a
> principled way.
>
>
I checked this and it is "prefix backslash-u hex" used by Java, JavaScript
or RTF - https://billposer.org/Software/ListOfRepresentations.html

In some languages (Python), there is decoder "unicode-escape". Java has a
method escapeJava, for conversion from unicode to ascii. I can imagine so
these data are from Java systems exported to 8bit strings - so this
implementation can be accepted as  referential. This format is used by
https://docs.oracle.com/javase/8/docs/technotes/tools/unix/native2ascii.html
tool too.

Postgres can decode this format too, and the patch is based on Postgres
implementation. I just implemented a different interface.

Currently decode function does only text->bytea transformation. Maybe a
more generic function "decode_text" and "encode_text" for similar cases can
be better (here we need text->text transformation). But it looks like
overengineering now.

Maybe we introduce new encoding "ascii" and we can implement new
conversions "ascii_to_utf8" and "utf8_to_ascii". It looks like the most
clean solution. What do you think about it?

Regards

Pavel


Re: proposal: unescape_text function

2020-11-30 Thread Peter Eisentraut

On 2020-11-29 18:36, Pavel Stehule wrote:


I don't really get the point of this function.  There is AFAICT no
function to produce this escaped format, and it's not a recognized
interchange format.  So under what circumstances would one need to
use this?


Some corporate data can be in CSV format with escaped unicode 
characters. Without this function it is not possible to decode these 
files without external application.


I would like some supporting documentation on this.  So far we only have 
one stackoverflow question, and then this implementation, and they are 
not even the same format.  My worry is that if there is not precise 
specification, then people are going to want to add things in the 
future, and there will be no way to analyze such requests in a 
principled way.






Re: proposal: unescape_text function

2020-11-29 Thread Pavel Stehule
pá 27. 11. 2020 v 15:37 odesílatel Peter Eisentraut <
peter.eisentr...@enterprisedb.com> napsal:

> On 2020-10-07 11:00, Pavel Stehule wrote:
> > Since the idea originated from unescaping unicode string
> > literals i.e.
> > select unescape('Odpov\u011Bdn\u00E1 osoba');
> >
> > Shouldn't the built-in function support the above syntax as well?
> >
> >
> > good idea. The prefixes u (4 digits) and U (8 digits) are supported
>
> I don't really get the point of this function.  There is AFAICT no
> function to produce this escaped format, and it's not a recognized
> interchange format.  So under what circumstances would one need to use
> this?
>

Some corporate data can be in CSV format with escaped unicode characters.
Without this function it is not possible to decode these files without
external application.

Postgres has support for this conversion, but only for string literals.

CREATE OR REPLACE FUNCTION public.unescape(text, text)
 RETURNS text
 LANGUAGE plpgsql
 AS $function$
 DECLARE result text;
 BEGIN
   EXECUTE format('SELECT U&%s UESCAPE %s',
 quote_literal(replace($1, '\u','^')),
 quote_literal($2)) INTO result;
   RETURN result;
 END;
 $function$


Because unicode is major encoding, I think this conversion should be
supported. There is another question about implementation like in this
patch implemented unicode_unescape function, or with some new conversion.
Using conversion
https://www.postgresql.org/docs/current/sql-createconversion.html  is
probably better, but I am not sure how intuitive it is, and it is hard to
use it (without not nice workarounds) in plpgsql.

I don't expect so Postgres should produce data in unicode escaped format,
but can be useful, if Postgres can do some work with data in special format
of major encoding.

postgres=# select convert_from(E'Odpov\u011Bdn\u00E1 osoba', 'UTF8');
┌─┐
│  convert_from   │
╞═╡
│ Odpovědná osoba │
└─┘
(1 row)

I can do this with bytea, but it is hard to use it with text fields.

I didn't find any way how to do it without ugly steps.

Regards

Pavel


Re: proposal: unescape_text function

2020-11-27 Thread Peter Eisentraut

On 2020-10-07 11:00, Pavel Stehule wrote:

Since the idea originated from unescaping unicode string
literals i.e.
        select unescape('Odpov\u011Bdn\u00E1 osoba');

Shouldn't the built-in function support the above syntax as well?


good idea. The prefixes u (4 digits) and U (8 digits) are supported


I don't really get the point of this function.  There is AFAICT no 
function to produce this escaped format, and it's not a recognized 
interchange format.  So under what circumstances would one need to use this?





Re: proposal: unescape_text function

2020-10-07 Thread Pavel Stehule
st 29. 7. 2020 v 8:18 odesílatel Pavel Stehule 
napsal:

> Hi
>
>
>>
>> Hi Pavel,
>>
>> Since the idea originated from unescaping unicode string literals i.e.
>>select unescape('Odpov\u011Bdn\u00E1 osoba');
>>
>> Shouldn't the built-in function support the above syntax as well?
>>
>
> good idea. The prefixes u (4 digits) and U (8 digits) are supported
>
> Regards
>

rebase

Regards

Pavel


> Pavel
>
>
>> --
>> Asif Rehman
>> Highgo Software (Canada/China/Pakistan)
>> URL : www.highgo.ca
>>
>>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..93fa238280 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3570,6 +3570,38 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ unicode_unescape
+
+unicode_unescape ( string text
+, escape_char text  )
+text
+   
+   
+Evaluate escaped unicode chars (4 or 6 digits), with prefix
+u (4 digits) or with prefix
+U (8 digits) to chars.
+   
+   
+unicode_unescape('\0441\043B\043E\043D')
+слон
+   
+   
+unicode_unescape('d\0061t\+61')
+data
+   
+   
+unicode_unescape('d!0061t!+61', '!')
+data
+   
+   
+unicode_unescape('d\u0061t\U0061')
+data
+   
+  
+
  
 

diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..c7f94298c1 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -26,7 +26,6 @@
 #include "parser/parser.h"
 #include "parser/scansup.h"
 
-static bool check_uescapechar(unsigned char escape);
 static char *str_udeescape(const char *str, char escape,
 		   int position, core_yyscan_t yyscanner);
 
@@ -278,44 +277,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;	/* not reached */
-}
-
-/* is Unicode code point acceptable? */
-static void
-check_unicode_value(pg_wchar c)
-{
-	if (!is_valid_unicode_codepoint(c))
-		ereport(ERROR,
-(errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("invalid Unicode escape value")));
-}
-
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-		return false;
-	else
-		return true;
-}
-
 /*
  * Process Unicode escapes in "str", producing a palloc'd plain string
  *
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index d07cbafcee..a2d3f629e9 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -125,3 +125,41 @@ scanner_isspace(char ch)
 		return true;
 	return false;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;	/* not reached */
+}
+
+/* is Unicode code point acceptable? */
+void
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid Unicode escape value")));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d7bc330541..add6ac3d18 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -6237,3 +6237,256 @@ unicode_is_normalized(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(result);
 }
+
+/*
+ * First four chars should be hexnum digits
+ */
+static bool
+isxdigit_four(const char *instr)
+{
+	return isxdigit((unsigned char)  instr[0]) &&
+			isxdigit((unsigned char) instr[1]) &&
+			isxdigit((unsigned char) instr[2]) &&
+			isxdigit((unsigned char) instr[3]);
+}
+
+/*
+ * Translate string with hexadecimal digits to number
+ */
+static long int
+hexval_four(const char *instr)
+{
+	return (hexval(instr[0]) << 12) +
+			(hexval(instr[1]) << 8) +
+			(hexval(instr[2]) << 4) +
+			 hexval(instr[3]);
+}
+
+/*
+ * Process Unicode escapes in "str"
+ *
+ * escape: the escape character to use
+ */
+static void
+udeescape(StringInfo str, const char *instr, size_t len, char escape)
+{
+	

Re: proposal: unescape_text function

2020-07-29 Thread Asif Rehman
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi,

The patch looks good to me.

The new status of this patch is: Ready for Committer


Re: proposal: unescape_text function

2020-07-29 Thread Pavel Stehule
Hi


>
> Hi Pavel,
>
> Since the idea originated from unescaping unicode string literals i.e.
>select unescape('Odpov\u011Bdn\u00E1 osoba');
>
> Shouldn't the built-in function support the above syntax as well?
>

good idea. The prefixes u (4 digits) and U (8 digits) are supported

Regards

Pavel


> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 959f6a1c2f..126d3483e6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3539,6 +3539,38 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ unicode_unescape
+
+unicode_unescape ( string text
+, escape_char text  )
+text
+   
+   
+Evaluate escaped unicode chars (4 or 6 digits), with prefix
+u (4 digits) or with prefix
+U (8 digits) to chars.
+   
+   
+unicode_unescape('\0441\043B\043E\043D')
+слон
+   
+   
+unicode_unescape('d\0061t\+61')
+data
+   
+   
+unicode_unescape('d!0061t!+61', '!')
+data
+   
+   
+unicode_unescape('d\u0061t\U0061')
+data
+   
+  
+
  
 

diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..c7f94298c1 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -26,7 +26,6 @@
 #include "parser/parser.h"
 #include "parser/scansup.h"
 
-static bool check_uescapechar(unsigned char escape);
 static char *str_udeescape(const char *str, char escape,
 		   int position, core_yyscan_t yyscanner);
 
@@ -278,44 +277,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;	/* not reached */
-}
-
-/* is Unicode code point acceptable? */
-static void
-check_unicode_value(pg_wchar c)
-{
-	if (!is_valid_unicode_codepoint(c))
-		ereport(ERROR,
-(errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("invalid Unicode escape value")));
-}
-
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-		return false;
-	else
-		return true;
-}
-
 /*
  * Process Unicode escapes in "str", producing a palloc'd plain string
  *
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index cac70d5df7..9d3173bc6d 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -218,3 +218,41 @@ scanner_isspace(char ch)
 		return true;
 	return false;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;	/* not reached */
+}
+
+/* is Unicode code point acceptable? */
+void
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid Unicode escape value")));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index df10bfb906..5ca9817708 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -6139,3 +6139,256 @@ unicode_is_normalized(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(result);
 }
+
+/*
+ * First four chars should be hexnum digits
+ */
+static bool
+isxdigit_four(const char *instr)
+{
+	return isxdigit((unsigned char)  instr[0]) &&
+			isxdigit((unsigned char) instr[1]) &&
+			isxdigit((unsigned char) instr[2]) &&
+			isxdigit((unsigned char) instr[3]);
+}
+
+/*
+ * Translate string with hexadecimal digits to number
+ */
+static long int
+hexval_four(const char *instr)
+{
+	return (hexval(instr[0]) << 12) +
+			(hexval(instr[1]) << 8) +
+			(hexval(instr[2]) << 4) +
+			 hexval(instr[3]);
+}
+
+/*
+ * Process Unicode escapes in "str"
+ *
+ * escape: the escape character to use
+ */
+static void
+udeescape(StringInfo str, const char *instr, size_t len, char escape)
+{
+	pg_wchar	pair_first = 0;
+	char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
+
+	while (len > 0)
+	{
+		if (instr[0] 

Re: proposal: unescape_text function

2020-07-28 Thread Asif Rehman
On Thu, Jul 2, 2020 at 10:10 PM Pavel Stehule 
wrote:

>
>
> čt 2. 7. 2020 v 17:27 odesílatel Daniel Gustafsson 
> napsal:
>
>> > On 23 Jun 2020, at 11:51, Pavel Stehule 
>> wrote:
>>
>> > I changed the name to more accurately "unicode_unescape". Patch is
>> assigned
>>
>> You've made this function return Oid, where it used to be void.  Was that
>> a
>> copy-paste mistake? Else the code needs fixing as it doesn't return an
>> Oid.
>>
>> +Oid
>> +check_unicode_value(pg_wchar c)
>> +{
>> +   if (!is_valid_unicode_codepoint(c))
>> +   ereport(ERROR,
>> +   (errcode(ERRCODE_SYNTAX_ERROR),
>> +errmsg("invalid Unicode escape value")));
>> +}
>>
>>
> yes, it is my error
>
> I am sending fixed patch
>
> Thank you for check
>
> Pavel
>
> cheers ./daniel
>>
>

Hi Pavel,

Since the idea originated from unescaping unicode string literals i.e.
   select unescape('Odpov\u011Bdn\u00E1 osoba');

Shouldn't the built-in function support the above syntax as well?

--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: proposal: unescape_text function

2020-07-02 Thread Pavel Stehule
čt 2. 7. 2020 v 17:27 odesílatel Daniel Gustafsson  napsal:

> > On 23 Jun 2020, at 11:51, Pavel Stehule  wrote:
>
> > I changed the name to more accurately "unicode_unescape". Patch is
> assigned
>
> You've made this function return Oid, where it used to be void.  Was that a
> copy-paste mistake? Else the code needs fixing as it doesn't return an Oid.
>
> +Oid
> +check_unicode_value(pg_wchar c)
> +{
> +   if (!is_valid_unicode_codepoint(c))
> +   ereport(ERROR,
> +   (errcode(ERRCODE_SYNTAX_ERROR),
> +errmsg("invalid Unicode escape value")));
> +}
>
>
yes, it is my error

I am sending fixed patch

Thank you for check

Pavel

cheers ./daniel
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f065856535..6aecdf1641 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3539,6 +3539,24 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ unicode_unescape
+
+unicode_unescape ( string text
+, escape_char text  )
+text
+   
+   
+Evaluate escaped unicode chars (4 or 6 digits) to chars.
+   
+   
+unicode_unescape('\0441\043B\043E\043D')
+слон
+   
+  
+
  
 

diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..c7f94298c1 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -26,7 +26,6 @@
 #include "parser/parser.h"
 #include "parser/scansup.h"
 
-static bool check_uescapechar(unsigned char escape);
 static char *str_udeescape(const char *str, char escape,
 		   int position, core_yyscan_t yyscanner);
 
@@ -278,44 +277,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;	/* not reached */
-}
-
-/* is Unicode code point acceptable? */
-static void
-check_unicode_value(pg_wchar c)
-{
-	if (!is_valid_unicode_codepoint(c))
-		ereport(ERROR,
-(errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("invalid Unicode escape value")));
-}
-
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-		return false;
-	else
-		return true;
-}
-
 /*
  * Process Unicode escapes in "str", producing a palloc'd plain string
  *
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index cac70d5df7..9d3173bc6d 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -218,3 +218,41 @@ scanner_isspace(char ch)
 		return true;
 	return false;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;	/* not reached */
+}
+
+/* is Unicode code point acceptable? */
+void
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid Unicode escape value")));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index df10bfb906..ce8373c417 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -6139,3 +6139,202 @@ unicode_is_normalized(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(result);
 }
+
+/*
+ * Process Unicode escapes in "str"
+ *
+ * escape: the escape character to use
+ */
+static void
+udeescape(StringInfo str, const char *instr, size_t len, char escape)
+{
+	pg_wchar	pair_first = 0;
+	char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
+
+	while (len > 0)
+	{
+		if (instr[0] == escape)
+		{
+			if (len >= 2 &&
+instr[1] == escape)
+			{
+if (pair_first)
+	goto invalid_pair;
+appendStringInfoChar(str, escape);
+instr += 2;
+len -= 2;
+			}
+			else if (len >= 5 &&
+	 isxdigit((unsigned char) instr[1]) &&
+	 isxdigit((unsigned char) instr[2]) &&
+	 isxdigit((unsigned char) instr[3]) &&
+	 isxdigit((unsigned char) instr[4]))
+			{
+pg_wchar	unicode;
+
+unicode = (hexval(instr[1]) << 12) +
+	(hexval(instr[2]) << 8) +
+	(hexval(instr[3]) << 4) +
+	

Re: proposal: unescape_text function

2020-07-02 Thread Daniel Gustafsson
> On 23 Jun 2020, at 11:51, Pavel Stehule  wrote:

> I changed the name to more accurately "unicode_unescape". Patch is assigned

You've made this function return Oid, where it used to be void.  Was that a
copy-paste mistake? Else the code needs fixing as it doesn't return an Oid.

+Oid
+check_unicode_value(pg_wchar c)
+{
+   if (!is_valid_unicode_codepoint(c))
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("invalid Unicode escape value")));
+}

cheers ./daniel




Re: proposal: unescape_text function

2020-06-23 Thread Pavel Stehule
po 22. 6. 2020 v 5:48 odesílatel Pavel Stehule 
napsal:

> Hi
>
> There is one user request for unescape function in core.
>
>
> https://stackoverflow.com/questions/20124393/convert-escaped-unicode-character-back-to-actual-character-in-postgresql/20125412?noredirect=1#comment110502526_20125412
>
> This request is about possibility that we do with string literal via
> functional interface instead string literals only
>
> I wrote plpgsql function, but built in function can be simpler:
>
> CREATE OR REPLACE FUNCTION public.unescape(text, text)
>  RETURNS text
>  LANGUAGE plpgsql
>  AS $function$
>  DECLARE result text;
>  BEGIN
>EXECUTE format('SELECT U&%s UESCAPE %s',
>  quote_literal(replace($1, '\u','^')),
>  quote_literal($2)) INTO result;
>RETURN result;
>  END;
>  $function$
>
> postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
> unescape -
>  Odpovědná osoba(1 row)
>
> What do you think about this?
>

I changed the name to more accurately "unicode_unescape". Patch is assigned

Regards

Pavel


> Regards
>
> Pavel
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b7c450ea29..365ea17946 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3533,6 +3533,24 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ unicode_unescape
+
+unicode_unescape ( string text
+, escape_char text  )
+text
+   
+   
+Evaluate escaped unicode chars (4 or 6 digits) to chars.
+   
+   
+unicode_unescape('\0441\043B\043E\043D')
+слон
+   
+  
+
  
 

diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index be86eb37fe..c7f94298c1 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -26,7 +26,6 @@
 #include "parser/parser.h"
 #include "parser/scansup.h"
 
-static bool check_uescapechar(unsigned char escape);
 static char *str_udeescape(const char *str, char escape,
 		   int position, core_yyscan_t yyscanner);
 
@@ -278,44 +277,6 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
 	return cur_token;
 }
 
-/* convert hex digit (caller should have verified that) to value */
-static unsigned int
-hexval(unsigned char c)
-{
-	if (c >= '0' && c <= '9')
-		return c - '0';
-	if (c >= 'a' && c <= 'f')
-		return c - 'a' + 0xA;
-	if (c >= 'A' && c <= 'F')
-		return c - 'A' + 0xA;
-	elog(ERROR, "invalid hexadecimal digit");
-	return 0;	/* not reached */
-}
-
-/* is Unicode code point acceptable? */
-static void
-check_unicode_value(pg_wchar c)
-{
-	if (!is_valid_unicode_codepoint(c))
-		ereport(ERROR,
-(errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("invalid Unicode escape value")));
-}
-
-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-	if (isxdigit(escape)
-		|| escape == '+'
-		|| escape == '\''
-		|| escape == '"'
-		|| scanner_isspace(escape))
-		return false;
-	else
-		return true;
-}
-
 /*
  * Process Unicode escapes in "str", producing a palloc'd plain string
  *
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index 18169ec4f4..5a39edf450 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -228,3 +228,41 @@ scanner_isspace(char ch)
 		return true;
 	return false;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+unsigned int
+hexval(unsigned char c)
+{
+	if (c >= '0' && c <= '9')
+		return c - '0';
+	if (c >= 'a' && c <= 'f')
+		return c - 'a' + 0xA;
+	if (c >= 'A' && c <= 'F')
+		return c - 'A' + 0xA;
+	elog(ERROR, "invalid hexadecimal digit");
+	return 0;	/* not reached */
+}
+
+/* is Unicode code point acceptable? */
+Oid
+check_unicode_value(pg_wchar c)
+{
+	if (!is_valid_unicode_codepoint(c))
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid Unicode escape value")));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+bool
+check_uescapechar(unsigned char escape)
+{
+	if (isxdigit(escape)
+		|| escape == '+'
+		|| escape == '\''
+		|| escape == '"'
+		|| scanner_isspace(escape))
+		return false;
+	else
+		return true;
+}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 2eaabd6231..2934a1d9da 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -6139,3 +6139,202 @@ unicode_is_normalized(PG_FUNCTION_ARGS)
 
 	PG_RETURN_BOOL(result);
 }
+
+/*
+ * Process Unicode escapes in "str"
+ *
+ * escape: the escape character to use
+ */
+static void
+udeescape(StringInfo str, const char *instr, size_t len, char escape)
+{
+	pg_wchar	pair_first = 0;
+	char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
+
+	while (len > 0)
+	{
+		if (instr[0] == escape)
+		{
+			if (len >= 2 &&
+instr[1] == escape)
+			{
+if (pair_first)
+	goto