Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Peter Geoghegan
On Thu, Aug 11, 2016 at 4:22 AM, Palle Girgensohn  wrote:
> But in your strxfrm code in PostgreSQL, the keys are cached, and represented
> as int64s if I remember correctly, so perhaps there is still a benefit
> to using the abbreviated keys? More testing is required, I guess...

ICU's ucol_nextSortKeyPart() interface is faster for this, I believe,
and works because you only need the first sizeof(Datum) bytes (4 or 8
bytes). So, you have the right idea about using it (at least for the
abbreviated key stuff), but the second-to-last argument needs to be
sizeof(Datum).

The whole point of the abbreviated key optimization is that you can
avoid pointer chasing during each and every comparison. Your test here
is invalid because it doesn't involve reusing the keys. Often,
during a sort, each item has to be compared about 20 times.

I've experimented with ICU, and know it works well for this. You
really need to create a scenario with a real sort, under all the
conditions I describe.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Palle Girgensohn

> On 11 Aug 2016, at 11:15, Palle Girgensohn wrote:
> 
>> 
>> On 11 Aug 2016, at 03:05, Peter Geoghegan wrote:
>> 
>> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  
>> wrote:
>>> They've been used for the FreeBSD ports since 2005, and have served us 
>>> well. I have of course updated them regularly. In this latest version, I've 
>>> removed support for other encodings beside UTF-8, mostly since I don't know 
>>> how to test them, but also, I see little point in supporting anything else 
>>> using ICU.
>> 
>> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
>> is not the release that introduced its use, it did expand it
>> significantly. I think you need to fix this, even though it isn't
>> actually used to sort text at present, since presumably FreeBSD builds
>> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
>> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
>> opportunity. (See the wiki page on the abbreviated keys issue [1] if
>> you don't know what I'm talking about.)
> 
> My plan was to get it working without TRUST_STRXFRM first, and then add that 
> functionality. I've made some preliminary tests using ICU's ucol_getSortKey 
> but I will have to test it a bit more. For now, I just expect not to trust 
> strxfrm. This is the first iteration wrt strxfrm; the plan is to use that code 
> base.

Here are some preliminary results from repeatedly comparing the same two 
strings in a tight loop.

  ucol_strcollUTF8: -1  0.002448
   strcoll: 1   0.060711
  ucol_strcollIter: -1  0.009221
 direct memcmp: 1   0.000457
 memcpy memcmp: 1   0.001706
memcpy strcoll: 1   0.068425
   nextSortKeyPart: -1  0.041011
ucnv_toUChars + getSortKey: -1  0.050379


The correct answer is -1, but since we compare åasdf and äasdf with a Swedish 
locale, memcmp and strcoll fail, of course, as expected. Direct memcmp is 5 
times faster than ucol_strcollUTF8 (used in my patch), but sadly the best 
implementation using sort keys with ICU, nextSortKeyPart, is way slower.



startTime = getRealTime();
for (int i = 0; i < loop; i++) {
    result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);



vs


int sortkeysize = 8;

startTime = getRealTime();
uint8_t key1[sortkeysize], key2[sortkeysize];
uint32_t sState[2], tState[2];
UCharIterator sIter, tIter;

for (int i = 0; i < loop; i++) {
    uiter_setUTF8(&sIter, arg1, len1);
    uiter_setUTF8(&tIter, arg2, len2);
    sState[0] = 0; sState[1] = 0;
    tState[0] = 0; tState[1] = 0;
    ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status);
    ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status);
    result = memcmp(key1, key2, sortkeysize);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);



But in your strxfrm code in PostgreSQL, the keys are cached, and represented as 
int64s if I remember correctly, so perhaps there is still a benefit to using the 
abbreviated keys? More testing is required, I guess...

Palle







Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-11 Thread Palle Girgensohn

> On 11 Aug 2016, at 03:05, Peter Geoghegan wrote:
> 
> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  wrote:
>> They've been used for the FreeBSD ports since 2005, and have served us well. 
>> I have of course updated them regularly. In this latest version, I've 
>> removed support for other encodings beside UTF-8, mostly since I don't know 
>> how to test them, but also, I see little point in supporting anything else 
>> using ICU.
> 
> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
> is not the release that introduced its use, it did expand it
> significantly. I think you need to fix this, even though it isn't
> actually used to sort text at present, since presumably FreeBSD builds
> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
> opportunity. (See the wiki page on the abbreviated keys issue [1] if
> you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that 
functionality. I've made some preliminary tests using ICU's ucol_getSortKey but 
I will have to test it a bit more. For now, I just expect not to trust strxfrm. 
This is the first iteration wrt strxfrm; the plan is to use that code base.

> 
> Shouldn't you really have a strxfrm() wrapper, used across the board,
> including for callers outside of varlena.c? convert_string_datum() has
> been calling strxfrm() for many releases now. These calls are still
> used in FreeBSD builds, I would think, which seems like a bug that is
> not dodged by simply not defining TRUST_STRXFRM. Doesn't its assumption
> that it matches the ordering used elsewhere fail to hold on
> FreeBSD builds?

I was not aware of convert_string_datum, I will check that, thanks! Using a 
wrapper across the board seems like a good idea for refactoring.

> 
> [1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
> --
> Peter Geoghegan





Re: Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Peter Geoghegan
On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn  wrote:
> They've been used for the FreeBSD ports since 2005, and have served us well. 
> I have of course updated them regularly. In this latest version, I've removed 
> support for other encodings beside UTF-8, mostly since I don't know how to 
> test them, but also, I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
is not the release that introduced its use, it did expand it
significantly. I think you need to fix this, even though it isn't
actually used to sort text at present, since presumably FreeBSD builds
of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
reasonably trust the ICU equivalent of strxfrm(), so that's a missed
opportunity. (See the wiki page on the abbreviated keys issue [1] if
you don't know what I'm talking about.)

Shouldn't you really have a strxfrm() wrapper, used across the board,
including for callers outside of varlena.c? convert_string_datum() has
been calling strxfrm() for many releases now. These calls are still
used in FreeBSD builds, I would think, which seems like a bug that is
not dodged by simply not defining TRUST_STRXFRM. Doesn't its assumption
that it matches the ordering used elsewhere fail to hold on
FreeBSD builds?

[1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
-- 
Peter Geoghegan




Improved ICU patch - WAS: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-10 Thread Palle Girgensohn

> On 4 Aug 2016, at 02:40, Bruce Momjian wrote:
> 
> On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
> 
> Uh, this email from July says Peter Eisentraut will submit it in
> September  :-)
> 
>   
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa228892...@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support 
for COLLATE on columns.


https://github.com/girgen/postgres/


in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I 
have of course updated them regularly. In this latest version, I've removed 
support for other encodings beside UTF-8, mostly since I don't know how to test 
them, but also, I see little point in supporting anything else using ICU.



I have one question for someone with knowledge about Turkish (Devrim?). This is 
the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out	2016-08-10 21:09:03.0 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out	2016-08-10 21:12:53.0 +0200
***
*** 373,379 
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  ---
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  ---
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***
*** 385,391 
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  --
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  --
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.

==

The Linux locale behaves differently from ICU for the above (corner?) cases. 
Any ideas if one is more correct than the other? It seems unclear to me. Perhaps 
it depends on whether the case-insensitive match is done using lower(both) or 
upper(both)? I haven't investigated this yet. @Devrim, is one more correct than 
the other?


As Thomas points out, ucol_strcoll is quick, since no copying is 
needed. I will get some benchmarks soon.

Palle



[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
> Yep, it does. But we've made little to no progress on integration of ICU
> support and AFAIK nobody's working on it right now. 

Uh, this email from July says Peter Eisentraut will submit it in
September  :-)


https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa228892...@2ndquadrant.com

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Craig Ringer
On 4 August 2016 at 05:00, Thomas Munro 
wrote:

> On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer 
> wrote:
> > On 3 August 2016 at 22:54, Álvaro Hernández Tortosa 
> wrote:
> >> What would it take to support it? Isn't the varlena header
> propagated
> >> everywhere, which could help infer the real length of the string? Any
> >> pointers or suggestions would be welcome.
> >
> >
> > One of the bigger pain points is that our interaction with C library
> > collation routines for sorting uses NULL-terminated C strings.  strcoll,
> > strxfrm, etc.
>
> That particular bit of the problem would go away if this ever happened:
>
> https://wiki.postgresql.org/wiki/Todo:ICU
>
> ucol_strcoll takes explicit lengths (though it optionally accepts -1 for
> null-terminated mode).
>
>
> http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters
>

Yep, it does. But we've made little to no progress on integration of ICU
support and AFAIK nobody's working on it right now.

I wonder how MySQL implements their collation and encoding support?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa



On 03/08/16 21:42, Geoff Winkless wrote:

On 3 August 2016 at 20:36, Álvaro Hernández Tortosa  wrote:

 Isn't the correct syntax something like:

select E'\uc080', U&'\c080';

?

 It is a single character, 16 bit unicode sequence (see
https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

No, what you've done there is created the three-byte utf8 sequence \xec8280

# select U&'\c080'::bytea;
   bytea
--
  \xec8280

It's not a UCS2 c080, it's utf8 c080.

Geoff


Yes, you're absolutely right ^_^

Álvaro


--

Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Thomas Munro
On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer  wrote:
> On 3 August 2016 at 22:54, Álvaro Hernández Tortosa  wrote:
>> What would it take to support it? Isn't the varlena header propagated
>> everywhere, which could help infer the real length of the string? Any
>> pointers or suggestions would be welcome.
>
>
> One of the bigger pain points is that our interaction with C library
> collation routines for sorting uses NULL-terminated C strings.  strcoll,
> strxfrm, etc.

That particular bit of the problem would go away if this ever happened:

https://wiki.postgresql.org/wiki/Todo:ICU

ucol_strcoll takes explicit lengths (though it optionally accepts -1 for
null-terminated mode).

http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Geoff Winkless
On 3 August 2016 at 20:36, Álvaro Hernández Tortosa  wrote:
> Isn't the correct syntax something like:
>
> select E'\uc080', U&'\c080';
>
> ?
>
> It is a single character, 16 bit unicode sequence (see
> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

No, what you've done there is created the three-byte utf8 sequence \xec8280

# select U&'\c080'::bytea;
  bytea
--
 \xec8280

It's not a UCS2 c080, it's utf8 c080.

Geoff




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Tom Lane
Álvaro Hernández Tortosa  writes:
>  According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout 
> the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code 
> point. In short, the \u00 nul is represented (overlong encoding) by the 
> two-byte, 1 character sequence \uc080. These two bytes are invalid UTF-8 
> so should not appear in an otherwise valid UTF-8 string. Yet they are 
> accepted by Postgres (like if Postgres would support Modified UTF-8 
> intentionally).

Really?  It sure looks to me like pg_utf8_islegal() would reject this.

We could hack it to allow the case, no doubt, but I concur with Peter's
concern that we'd have trouble with OS-level code that is strict about
what UTF8 allows.  glibc, for example, is known to do very strange things
with strings that it thinks are invalid in the active encoding.

regards, tom lane




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa



On 03/08/16 21:31, Geoff Winkless wrote:

On 3 August 2016 at 20:13, Álvaro Hernández Tortosa  wrote:

Yet they are accepted by Postgres
(as if Postgres supported Modified UTF-8 intentionally). The character
in psql does not render as a nul but as this symbol: "삀".

Not accepted as valid utf8:

# select E'\xc0\x80';
ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x80

You would need a "modified utf8" encoding, I think.

Geoff


Isn't the correct syntax something like:

select E'\uc080', U&'\c080';

?

It is a single character, 16 bit unicode sequence (see 
https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).



Álvaro

--

Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Geoff Winkless
On 3 August 2016 at 20:13, Álvaro Hernández Tortosa  wrote:
> Yet they are accepted by Postgres
> (as if Postgres supported Modified UTF-8 intentionally). The character
> in psql does not render as a nul but as this symbol: "삀".

Not accepted as valid utf8:

# select E'\xc0\x80';
ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x80

You would need a "modified utf8" encoding, I think.

Geoff




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa



On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:



On 03/08/16 17:47, Kevin Grittner wrote:
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa 
 wrote:



 What would it take to support it?

Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8



That's nice, but I don't think so.

The problem is that you cannot predict how people will send you 
data, e.g. when importing from other databases. I guess it may work if 
Postgres implemented such a UTF-8 variant and the drivers did too, but 
that would still require an encoding conversion (i.e., parsing every 
string) to change the 0x00, which seems like a serious performance hit.


It could be worse than nothing, though!

Thanks,

Álvaro



It may indeed work.

According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout 
the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 
sequence. In short, the \u00 nul is represented (overlong encoding) by the 
two-byte, one-character sequence 0xC0 0x80. These two bytes are invalid UTF-8, 
so they should not appear in an otherwise valid UTF-8 string. Yet they are 
accepted by Postgres (as if Postgres supported Modified UTF-8 
intentionally). The character in psql does not render as a nul but as 
this symbol: "삀".


Given that this works, the process would look like this:

- Parse all input data looking for bytes with hex value 0x00. If they 
appear in the string, they are the null byte.

- Replace that byte with the two bytes 0xc080.
- Reverse the operation when reading.

This works, but is of course a performance hit (searching for 0x00 and 
then growing the byte[] or whatever data structure to account for the 
extra byte). A bit of a PITA, but I guess better than fixing it 
all :)



Álvaro


--

Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa



On 03/08/16 18:35, Geoff Winkless wrote:

On 3 August 2016 at 15:54, Álvaro Hernández Tortosa  wrote:

 Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
strictly non-compliant.

It's perhaps worth mentioning that 0x00 is valid ASCII too, and
PostgreSQL has never stored that either.


Then yes, it could also be a problem. But as of today I believe 
solving the problem for UTF-8 would solve the great majority of these 
embedded-NUL problems.


Álvaro


--

Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa



On 03/08/16 17:47, Kevin Grittner wrote:

On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa  
wrote:


 What would it take to support it?

Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8



That's nice, but I don't think so.

The problem is that you cannot predict how people will send you 
data, e.g. when importing from other databases. I guess it may work if 
Postgres implemented such a UTF-8 variant and the drivers did too, but 
that would still require an encoding conversion (i.e., parsing every 
string) to change the 0x00, which seems like a serious performance hit.


It could be worse than nothing, though!

Thanks,

Álvaro

--

Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa



On 03/08/16 17:23, Tom Lane wrote:

Álvaro Hernández Tortosa  writes:

  As has been previously discussed (see
https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
for instance) varlena fields cannot accept the literal 0x00 value.

Yup.


  What would it take to support it?

One key reason why that's hard is that datatype input and output
functions use nul-terminated C strings as the representation of the
text form of any datatype.  We can't readily change that API without
breaking huge amounts of code, much of it not under the core project's
control.

There may be other places where nul-terminated strings would be a hazard
(mumble fgets mumble), but offhand that API seems like the major problem
so far as the backend is concerned.

There would be a slew of client-side problems as well.  For example this
would assuredly break psql and pg_dump, along with every other client that
supposes that it can treat PQgetvalue() as returning a nul-terminated
string.  This end of it would possibly be even worse than fixing the
backend, because so little of the affected code is under our control.

In short, the problem is not with having an embedded nul in a stored
text value.  The problem is the reams of code that suppose that the
text representation of any data value is a nul-terminated C string.

regards, tom lane


Wow. That seems like a daunting task.

I guess, then, that even implementing a new datatype based on bytea, 
but using the text I/O functions so it shows up as text (not 
send/recv), would not work either, right?


Thanks for the input,

Álvaro


--

Álvaro Hernández Tortosa


---
8Kdata





Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Craig Ringer
On 3 August 2016 at 22:54, Álvaro Hernández Tortosa  wrote:

>
> Hi list.
>
> As has been previously discussed (see
> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
> for instance) varlena fields cannot accept the literal 0x00 value. Sure,
> you can use bytea, but this is hardly a good solution. The problem seems to be
> hitting some use cases, like:
>
> - People migrating data from other databases (apart from PostgreSQL, I
> don't know of any other database which suffers the same problem).
> - People using drivers which use UTF-8 or equivalent encodings by default
> (Java for example)
>
> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
> strictly non-compliant. And given the general Postgres policy regarding
> standards compliance and the people being hit by this, I think it should be
> addressed. Especially since all the usual fixes are a real PITA (re-parsing,
> re-generating strings, which is very expensive, or dropping data).
>
> What would it take to support it? Isn't the varlena header propagated
> everywhere, which could help infer the real length of the string? Any
> pointers or suggestions would be welcome.


One of the bigger pain points is that our interaction with C library
collation routines for sorting uses NULL-terminated C strings.  strcoll,
strxfrm, etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Geoff Winkless
On 3 August 2016 at 15:54, Álvaro Hernández Tortosa  wrote:
> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
> strictly non-compliant.

It's perhaps worth mentioning that 0x00 is valid ASCII too, and
PostgreSQL has never stored that either.

If you want to start quoting standards, there is in fact specific
mention in the ANSI spec of null terminators in passing strings to
host languages, so if postgresql stored NULs in that way we would end
up with parameters that we couldn't pass to UDFs in a
standards-compliant way.

Geoff




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Peter Eisentraut
On 8/3/16 11:47 AM, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa  
> wrote:
> 
>> What would it take to support it?
> 
> Would it be of any value to support "Modified UTF-8"?
> 
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Will this work with OS libraries?


-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Kevin Grittner
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa  
wrote:

> What would it take to support it?

Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Tom Lane
Álvaro Hernández Tortosa  writes:
>  As has been previously discussed (see 
> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
>  
> for instance) varlena fields cannot accept the literal 0x00 value.

Yup.

>  What would it take to support it?

One key reason why that's hard is that datatype input and output
functions use nul-terminated C strings as the representation of the
text form of any datatype.  We can't readily change that API without
breaking huge amounts of code, much of it not under the core project's
control.

There may be other places where nul-terminated strings would be a hazard
(mumble fgets mumble), but offhand that API seems like the major problem
so far as the backend is concerned.

There would be a slew of client-side problems as well.  For example this
would assuredly break psql and pg_dump, along with every other client that
supposes that it can treat PQgetvalue() as returning a nul-terminated
string.  This end of it would possibly be even worse than fixing the
backend, because so little of the affected code is under our control.

In short, the problem is not with having an embedded nul in a stored
text value.  The problem is the reams of code that suppose that the
text representation of any data value is a nul-terminated C string.

regards, tom lane




[HACKERS] Implementing full UTF-8 support (aka supporting 0x00)

2016-08-03 Thread Álvaro Hernández Tortosa


Hi list.

As has been previously discussed (see 
https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl 
for instance) varlena fields cannot accept the literal 0x00 value. Sure, 
you can use bytea, but this is hardly a good solution. The problem seems to 
be hitting some use cases, like:


- People migrating data from other databases (apart from PostgreSQL, I 
don't know of any other database which suffers the same problem).
- People using drivers which use UTF-8 or equivalent encodings by 
default (Java for example)


Given that 0x00 is a perfectly legal UTF-8 character, I conclude 
we're strictly non-compliant. And given the general Postgres policy 
regarding standards compliance and the people being hit by this, I think 
it should be addressed. Especially since all the usual fixes are a real 
PITA (re-parsing, re-generating strings, which is very expensive, or 
dropping data).


What would it take to support it? Isn't the varlena header 
propagated everywhere, which could help infer the real length of the 
string? Any pointers or suggestions would be welcome.


Thanks,

Álvaro


--

Álvaro Hernández Tortosa


---
8Kdata


