Re: Unicode normalization SQL functions

2020-04-02 Thread John Naylor
On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut wrote:
>
> On 2020-03-26 18:41, John Naylor wrote:
> > We don't have a trie implementation in Postgres, but we do have a
> > perfect hash implementation. Doing that would bring the tables back to
> > 64 bits per entry, but would likely be noticeably

Re: Unicode normalization SQL functions

2020-04-02 Thread Peter Eisentraut
On 2020-03-26 18:41, John Naylor wrote: We don't have a trie implementation in Postgres, but we do have a perfect hash implementation. Doing that would bring the tables back to 64 bits per entry, but would likely be noticeably faster than binary search. Since v4 has left out the biggest tables
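For readers following the lookup discussion, here is a minimal sketch of the binary-search baseline that the perfect-hash proposal is compared against. Everything in it is an assumption for illustration: the `toy_entry` layout, the table contents, and the `toy_lookup` name are hypothetical and are not taken from the patch. The entry happens to be 64 bits wide only to echo the figure mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 64-bit entry: a code point plus some packed property data. */
typedef struct
{
	uint32_t	codepoint;
	uint32_t	data;
} toy_entry;

/* Toy table, sorted by codepoint. The data values are illustrative only. */
static const toy_entry toy_table[] = {
	{0x00C0, 1},
	{0x00C9, 2},
	{0x00E9, 3},
	{0x0301, 4},
	{0x1E9B, 5},
};

/* bsearch() comparator: the key is a bare code point, elem is a toy_entry. */
static int
toy_entry_cmp(const void *key, const void *elem)
{
	uint32_t	k = *(const uint32_t *) key;
	uint32_t	e = ((const toy_entry *) elem)->codepoint;

	return (k > e) - (k < e);
}

/* O(log n) lookup; returns NULL when the code point has no entry. */
static const toy_entry *
toy_lookup(uint32_t cp)
{
	return bsearch(&cp, toy_table,
				   sizeof(toy_table) / sizeof(toy_table[0]),
				   sizeof(toy_entry), toy_entry_cmp);
}
```

A perfect hash would replace the O(log n) probe sequence with a single computed index plus one comparison, which is where the expected speedup would come from.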

Re: Unicode normalization SQL functions

2020-04-02 Thread Peter Eisentraut
On 2020-03-26 08:25, Peter Eisentraut wrote: On 2020-03-24 10:20, Peter Eisentraut wrote: Now I have some concerns about the size of the new table in unicode_normprops_table.h, and the resulting binary size. At the very least, we should probably make that #ifndef FRONTEND or something like

Re: Unicode normalization SQL functions

2020-03-27 Thread John Naylor
I wrote:
>
> Regression tests pass, but I haven't measured performance yet.

Using a test similar to one upthread:

select count(*) from (select md5(i::text) as t from generate_series(1,10) as i) s where t is nfc normalized ;

I get (median of three)

v4 419ms
v5 310ms

with binary size v4

Re: Unicode normalization SQL functions

2020-03-24 Thread Peter Eisentraut
On 2020-03-23 17:26, Daniel Verite wrote: Peter Eisentraut wrote: What is the status of this patch set? I think we have nailed down the behavior, but there were some concerns about certain performance characteristics. Do people feel that those are required to be addressed in this

Re: Unicode normalization SQL functions

2020-03-23 Thread Daniel Verite
Peter Eisentraut wrote: > What is the status of this patch set? I think we have nailed down the > behavior, but there were some concerns about certain performance > characteristics. Do people feel that those are required to be addressed > in this cycle? Not finding any other issue

Re: Unicode normalization SQL functions

2020-03-19 Thread Andreas Karlsson
On 3/19/20 3:41 PM, Peter Eisentraut wrote: What is the status of this patch set?  I think we have nailed down the behavior, but there were some concerns about certain performance characteristics.  Do people feel that those are required to be addressed in this cycle? Personally I would

Re: Unicode normalization SQL functions

2020-03-19 Thread Peter Eisentraut
What is the status of this patch set? I think we have nailed down the behavior, but there were some concerns about certain performance characteristics. Do people feel that those are required to be addressed in this cycle? -- Peter Eisentraut http://www.2ndQuadrant.com/

Re: Unicode normalization SQL functions

2020-02-19 Thread Peter Eisentraut
On 2020-02-17 20:08, Daniel Verite wrote: Concerning execution speed, there's an excessive CPU usage when normalizing into NFC or NFKC. Looking at pre-existing code, it looks like recompose_code() in unicode_norm.c looping over the UnicodeDecompMain array might be very costly. Yes, this is a
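To illustrate why a recomposition step that loops over a whole decomposition table can dominate CPU time, here is a toy model of such a loop. It is not the code in unicode_norm.c: the `toy_decomp` struct, the four-entry table, and `toy_recompose` are hypothetical names made up for this sketch. The point is only that a table keyed by the composed character gives no shortcut for the reverse question "which character do these two code points compose into?", so every candidate pair costs a full scan.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: a composed character and its two-part decomposition. */
typedef struct
{
	uint32_t	composed;
	uint32_t	first;
	uint32_t	second;
} toy_decomp;

/* Toy table ordered by composed code point, as a decomposition table would be. */
static const toy_decomp toy_decomp_table[] = {
	{0x00C9, 0x0045, 0x0301},	/* E + combining acute */
	{0x00E8, 0x0065, 0x0300},	/* e + combining grave */
	{0x00E9, 0x0065, 0x0301},	/* e + combining acute */
	{0x00F1, 0x006E, 0x0303},	/* n + combining tilde */
};

/*
 * Reverse lookup by linear scan: O(table size) for every pair tried.
 * Returns true and sets *result if the pair has a precomposed form.
 */
static bool
toy_recompose(uint32_t first, uint32_t second, uint32_t *result)
{
	size_t		n = sizeof(toy_decomp_table) / sizeof(toy_decomp_table[0]);

	for (size_t i = 0; i < n; i++)
	{
		if (toy_decomp_table[i].first == first &&
			toy_decomp_table[i].second == second)
		{
			*result = toy_decomp_table[i].composed;
			return true;
		}
	}
	return false;
}
```

With a real table of several thousand entries, this scan runs once per adjacent pair in the input. One possible remedy, offered here only as an assumption and not as what the patch does, is a second index sorted or hashed on the (first, second) pair.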

Re: Unicode normalization SQL functions

2020-02-19 Thread Peter Eisentraut
On 2020-02-17 20:14, Daniel Verite wrote: The comment in full says:

/*
 * unicode_normalize - Normalize a Unicode string to the specified form.
 *
 * The

Re: Unicode normalization SQL functions

2020-02-17 Thread Daniel Verite
One nitpick:

Around this hunk:

- * unicode_normalize_kc - Normalize a Unicode string to NFKC form.
+ * unicode_normalize - Normalize a Unicode string to the specified form.
  *
  * The input is a 0-terminated array of codepoints.
  *
@@ -304,8 +306,10 @@ decompose_code(pg_wchar code, pg_wchar

Re: Unicode normalization SQL functions

2020-02-17 Thread Daniel Verite
Hi, I've checked the v3 patch against the results of the normalization done by ICU [1] on my test data again, and they're identical (as they were with v1; v2 had the bug discussed upthread, now fixed). Concerning execution speed, there's an excessive CPU usage when normalizing into NFC or

Re: Unicode normalization SQL functions

2020-02-17 Thread Peter Eisentraut
On 2020-02-13 01:23, Andreas Karlsson wrote: A potential optimization would be to merge utf8_to_unicode() and pg_utf_mblen() into one function in unicode_normalize_func() since utf8_to_unicode() already knows length of the character. Probably not worth it though. This would also require
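The merge Andreas suggests could look roughly like the sketch below: one pass over the lead byte yields both the code point and the byte length, so the caller does not need a second call to find out how far to advance. `toy_utf8_decode` is a hypothetical helper written for this illustration, not a PostgreSQL function, and it assumes the input is already valid UTF-8 with enough bytes available.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of PostgreSQL): decode one UTF-8 sequence
 * starting at s, store the code point in *cp, and return the number of
 * bytes consumed. Assumes s points at valid, complete UTF-8.
 */
static int
toy_utf8_decode(const unsigned char *s, uint32_t *cp)
{
	if ((s[0] & 0x80) == 0)
	{
		/* 0xxxxxxx: one byte */
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0)
	{
		/* 110xxxxx 10xxxxxx: two bytes */
		*cp = ((uint32_t) (s[0] & 0x1F) << 6) |
			(uint32_t) (s[1] & 0x3F);
		return 2;
	}
	if ((s[0] & 0xF0) == 0xE0)
	{
		/* 1110xxxx 10xxxxxx 10xxxxxx: three bytes */
		*cp = ((uint32_t) (s[0] & 0x0F) << 12) |
			((uint32_t) (s[1] & 0x3F) << 6) |
			(uint32_t) (s[2] & 0x3F);
		return 3;
	}
	if ((s[0] & 0xF8) == 0xF0)
	{
		/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four bytes */
		*cp = ((uint32_t) (s[0] & 0x07) << 18) |
			((uint32_t) (s[1] & 0x3F) << 12) |
			((uint32_t) (s[2] & 0x3F) << 6) |
			(uint32_t) (s[3] & 0x3F);
		return 4;
	}

	/* Not a valid lead byte: report U+FFFD and skip one byte. */
	*cp = 0xFFFD;
	return 1;
}
```

A caller would loop with `p += toy_utf8_decode(p, &cp)` instead of decoding and measuring in two separate calls.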

Re: Unicode normalization SQL functions

2020-02-12 Thread Michael Paquier
On Thu, Feb 13, 2020 at 01:23:41AM +0100, Andreas Karlsson wrote: > On 1/28/20 9:21 PM, Peter Eisentraut wrote: >> You're right, this didn't make any sense.  Here is a new patch set with >> that fixed. > > Thanks for this patch. This is a feature which has been on my personal todo > list for a

Re: Unicode normalization SQL functions

2020-02-12 Thread Andreas Karlsson
On 1/28/20 9:21 PM, Peter Eisentraut wrote: You're right, this didn't make any sense.  Here is a new patch set with that fixed. Thanks for this patch. This is a feature which has been on my personal todo list for a while and something which I have wished to have a couple of times. I took a

Re: Unicode normalization SQL functions

2020-01-28 Thread Daniel Verite
Peter Eisentraut wrote: > Here is an updated patch set that now also implements the "quick check" > algorithm from UTR #15 for making IS NORMALIZED very fast in many cases, > which I had mentioned earlier in the thread. I found a bug in unicode_is_normalized_quickcheck() which is
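For context, the quick check algorithm in Unicode Standard Annex #15 (cited as UTR #15 in the message) can be sketched as follows. This is a sketch under stated assumptions, not the patch's unicode_is_normalized_quickcheck(): the two lookup functions cover only four combining marks and treat every other code point as a starter with NFC_Quick_Check=Yes, and all `toy_` names are hypothetical. A real implementation would consult tables generated from the Unicode data files.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum
{
	QC_NO,
	QC_YES,
	QC_MAYBE
} toy_qc;

/* Toy canonical combining class lookup; everything unlisted is a starter. */
static uint8_t
toy_ccc(uint32_t cp)
{
	switch (cp)
	{
		case 0x0300:			/* combining grave */
		case 0x0301:			/* combining acute */
		case 0x0340:			/* combining grave tone mark */
			return 230;
		case 0x0327:			/* combining cedilla */
			return 202;
		default:
			return 0;
	}
}

/* Toy NFC_Quick_Check lookup; everything unlisted is treated as Yes. */
static toy_qc
toy_nfc_qc(uint32_t cp)
{
	switch (cp)
	{
		case 0x0340:
			return QC_NO;
		case 0x0300:
		case 0x0301:
		case 0x0327:
			return QC_MAYBE;
		default:
			return QC_YES;
	}
}

/*
 * Quick check: NO means definitely not normalized, YES means definitely
 * normalized, MAYBE means the caller must normalize and compare.
 */
static toy_qc
toy_is_nfc_quickcheck(const uint32_t *s, size_t len)
{
	uint8_t		last_ccc = 0;
	toy_qc		result = QC_YES;

	for (size_t i = 0; i < len; i++)
	{
		uint8_t		ccc = toy_ccc(s[i]);
		toy_qc		check;

		/* Combining marks out of canonical order: not normalized. */
		if (last_ccc > ccc && ccc != 0)
			return QC_NO;

		check = toy_nfc_qc(s[i]);
		if (check == QC_NO)
			return QC_NO;
		if (check == QC_MAYBE)
			result = QC_MAYBE;

		last_ccc = ccc;
	}
	return result;
}
```

The speedup for IS NORMALIZED comes from the YES and NO answers, which need no allocation or full normalization pass; only a MAYBE result falls back to the slow path.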

Re: Unicode normalization SQL functions

2020-01-09 Thread Peter Eisentraut
On 2020-01-06 17:00, Daniel Verite wrote: Peter Eisentraut wrote: Also, there is a way to optimize the "is normalized" test for common cases, described in UTR #15. For that we'll need an additional data file from Unicode. In order to simplify that, I would like my patch "Add support

Re: Unicode normalization SQL functions

2020-01-06 Thread Daniel Verite
Peter Eisentraut wrote: > Also, there is a way to optimize the "is normalized" test for common > cases, described in UTR #15. For that we'll need an additional data > file from Unicode. In order to simplify that, I would like my patch > "Add support for automatically updating Unicode