Re: [GENERAL] endash not a graphic character?

2016-08-21 Thread Bruno Wolff III

On Sun, Aug 21, 2016 at 14:24:16 -0400,
 Tom Lane wrote:

> Unfortunately, these particular characters are U+2013 and U+2014 so you
> lose.


Thanks for saving me some time, as it would have taken me quite a while 
to figure that out.


I'll adjust the constraint so that good strings aren't rejected, which
was my immediate problem. I'm not that worried about bad strings getting
added, since the data is also checked before it is added to the
database.
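
For illustration only, a pre-insert check of that kind could look roughly
like the Python sketch below. The function names and the definition of
"graphic" (any code point whose Unicode general category is neither
"Other" nor "Separator") are assumptions made for this example, not the
actual checking code.

```python
import unicodedata


def is_graphic(ch: str) -> bool:
    """Return True if ch has a visible glyph under the assumed definition."""
    # Assumption: "graphic" excludes the C* categories (control, format,
    # surrogate, private use, unassigned) and the Z* categories (separators).
    return unicodedata.category(ch)[0] not in ("C", "Z")


def all_graphic(s: str) -> bool:
    """Return True if s is non-empty and every character is graphic."""
    return len(s) > 0 and all(is_graphic(c) for c in s)


if __name__ == "__main__":
    for sample in ("a-b", "a\u2013b", "a\u2014b", "a b", "a\tb"):
        print(repr(sample), all_graphic(sample))
```

Unlike the [[:graph:]] test in the database, this check does not depend
on a code point cutoff, so the en dash and em dash pass.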



> Obviously there's room for improvement here, but so far nobody's been
> motivated to work on it.  Last discussion about it (AFAIR) was this
> thread:


One thing I would suggest is documenting this limitation under: 
https://www.postgresql.org/docs/9.6/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP


I might have missed it, but I did try reading that section to see if I was 
doing something wrong before asking on the list. In particular I would 
expect this limitation to be noted under:

9.7.3.6. Limits and Compatibility


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] endash not a graphic character?

2016-08-21 Thread Tom Lane
Bruno Wolff III  writes:
> However I am wondering about my use of [[:graph:]] to match characters
> that have glyphs. I was not expecting any characters that have glyphs
> to be outside the graph class. In the short term I might want to
> change the way I am testing that.

[ looks into code... ]  The [[:foo:]] notations only work up to Unicode
code point U+7FF at the moment, per this comment in regc_pg_locale.c:

 * Decide how many character codes we ought to look through.  For C locale
 * there's no need to go further than 127.  Otherwise, if the encoding is
 * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
 * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
 * Multilingual Plane) without creating significant performance issues due
 * to too many characters being fed through the colormap code.  This will
 * need redesign to fix reasonably, but at least for the moment we have
 * all common European languages covered.  Otherwise (not C, not UTF8) go
 * up to 255.  These limits are interrelated with restrictions discussed
 * at the head of this file.

Unfortunately, these particular characters are U+2013 and U+2014 so you
lose.
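
As a rough illustration of the cutoff described in that comment, here is
a small Python model. It is only a sketch: the function names are
hypothetical, and the real logic is C code in regc_pg_locale.c that is
not reproduced here.

```python
def class_lookup_limit(locale_is_c: bool, encoding: str) -> int:
    """Highest code point examined for [[:class:]] notations (model only)."""
    if locale_is_c:
        return 127      # C locale: plain ASCII
    if encoding.upper() == "UTF8":
        return 0x7FF    # the arbitrary cutoff mentioned in the comment
    return 255          # other single-byte encodings


def within_class_range(ch: str, locale_is_c: bool = False,
                       encoding: str = "UTF8") -> bool:
    """Return True if ch falls inside the range the classes can see."""
    return ord(ch) <= class_lookup_limit(locale_is_c, encoding)


if __name__ == "__main__":
    for name, ch in (("hyphen", "-"), ("endash", "\u2013"), ("emdash", "\u2014")):
        print(f"{name}: U+{ord(ch):04X} in range: {within_class_range(ch)}")
```

The hyphen (U+002D) is below the limit, while U+2013 and U+2014 are
above 0x7FF, which matches the t/f/f results reported for the three
queries.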

Obviously there's room for improvement here, but so far nobody's been
motivated to work on it.  Last discussion about it (AFAIR) was this
thread:

https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us

I'm not sure if any of the subsequent work on the regex engine would
make it any easier to fix than it seemed at the time.

regards, tom lane




Re: [GENERAL] endash not a graphic character?

2016-08-21 Thread Bruno Wolff III

On Sun, Aug 21, 2016 at 12:30:21 -0500,
 Bruno Wolff III wrote:

> I should also try the equivalent test in perl to see if it is more
> likely tied to the unicode implementation on my system or if it
> appears to be Postgres specific.


It looks like my locale may not be set the way I expect. I tried 
testing in perl and initially I got results consistent with Postgres, 
but when I added code to make sure perl was working in utf-8 mode I 
started getting the expected results.
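
The difference between the two modes can be sketched in Python (used
here only for illustration; the perl test itself is not shown in this
thread). With byte semantics the en dash is three non-ASCII bytes, none
of which is an ASCII graphic character; once decoded as UTF-8 it is a
single printable character.

```python
import re

endash = "\u2013"
raw = endash.encode("utf-8")  # three bytes: e2 80 93

# Byte semantics: test each byte against the ASCII graphic range 0x21-0x7E.
ascii_graph = re.compile(rb"[\x21-\x7e]")
byte_hits = [bool(ascii_graph.fullmatch(bytes([b]))) for b in raw]

# Character semantics: decode first, then ask whether the one resulting
# character is printable according to Python's Unicode tables.
decoded = raw.decode("utf-8")
char_is_printable = decoded.isprintable()

if __name__ == "__main__":
    print("bytes:", raw.hex(), "ascii graphic per byte:", byte_hits)
    print("characters:", len(decoded), "printable:", char_is_printable)
```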


I would have expected that manually adding a collation to the queries
would work even if the default was not what I expected. So pointers
to what I am missing would still be appreciated.





Re: [GENERAL] endash not a graphic character?

2016-08-21 Thread Bruno Wolff III

On Sun, Aug 21, 2016 at 08:12:23 +1000,
 rob stone wrote:

> You can't use  (emdash) or  (endash)?
> Or their hex equivalents. See the Unicode chart.


By the way, those aren't the correct codes. That only works if your 
code treats iso-8859-1 code points as windows 1252 code points. That 
may happen to work in many cases, but isn't a good thing to bet on.

(Unicode code points below 256 match iso-8859-1, not windows 1252.)
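
A short Python sketch of the mismatch, using only the standard codecs:
bytes 0x96 and 0x97 are the en dash and em dash in windows 1252, but in
iso-8859-1 the same bytes are C1 control characters. The real Unicode
code points are U+2013 and U+2014, which take three bytes each in UTF-8.

```python
# Decode the same two bytes under both single-byte encodings.
cp1252_points = [ord(bytes([b]).decode("cp1252")) for b in (0x96, 0x97)]
latin1_points = [ord(bytes([b]).decode("latin-1")) for b in (0x96, 0x97)]

# UTF-8 encodings of the actual dash characters.
endash_utf8 = "\u2013".encode("utf-8")
emdash_utf8 = "\u2014".encode("utf-8")

if __name__ == "__main__":
    print("cp1252 :", [f"U+{p:04X}" for p in cp1252_points])
    print("latin-1:", [f"U+{p:04X}" for p in latin1_points])
    print("utf-8  :", endash_utf8.hex(), emdash_utf8.hex())
```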




Re: [GENERAL] endash not a graphic character?

2016-08-21 Thread Bruno Wolff III

On Sun, Aug 21, 2016 at 08:12:23 +1000,
 rob stone wrote:

> You can't use  (emdash) or  (endash)?
> Or their hex equivalents. See the Unicode chart.


I am not the source of the data, but I can special case them one way 
or the other.


However I am wondering about my use of [[:graph:]] to match characters
that have glyphs. I was not expecting any characters that have glyphs
to be outside the graph class. In the short term I might want to
change the way I am testing that.


I should also try the equivalent test in perl to see if it is more likely 
tied to the unicode implementation on my system or if it appears to be 
Postgres specific.





Re: [GENERAL] endash not a graphic character?

2016-08-20 Thread rob stone
Hello Bruno,
On Sat, 2016-08-20 at 14:04 -0500, Bruno Wolff III wrote:
> I was surprised to find endash and emdash were not graphic characters
> in en_US. I'm not sure if this is correct behavior, a bug in postgres
> or a bug in my OS' collation definitions?
> 
> For example:
> 
> Dash:
> area=> select '-' ~ '[[:graph:]]' collate "en_US";
>  ?column? 
> ----------
>  t
> (1 row)
> 
> Endash:
> area=> select '–' ~ '[[:graph:]]' collate "en_US";
>  ?column? 
> ----------
>  f
> (1 row)
> 
> 
> Emdash:
> area=> select '—' ~ '[[:graph:]]' collate "en_US";
>  ?column? 
> ----------
>  f
> (1 row)
> 
> 



You can't use  (emdash) or  (endash)?
Or their hex equivalents. See the Unicode chart.

HTH,
rob

