Re: [GENERAL] endash not a graphic character?
On Sun, Aug 21, 2016 at 14:24:16 -0400, Tom Lane wrote:
> Unfortunately, these particular characters are U+2013 and U+2014 so you lose.

Thanks for saving me some time, as it would have taken me quite a while to figure that out. I'll adjust the constraint so that good strings aren't rejected, which was my immediate problem. I'm not that worried about bad strings getting added, since the data also gets checked before trying to add it to the database.

> Obviously there's room for improvement here, but so far nobody's been motivated to work on it. Last discussion about it (AFAIR) was this thread:

One thing I would suggest is documenting this limitation under:
https://www.postgresql.org/docs/9.6/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP

I might have missed it, but I did try reading that section to see if I was doing something wrong before asking on the list. In particular I would expect this limitation to be noted under: 9.7.3.6. Limits and Compatibility

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] endash not a graphic character?
Bruno Wolff III writes:
> However I am wondering about my use of [[:graph:]] to match characters
> that have glyphs. I was not expecting there to be characters that have
> glyphs to not be in the graph class. In the short term I might want to
> change the way I am testing that.

[ looks into code... ]

The [[:foo:]] notations only work up to Unicode code point U+7FF at the moment, per this comment in regc_pg_locale.c:

 * Decide how many character codes we ought to look through. For C locale
 * there's no need to go further than 127. Otherwise, if the encoding is
 * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
 * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
 * Multilingual Plane) without creating significant performance issues due
 * to too many characters being fed through the colormap code. This will
 * need redesign to fix reasonably, but at least for the moment we have
 * all common European languages covered. Otherwise (not C, not UTF8) go
 * up to 255. These limits are interrelated with restrictions discussed
 * at the head of this file.

Unfortunately, these particular characters are U+2013 and U+2014 so you lose.

Obviously there's room for improvement here, but so far nobody's been motivated to work on it. Last discussion about it (AFAIR) was this thread:

https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us

I'm not sure if any of the subsequent work on the regex engine would make it any easier to fix than it seemed at the time.

regards, tom lane
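The cutoff described in that comment can be checked against the characters from the original report. The following is only a small Python sketch, not PostgreSQL code: the constant UTF8_CLASS_CUTOFF and the helper within_class_range are hypothetical names introduced for illustration, and the only value taken from the thread is 0x7FF.

```python
# Hypothetical name; 0x7FF is the UTF8 cutoff quoted from the comment in
# regc_pg_locale.c. This is not a PostgreSQL identifier.
UTF8_CLASS_CUTOFF = 0x7FF


def within_class_range(ch: str) -> bool:
    """Return True if the character's code point is at or below the cutoff."""
    return ord(ch) <= UTF8_CLASS_CUTOFF


# The three characters tested in the original report.
for name, ch in [("hyphen-minus", "-"), ("en dash", "\u2013"), ("em dash", "\u2014")]:
    print(f"{name}: U+{ord(ch):04X}, within cutoff: {within_class_range(ch)}")
```

Only the plain hyphen falls inside the range, which is consistent with the t/f/f results in the original queries.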
Re: [GENERAL] endash not a graphic character?
On Sun, Aug 21, 2016 at 12:30:21 -0500, Bruno Wolff III wrote:
> I should also try the equivalent test in perl to see if it is more likely tied to the unicode implementation on my system or if it appears to be Postgres specific.

It looks like my locale may not be being set the way I expect. I tried testing in perl and initially I got results consistent with Postgres, but when I added code to make sure perl was working in utf-8 mode I started getting the expected results.

I would have expected manually adding a collation to the queries would have worked even if the default was not what I expected. So pointers to what I am missing would still be appreciated.
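One plausible reason a script gives different answers before and after it is put into utf-8 mode is that the same bytes are being read as different characters. The Python sketch below is an illustration of that general effect, not a diagnosis of the perl test in this message; the helper name code_points_seen is hypothetical.

```python
def code_points_seen(text: str, decode_as: str) -> list:
    """Encode text as UTF-8, then list the code points a reader would see
    if it decoded those bytes with the given encoding."""
    raw = text.encode("utf-8")
    return [f"U+{ord(c):04X}" for c in raw.decode(decode_as)]


en_dash = "\u2013"
# Decoded as UTF-8, the three bytes form one character.
print(code_points_seen(en_dash, "utf-8"))
# Decoded one byte per character, they look like three unrelated characters,
# two of which are C1 control codes.
print(code_points_seen(en_dash, "iso-8859-1"))
```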
Re: [GENERAL] endash not a graphic character?
On Sun, Aug 21, 2016 at 08:12:23 +1000, rob stone wrote:
> You can't use (emdash) or (endash)? Or their hex equivalents. See the Unicode chart.

By the way, those aren't the correct codes. That only works if your code treats iso-8859-1 code points as windows 1252 code points. That may happen to work in many cases, but isn't a good thing to bet on. (Single byte utf8 codes match iso-8859-1, not windows 1252.)
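The difference between the two encodings is easy to see for the dash characters. This Python sketch uses bytes 0x96 and 0x97 as the usual example of the confusion; the codes actually quoted in the earlier message are not preserved in this archive, so treating them as these two bytes is an assumption. The helper name decode_both is hypothetical.

```python
def decode_both(raw: bytes) -> tuple:
    """Decode the same bytes as Windows-1252 and as ISO-8859-1."""
    return raw.decode("cp1252"), raw.decode("iso-8859-1")


for raw in (b"\x96", b"\x97"):
    as_cp1252, as_latin1 = decode_both(raw)
    # Windows-1252 maps these bytes to the dashes; ISO-8859-1 maps them to
    # C1 control characters with the same numeric value as the byte.
    print(f"0x{raw[0]:02X}: cp1252 -> U+{ord(as_cp1252):04X}, "
          f"iso-8859-1 -> U+{ord(as_latin1):04X}")
```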
Re: [GENERAL] endash not a graphic character?
On Sun, Aug 21, 2016 at 08:12:23 +1000, rob stone wrote:
> You can't use (emdash) or (endash)? Or their hex equivalents. See the Unicode chart.

I am not the source of the data, but I can special case them one way or the other.

However I am wondering about my use of [[:graph:]] to match characters that have glyphs. I was not expecting there to be characters that have glyphs to not be in the graph class. In the short term I might want to change the way I am testing that.

I should also try the equivalent test in perl to see if it is more likely tied to the unicode implementation on my system or if it appears to be Postgres specific.
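A comparable test can be run outside the database. The sketch below uses Python rather than perl and approximates "has a glyph" as "Unicode general category is neither Other (C*) nor Separator (Z*)". That definition is an assumption made for illustration; it is not the POSIX definition of the graph class, nor what PostgreSQL or perl implement. The helper name is_graphic is hypothetical.

```python
import unicodedata


def is_graphic(ch: str) -> bool:
    """Approximate test for a visible character: exclude control, format,
    surrogate, private-use and unassigned code points (C*) and
    separators such as spaces (Z*)."""
    category = unicodedata.category(ch)
    return not category.startswith(("C", "Z"))


for name, ch in [("hyphen-minus", "-"), ("en dash", "\u2013"),
                 ("em dash", "\u2014"), ("space", " ")]:
    print(f"{name}: category {unicodedata.category(ch)}, graphic: {is_graphic(ch)}")
```

Under this approximation both dashes count as graphic, which matches the expectation stated above; a check like this could back up the pre-insert validation while the database constraint is loosened.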
Re: [GENERAL] endash not a graphic character?
Hello Bruno,

On Sat, 2016-08-20 at 14:04 -0500, Bruno Wolff III wrote:
> I was surprised to find endash and emdash were not graphic characters in
> en_US. I'm not sure if this is correct behavior, a bug in postgres or a
> bug in my OS' collation definitions?
>
> For example:
>
> Dash:
> area=> select '-' ~ '[[:graph:]]' collate "en_US";
>  ?column?
> ----------
>  t
> (1 row)
>
> Endash:
> area=> select '–' ~ '[[:graph:]]' collate "en_US";
>  ?column?
> ----------
>  f
> (1 row)
>
> Emdash:
> area=> select '—' ~ '[[:graph:]]' collate "en_US";
>  ?column?
> ----------
>  f
> (1 row)

You can't use (emdash) or (endash)? Or their hex equivalents. See the Unicode chart.

HTH,
rob