New submission from mohammad aghanabi:

According to [UAX #29](http://unicode.org/reports/tr29), Unicode word 
boundaries (rule WB5a), an apostrophe includes both U+0027 ( ' ) APOSTROPHE and 
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK (curly apostrophe).

However, the regex module only checks for U+0027; the second kind (U+2019) is 
not handled:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
    is_unicode_vowel(char_at(state->text, text_pos)))
        return TRUE;


[Source code](https://bitbucket.org/mrabarnett/mrab-regex/src/f21447bf288780d8dd9b1633820480484ce8f677/regex_3/regex/_regex.c?at=default&fileviewer=file-view-default#_regex.c-1657)
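For illustration, here is a minimal standalone sketch of how the check could accept both code points. It is not taken from the regex module: is_wb5a_apostrophe, is_ascii_vowel, and wb5a_break_at are hypothetical names, the text is assumed to be an array of code points, and the vowel test is a simplified ASCII-only stand-in for the module's is_unicode_vowel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for the two apostrophe code points
   named in the report (U+0027 and U+2019). */
static bool is_wb5a_apostrophe(uint32_t ch) {
    return ch == 0x0027 || ch == 0x2019;
}

/* Simplified stand-in for the module's is_unicode_vowel():
   this sketch only recognises unaccented ASCII vowels. */
static bool is_ascii_vowel(uint32_t ch) {
    switch (ch) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
    case 'A': case 'E': case 'I': case 'O': case 'U':
        return true;
    default:
        return false;
    }
}

/* WB5a sketch: report a break at text_pos when the previous code
   point is an apostrophe and the current one is a vowel. */
static bool wb5a_break_at(const uint32_t *text, ptrdiff_t length,
                          ptrdiff_t text_pos) {
    ptrdiff_t pos_m1 = text_pos - 1;

    if (pos_m1 < 0 || text_pos >= length)
        return false;

    return is_wb5a_apostrophe(text[pos_m1]) &&
           is_ascii_vowel(text[text_pos]);
}
```

With this sketch, position 2 of the code points for "l’ami" (between the curly apostrophe and "a") is reported as a break, just as it is for "l'ami".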

----------
components: Regular Expressions
messages: 273782
nosy: ezio.melotti, mohammad aghanabi, mrabarnett
priority: normal
severity: normal
status: open
title: Unicode word boundries
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue27878>
_______________________________________