[issue1693050] \w not helpful for non-Roman scripts

2018-03-14 Thread Terry J. Reedy

Terry J. Reedy added the comment:

Whatever I may have said before, I favor supporting the Unicode standard for 
\w, which is related to the standard for identifiers.

This is one of 2 issues about \w being defined too narrowly.  I am somewhat 
arbitrarily closing this as a duplicate of #12731 (fewer digits ;-).

There are 3 issues about tokenize.tokenize failing on valid identifiers, 
defined as \w sequences whose first char is an identifier itself (and therefore 
a start char).  In msg313814 of #32987, Serhiy indicates which start and 
continue identifier characters are matched by \W for re and regex.  I am 
leaving #24194 open as the tokenizer name issue.
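
The gap between identifiers and \w can be observed from Python itself. The
following is only a minimal sketch and is not taken from the tracker: the
helper name compare and the sample strings are assumptions made for
illustration. The \w result depends on the Python version and regex engine in
use, so no output is claimed for it.

```python
import re


def compare(text):
    """Return (is_identifier, fully_matched_by_w) for a string.

    Hypothetical helper, for illustration only.
    """
    is_identifier = text.isidentifier()
    matched_by_w = re.fullmatch(r"\w+", text) is not None
    return is_identifier, matched_by_w


# Sample strings (assumptions chosen for this sketch): the Devanagari word
# from this issue, and U+2118 SCRIPT CAPITAL P, which has the
# Other_ID_Start property.
for sample in ("\u0939\u093f\u0928\u094d\u0926\u0940", "\u2118"):
    print(ascii(sample), compare(sample))
```

Both samples are valid identifiers; whether \w+ covers them is exactly what
the linked issues are about.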

--
resolution:  -> duplicate
stage:  -> resolved
status: open -> closed
superseder:  -> tokenize yield an ERRORTOKEN if an identifier uses 
Other_ID_Start or Other_ID_Continue

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1693050] \w not helpful for non-Roman scripts

2014-02-03 Thread Mark Lawrence

Changes by Mark Lawrence:


--
nosy:  -BreamoreBoy




[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread STINNER Victor

STINNER Victor added the comment:

Let's look at Modules/_sre.c:

#define SRE_UNI_IS_ALNUM(ch) Py_UNICODE_ISALNUM(ch)
#define SRE_UNI_IS_WORD(ch) (SRE_UNI_IS_ALNUM(ch) || (ch) == '_')

>>> [ch.isalpha() for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
[True, False, True, False, True, False]
>>> import unicodedata
>>> [unicodedata.category(ch) for ch in '\u0939\u093f\u0928\u094d\u0926\u0940']
['Lo', 'Mc', 'Lo', 'Mn', 'Lo', 'Mc']

So the match stops before U+093F: its category is Mc (spacing combining mark), 
which is part of the Mark category, whereas the re module expects an 
alphanumeric character.

msg76557:

"""
Unicode TR#18 defines \w as a shorthand for

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
"""

So if we want to respect this standard, the re module needs to be modified to 
accept other Unicode categories.
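
To make the effect of the macro concrete, here is a rough Python emulation.
It assumes that str.isalnum() mirrors Py_UNICODE_ISALNUM closely enough for
illustration; the function names are made up for this sketch and are not part
of the re module.

```python
def sre_uni_is_word(ch):
    # Mirrors SRE_UNI_IS_WORD: alphanumeric, or an underscore.
    return ch.isalnum() or ch == "_"


def word_prefix(text):
    # Longest leading run of characters accepted by sre_uni_is_word().
    end = 0
    while end < len(text) and sre_uni_is_word(text[end]):
        end += 1
    return text[:end]


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(ascii(word_prefix(hindi)))  # '\u0939'
```

The run stops at the first Mc code point, which matches the one-character
result reported for re earlier in this thread.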

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Matthew Barnett

Matthew Barnett added the comment:

UTF-16 has nothing to do with it, that's just an encoding (a pair of them 
actually, UTF-16LE and UTF-16BE).

And I don't know why you thought I was using findall in msg190100 when the 
examples were using match! :-)

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Jeffrey C. Jacobs

Jeffrey C. Jacobs added the comment:

Thanks Matthew, and sorry to put you through more work; I just wanted to check 
exactly which Unicode code points (UTF-16, I take it) were being used, to 
verify whether the Unicode standard expects them to be treated as separate 
words or as single letters within a word.  Sanskrit is written with an 
alphabet, not ideographs, so each symbol is considered a letter.  So I believe 
your implementation is correct and yes, you are right, re is at fault.  There 
are only accenting characters and letters in that sequence, so they should be 
interpreted as a single word of 6 letters, as you determined, and not as a 
word consisting of only the first letter.  Mind you, I misinterpreted 
msg190100 in that I thought you were using findall, in which case the answer 
should be 1, but as far as the length of the extraction goes, yes, 6, I 
totally agree.  Sorry for the misunderstanding.  
http://www.unicode.org/charts/PDF/U0900.pdf contains the code chart for 
Devanagari, the script used for Hindi.

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-29 Thread Matthew Barnett

Matthew Barnett added the comment:

You could've obtained it from msg76556 or msg190100:

>>> print(ascii('हिन्दी'))
'\u0939\u093f\u0928\u094d\u0926\u0940'
>>> import re, regex
>>> print(ascii(re.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
'\u0939'
>>> print(ascii(regex.match(r"\w+", '\u0939\u093f\u0928\u094d\u0926\u0940').group()))
'\u0939\u093f\u0928\u094d\u0926\u0940'

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Jeffrey C. Jacobs

Jeffrey C. Jacobs added the comment:

Maybe you could show us the byte-for-byte hex of the string you're testing so 
we can check whether it's really a code point intended as a word boundary or 
just a code point that begins a new character.

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Matthew Barnett

Matthew Barnett added the comment:

I'm not sure what you're saying.

The re module in Python 3.3 matches only the first codepoint, treating the 
second codepoint as not part of a word, whereas the regex module matches all 6 
codepoints, treating them all as part of a single word.

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-28 Thread Jeffrey C. Jacobs

Jeffrey C. Jacobs added the comment:

Matthew, I think that is considered a single word in Sanskrit or Thai so Python 
3.x is correct.  In this case you've written the Sanskrit word for Hindi.

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Terry J. Reedy

Changes by Terry J. Reedy:


--
versions: +Python 3.3, Python 3.4 -Python 3.1




[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Matthew Barnett

Matthew Barnett added the comment:

I had to check what re does in Python 3.3:

>>> print(len(re.match(r'\w+', 'हिन्दी').group()))
1

Regex does this:

>>> print(len(regex.match(r'\w+', 'हिन्दी').group()))
6

--




[issue1693050] \w not helpful for non-Roman scripts

2013-05-26 Thread Mark Lawrence

Mark Lawrence added the comment:

Am I correct in saying that this must stay open as it targets the re module but 
as given in msg81221 is fixed in the new regex module?

--
nosy: +BreamoreBoy




[issue1693050] \w not helpful for non-Roman scripts

2010-03-30 Thread Shashwat Anand

Changes by Shashwat Anand:


--
nosy: +l0nwlf




[issue1693050] \w not helpful for non-Roman scripts

2010-03-05 Thread STINNER Victor

Changes by STINNER Victor:


--
nosy: +haypo




[issue1693050] \w not helpful for non-Roman scripts

2009-05-12 Thread Ezio Melotti

Changes by Ezio Melotti:


--
nosy: +ezio.melotti




[issue1693050] \w not helpful for non-Roman scripts

2009-02-05 Thread Matthew Barnett

Matthew Barnett added the comment:

In issue #2636 I'm using the following:

Alpha is Ll, Lo, Lt, Lu.
Digit is Nd.
Word is Ll, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc.

These are what are specified at
http://www.regular-expressions.info/posixbrackets.html
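
The Word list above translates directly into a predicate over
unicodedata.category(). This is only a sketch of the idea, not the code from
issue #2636; WORD_CATEGORIES and is_word_char are names invented here.

```python
import unicodedata

# General categories listed for "Word" in the message above.
WORD_CATEGORIES = frozenset("Ll Lo Lt Lu Mc Me Mn Nd Nl No Pc".split())


def is_word_char(ch):
    # True if the code point's general category is in the Word list.
    return unicodedata.category(ch) in WORD_CATEGORIES


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(all(is_word_char(ch) for ch in hindi))  # True
print(is_word_char("_"), is_word_char(" "))   # True False
```

With the Mc and Mn categories included, all six code points of the Hindi
example count as word characters.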

--
nosy: +mrabarnett




[issue1693050] \w not helpful for non-Roman scripts

2008-11-28 Thread Martin v. Löwis

Martin v. Löwis added the comment:

Unicode TR#18 defines \w as a shorthand for

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}

which would include all marks. We should recursively check whether we
follow the recommendation (e.g. \p{alpha} refers to all characters having
the Alphabetic derived core property, which is Lu+Ll+Lt+Lm+Lo+Nl +
Other_Alphabetic, where Other_Alphabetic is a selected list of
additional characters - all from Mn/Mc)
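
As a rough illustration of what following this recommendation would mean,
the sketch below splits text into words using general categories only. It
rests on several assumptions: the statement above that Other_Alphabetic
characters all come from Mn/Mc is taken at face value (so \p{gc=Mark} already
covers them), \p{digit} is read as Nd, and unicodedata is used because it
exposes general categories but not the Alphabetic property itself. The names
TR18_WORD_CATEGORIES, is_tr18_word_char and words are invented for this
example.

```python
import unicodedata
from itertools import groupby

# Lu+Ll+Lt+Lm+Lo+Nl (alpha without Other_Alphabetic), all marks, Nd, Pc.
TR18_WORD_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Mn Mc Me Nd Pc".split())


def is_tr18_word_char(ch):
    # Approximate \w membership by general category alone.
    return unicodedata.category(ch) in TR18_WORD_CATEGORIES


def words(text):
    # Collect maximal runs of word characters, dropping everything else.
    return [
        "".join(run)
        for is_word, run in groupby(text, key=is_tr18_word_char)
        if is_word
    ]


sample = "\u0939\u093f\u0928\u094d\u0926\u0940 is_ok"
print([ascii(w) for w in words(sample)])
```

A category-based approximation like this is not a substitute for the real
derived properties; it only shows how the marks end up inside the word.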

--
nosy: +loewis




[issue1693050] \w not helpful for non-Roman scripts

2008-11-28 Thread Terry J. Reedy

Terry J. Reedy added the comment:

Vowel 'marks' are condensed vowel characters and are very much part of
words and do not separate words.  Python3 properly includes Mn and Mc as
identifier characters.

http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords

For instance, the word 'hindi' has 3 consonants 'h', 'n', 'd', 2 vowels
'i' and 'ii' (long i) following 'h' and 'd', and a null vowel (virama)
after 'n'. [The null vowel is needed because no vowel mark indicates the
default vowel short a.  So without it, the word would be hinadii.]
The difference between the Devanagari vowel characters, used at the
beginning of words, and the vowel marks, used thereafter, is purely
graphical and not phonological.  In short, in the Sanskrit family,
word = syllable+
syllable = vowel | consonant + vowel mark
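
The grammar above can be approximated in a few lines by attaching every
combining mark to the letter before it. This is a deliberately crude sketch:
it uses the general category (any M* category counts as a "vowel mark",
including the virama) and ignores conjunct consonants, so it should not be
read as real syllabification. The function name syllables is an assumption
made for this example.

```python
import unicodedata


def syllables(word):
    """Group each letter with the combining marks that follow it."""
    groups = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and groups:
            groups[-1] += ch   # attach the mark to the preceding letter
        else:
            groups.append(ch)  # a letter (or a stray leading mark) starts a group
    return groups


hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
print([ascii(s) for s in syllables(hindi)])
```

For the word 'hindi' this yields three groups, one per consonant, each
carrying its vowel mark or virama.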

From a comp.lang.python post asking why re does not see hindi as a word:

हिन्दी
 ह DEVANAGARI LETTER HA (Lo)
 ि DEVANAGARI VOWEL SIGN I (Mc)
 न DEVANAGARI LETTER NA (Lo)
 ् DEVANAGARI SIGN VIRAMA (Mn)
 द DEVANAGARI LETTER DA (Lo)
 ी DEVANAGARI VOWEL SIGN II (Mc)
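
A listing like the one above can be regenerated with the standard
unicodedata module. This is a small sketch; the helper name describe is
invented for illustration.

```python
import unicodedata


def describe(text):
    # One (character, name, general category) tuple per code point.
    return [
        (ch, unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


for ch, name, cat in describe("\u0939\u093f\u0928\u094d\u0926\u0940"):
    print(f" {ch} {name} ({cat})")
```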

.isalpha and possibly other Unicode methods need fixing also:
>>> 'हिन्दी'.isalpha()  # 2.x and 3.0
False

--
nosy: +tjreedy
versions: +Python 3.1




[issue1693050] \w not helpful for non-Roman scripts

2008-09-28 Thread Jeffrey C. Jacobs

Changes by Jeffrey C. Jacobs:


--
nosy: +timehorse
versions: +Python 2.7 -Python 2.4




[issue1693050] \w not helpful for non-Roman scripts

2008-04-24 Thread Russ Cox

Changes by Russ Cox:


--
nosy: +rsc
