On 25/05/2006 5:43 AM, [EMAIL PROTECTED] wrote:
> I'm trying to make a Unicode-friendly regexp to grab sentences
> reasonably reliably for as many Unicode languages as possible, focusing
> on European languages first, hence it'd be useful to be able to refer
> to any uppercase Unicode character instead of just the typical [A-Z],
> which doesn't include, for example, É. Is there a way to do this, or
> do I have to stick with using the isupper method of the string class?
>
You have set yourself a rather daunting task. :-)

je suis ici a vous dire grandpere que maintenant nous ecrivons sans
accents sans majuscules sans ponctuation sans tout vive le sms vive la
revolution les professeurs a la lanterne ah m**** pas des lanternes (-:

[In English: i am here to tell you grandfather that now we write without
accents without capitals without punctuation without anything long live
sms long live the revolution the professors to the lamppost ah m**** not
lampposts]

I would have thought that a full-on NLP parser might be required, even
for more-or-less-conventionally-expressed utterances. How will you handle
"It's not elementary, Dr. Watson."?

However, if you persist: there appears to be no way of specifying "an
uppercase character" in Python's re module. You are stuck with isupper().

Light entertainment for the speed-freaks:

>>> ucucase = set(unichr(i) for i in range(65536) if unichr(i).isupper())
>>> len(ucucase)
704

Is "foo in ucucase" faster than "foo.isupper()"?

Cheers,
John
-- 
http://mail.python.org/mailman/listinfo/python-list
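
For anyone who wants to try the two ideas from the reply (a set of
uppercase characters built with isupper(), and the speed question), here
is a minimal sketch. It assumes Python 3, where chr() replaces unichr(),
so the size of the set will depend on the interpreter's Unicode version.
The names UPPER_CHARS, UPPER_CLASS, SENTENCE and compare_speed are made up
for illustration, and the sentence pattern is deliberately naive rather
than a recommended solution.

```python
import re
import timeit

# Assumption: Python 3, where str is Unicode and chr() replaces unichr().
# Every BMP code point that str.isupper() reports as uppercase.
UPPER_CHARS = frozenset(chr(i) for i in range(0x10000) if chr(i).isupper())

# Spell the set out as a character class. No uppercase character is special
# inside [...], so no escaping is needed.
UPPER_CLASS = "[" + "".join(sorted(UPPER_CHARS)) + "]"

# Deliberately naive: an uppercase character, then everything up to the
# next '.', '!' or '?'. Abbreviations such as "Dr." will split a sentence.
SENTENCE = re.compile(UPPER_CLASS + r"[^.!?]*[.!?]")


def compare_speed(ch="É", number=200_000):
    """Return (set_lookup_seconds, isupper_seconds) for one character."""
    env = {"ch": ch, "upper": UPPER_CHARS}
    set_time = timeit.timeit("ch in upper", globals=env, number=number)
    method_time = timeit.timeit("ch.isupper()", globals=env, number=number)
    return set_time, method_time


if __name__ == "__main__":
    print(SENTENCE.findall("Écoutez bien. Ça marche!"))
    print(SENTENCE.findall("It's not elementary, Dr. Watson."))
    print("set lookup: %.4fs, isupper(): %.4fs" % compare_speed())
```

The second findall() call splits the Dr. Watson example after "Dr.",
which is exactly the kind of case the reply warns about. The timings from
compare_speed() vary by machine and interpreter, so measure rather than
assume which one wins.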