Thomas Moore wrote:

> Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
> >>> u.split()
> [u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']
> >>>
> I think u should get split.

why?  split() with no arguments splits on whitespace (roughly unicode
category Zs, plus control characters such as tab and newline), and
there are no whitespace characters in that string:

>>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
>>> [c.isspace() for c in u]
[False, False, False, False, False, False]
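
for contrast, here is a small sketch of what split() does once the string really contains whitespace.  The variable names and the choice of U+3000 (IDEOGRAPHIC SPACE) as the separator are mine, not from the original post, and the code assumes Python 3 (or a Python 2 unicode string):

```python
import unicodedata

# The string from the original post: six ideographs, no whitespace.
u = u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'

# Hypothetical variant with U+3000 IDEOGRAPHIC SPACE inserted by hand.
spaced = u'\u9019\u662f\u3000\u4e2d\u6587\u3000\u5b57\u4e32'

# Every character in the original string is a letter (category Lo),
# so split() has nothing to split on.
categories = [unicodedata.category(c) for c in u]

# U+3000 is in category Zs and counts as whitespace, so split() uses it.
separator_category = unicodedata.category(u'\u3000')
parts = spaced.split()

print(categories)
print(separator_category)
print(len(parts))
```

the split only happens because a separator was put there explicitly; nothing in split() knows where Chinese words begin or end.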

there's no universal "split on words in all languages" function in the
standard Python library.  You may be able to roll your own using the
information in plus functions
in the unicodedata module (which currently doesn't include the
BreakTest tables; patches are welcome).  Or maybe Google can
help you find an existing implementation.
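
as a rough idea of what "roll your own" could look like, here is a minimal sketch built only on the unicodedata module.  It is not a real word segmenter and does not implement the Unicode boundary rules: it simply treats every CJK ideograph as its own token and groups other non-space, non-punctuation characters into runs.  The function names crude_tokens and is_cjk_ideograph are hypothetical, and using the character name as a script test is an assumption made for brevity (Python 3 assumed):

```python
import unicodedata


def is_cjk_ideograph(ch):
    # Assumption: the Unicode character name is a good enough script test
    # for a sketch.  Unnamed characters fall back to the empty string.
    return unicodedata.name(ch, u'').startswith(u'CJK UNIFIED IDEOGRAPH')


def crude_tokens(text):
    """Split text into tokens: one per ideograph, one per run of other
    characters, with whitespace and punctuation acting as separators."""
    tokens = []
    current = []

    def flush():
        # Emit the pending run of non-ideograph characters, if any.
        if current:
            tokens.append(u''.join(current))
            del current[:]

    for ch in text:
        if is_cjk_ideograph(ch):
            flush()
            tokens.append(ch)
        elif ch.isspace() or unicodedata.category(ch).startswith('P'):
            flush()
        else:
            current.append(ch)
    flush()
    return tokens


print(len(crude_tokens(u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32')))
print(crude_tokens(u'hello, world'))
```

splitting per character is only a fallback; grouping ideographs into actual words needs a dictionary or a statistical model, which is outside what unicodedata provides.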



