Thomas Moore wrote:

> Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
> >>> u.split()
> [u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32']
> >>>
>
> I think u should get split.

why? split() splits on whitespace (basically Unicode category Zs), and there are no whitespace characters in there:

    >>> u=u'\u9019\u662f\u4e2d\u6587\u5b57\u4e32'
    >>> [c.isspace() for c in u]
    [False, False, False, False, False, False]

there's no universal "split on words in all languages" function in the standard Python library. You may be able to roll your own using the information in

    http://www.unicode.org/reports/tr29/

plus functions in the unicodedata module (which currently doesn't include the BreakTest tables; patches are welcome). Or maybe Google can help you find an existing implementation.

</F>
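
For illustration only, here is a rough sketch of the "roll your own" idea. It is not an implementation of TR29 and it does no real word segmentation: it simply splits on whitespace and then treats every CJK unified ideograph as its own token. The helper names `is_cjk_ideograph` and `naive_tokens` are made up for this example, and the code uses Python 3 string literals rather than the Python 2.4 syntax in the session above.

```python
import unicodedata


def is_cjk_ideograph(ch):
    # unicodedata.name() gives names such as "CJK UNIFIED IDEOGRAPH-9019";
    # the "" default avoids a ValueError for unnamed characters.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def naive_tokens(text):
    """Split on whitespace, then emit each CJK ideograph as a separate token.

    Hypothetical helper: a crude approximation, not dictionary-based
    word segmentation.
    """
    tokens = []
    for chunk in text.split():
        run = ""  # accumulates consecutive non-ideograph characters
        for ch in chunk:
            if is_cjk_ideograph(ch):
                if run:
                    tokens.append(run)
                    run = ""
                tokens.append(ch)
            else:
                run += ch
        if run:
            tokens.append(run)
    return tokens


u = "\u9019\u662f\u4e2d\u6587\u5b57\u4e32"
print(u.split() == [u])        # the plain split() leaves the string whole
print(len(naive_tokens(u)))    # one token per ideograph
```

Splitting per character is usually too fine-grained for Chinese text, since most words span more than one ideograph; grouping characters into actual words needs a dictionary or a dedicated segmentation library.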