Florian Weimer, 28.01.2011 15:27:
* Stefan Behnel:

The nice thing about Py_UNICODE is that it basically gives you native
Unicode code points directly, without needing to decode UTF-8 byte
runs and the like. In Cython, it allows you to do things like this:

     def test_for_those_characters(unicode s):
         for c in s:
             # warning: randomly chosen Unicode escapes ahead
             if c in u"\u0356\u1012\u3359\u4567":
                 return True
         else:
             return False

The loop runs in plain C, using the somewhat obvious implementation
with a loop over Py_UNICODE characters and a switch statement for the
comparison. This would look a *lot* more ugly with UTF-8 encoded byte
strings.

Not really, because UTF-8 is quite search-friendly.  (The if would
have to invoke a memmem()-like primitive.)  Random subscripts are
problematic.
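
To make the memmem() remark concrete, here is a minimal sketch in Python 3 (not the Cython code from the thread). The function name contains_any_utf8 is hypothetical, and bytes.find merely stands in for a memmem()-like primitive. It relies on the input being valid UTF-8: lead bytes and continuation bytes never overlap, so an encoded character can only match at a character boundary.

```python
def contains_any_utf8(haystack: bytes, needles: str) -> bool:
    """Return True if any code point in `needles` occurs in `haystack`.

    `haystack` is assumed to hold valid UTF-8. Each needle is encoded
    separately and searched for at the byte level; bytes.find plays the
    role of memmem() here.
    """
    for ch in needles:
        if haystack.find(ch.encode("utf-8")) != -1:
            return True
    return False
```

No decoding of the haystack is needed for the search itself. Indexing by character position is a different matter, since the n-th character does not sit at a fixed byte offset.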

However, why would one want to write loops like the above?  Don't you
have to take combining characters (comprising multiple codepoints)
into account most of the time when you look at individual characters?
Then UTF-32 does not offer much of a simplification.
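
A small illustration of the combining-character concern, again as a Python 3 sketch with hypothetical helper names. It only covers the case where a precomposed form exists, so NFC normalization is enough; it is not a general treatment of grapheme clusters.

```python
import unicodedata


def contains_char(text: str, target: str) -> bool:
    """Naive test: compare `target` against each code point in `text`."""
    return any(c == target for c in text)


def contains_char_nfc(text: str, target: str) -> bool:
    """Same test, but after composing both sides with NFC normalization."""
    text = unicodedata.normalize("NFC", text)
    target = unicodedata.normalize("NFC", target)
    return any(c == target for c in text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT renders like U+00E9,
# but is two code points, so the naive loop does not see U+00E9.
decomposed = "cafe\u0301"
```

With these definitions, contains_char(decomposed, "\u00e9") is False while contains_char_nfc(decomposed, "\u00e9") is True, regardless of whether the string is stored as UTF-8 or UTF-32.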

Hmm, I think this discussion is pointless. Regardless of the memory layout, you can always go down to the byte level and use an efficient (multi-)substring search algorithm (which is obviously helped if you know the layout at compile time *wink*).

Bad example, I guess.

Stefan

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev