"Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > Josiah Carlson schrieb: > > What about a tree structure over the top of the string as I described in > > another post? If there are no surrogate pairs, the pointer to the tree > > is null. If there are surrogate pairs, we could either use the > > structure as I described, or even modify it so that we get even better > > memory utilization/performance (choose tree nodes based on where > > surrogate pairs are, up to some limit). > > As always, it's a time-vs-space tradeoff. People tend to resolve these > in favor of time, accepting an increase in space. I'm not so sure this > is always the right answer. In the specific case, I'm also worried about > the increase in complexness. > > That said, it is always good to have a prototype implementation to > analyse the consequences better.
I'm away from my main machine at the moment, so I am unable to test my implementation, but I do have a sample. There are two main functions to this implementation. One which constructs a tree for O(log n) worst-case access to character addresses, and one which traverses the tree to discover the character address. For strings without surrogates, it's O(1) character address discovery. The implementation of surrogate discovery is very simple, using section 3.8 and 5.4 in the Unicode 4.0 standard. If there are no surrogates, it takes a single pass over the input, and constructs a single node (12 or 24 bytes, depending on the build, need to replace long with Py_ssize_t). If there are surrogates, it creates a block of nodes, adjusts pointers to create a tree, and returns a pointer to the root. The tree will have at most O(n/logn) nodes, though it will tend to create long blocks of non-surrogates, so that if you have a single surrogate in the middle of a huge string, it will be conceptually viewed as 3 blocks. Attached is my untested sample implementation (I'm away for the next week or so, and can't test), that should give an idea of what I was talking about. - Josiah
surrogate_tree.c
Description: Binary data
_______________________________________________ Python-3000 mailing list [email protected] http://mail.python.org/mailman/listinfo/python-3000 Unsubscribe: http://mail.python.org/mailman/options/python-3000/archive%40mail-archive.com
