>In article <is9ikg0...@news1.newsguy.com>, > Chris Torek <nos...@torek.net> wrote: >> Python might be penalized by its use of Unicode here, since a >> Boyer-Moore table for a full 16-bit Unicode string would need >> 65536 entries (one per possible ord() value).
In article <roy-751fac.23443902062...@news.panix.com> Roy Smith <r...@panix.com> wrote: >I'm not sure what you mean by "full 16-bit Unicode string"? Isn't >unicode inherently 32 bit? Well, not exactly. As I understand it, Python is normally built with a 16-bit "unicode character" type though (using either UCS-2 or UTF-16 internally; but I admit I have been far too lazy to look up stuff like surrogates here :-) ). >In any case, while I could imagine building a 2^16 entry jump table, >clearly it's infeasible (with today's hardware) to build a 2^32 entry >table. But, there's nothing that really requires you to build a table at >all. If I understand the algorithm right, all that's really required is >that you can map a character to a shift value. Right. See the URL I included for an example. The point here, though, is ... well: >For an 8 bit character set, an indexed jump table makes sense. For a >larger character set, I would imagine you would do some heuristic >pre-processing to see if your search string consisted only of characters >in one unicode plane and use that fact to build a table which only >indexes that plane. Or, maybe use a hash table instead of a regular >indexed table. Just so. You have to pay for one scan through the string to build a hash-table of offsets -- an expense similar to that for building the 256-entry 8-bit table, perhaps, depending on string length -- but then you pay again for each character looked-at, since: skip = hashed_lookup(table, this_char); is a more complex operation than: skip = table[this_char]; (where table is a simple array, hence the C-style semicolons: this is not Python pseudo-code :-) ). Hence, a "penalty". >Not as fast, but only slower by a small constant factor, >which is not a horrendous price to pay in a fully i18n world :-) Indeed. -- In-Real-Life: Chris Torek, Wind River Systems Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603 email: gmail (figure it out) http://web.torek.net/torek/index.html
-- http://mail.python.org/mailman/listinfo/python-list