There's been a bunch of threads lately about string implementations, and that got me thinking (which is often a dangerous thing).
Let's assume you're testing two strings for equality. You've already done the obvious quick tests (i.e they're the same length), and you're down to the O(n) part of comparing every character. I'm wondering if it might be faster to start at the ends of the strings instead of at the beginning? If the strings are indeed equal, it's the same amount of work starting from either end. But, if it turns out that for real-life situations, the ends of strings have more entropy than the beginnings, the odds are you'll discover that they're unequal quicker by starting at the end. It doesn't seem un-plausible that this is the case. For example, most of the filenames I work with begin with "/home/roy/". Most of the strings I use as memcache keys have one of a small number of prefixes. Most of the strings I use as logger names have common leading substrings. Things like credit card and telephone numbers tend to have much more entropy in the trailing digits. On the other hand, hostnames (and thus email addresses) exhibit the opposite pattern. Anyway, it's just a thought. Has anybody studied this for real-life usage patterns? I'm also not sure how this work with all the possible UCS/UTF encodings. With some of them, you may get the encoding semantics wrong if you don't start from the front. -- http://mail.python.org/mailman/listinfo/python-list