There have been a bunch of threads lately about string implementations, and 
that got me thinking (which is often a dangerous thing).

Let's assume you're testing two strings for equality.  You've already 
done the obvious quick tests (i.e. they're the same length), and you're 
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings 
instead of at the beginning?  If the strings are indeed equal, it's the 
same amount of work starting from either end.  But, if it turns out that 
for real-life situations, the ends of strings have more entropy than the 
beginnings, the odds are you'll discover that they're unequal quicker by 
starting at the end.
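
To make the idea concrete, here's a rough pure-Python sketch of an equality test that scans from the last character backwards.  The name equal_from_end is made up for illustration; a real string implementation would do this in C, so this only shows the logic, not the speed.

```python
def equal_from_end(a: str, b: str) -> bool:
    """Test two strings for equality, scanning from the end (illustrative only)."""
    # The obvious quick test: different lengths can never be equal.
    if len(a) != len(b):
        return False
    # The O(n) part, walking the indices from last to first.
    for i in range(len(a) - 1, -1, -1):
        if a[i] != b[i]:
            return False
    return True
```

For equal strings this touches every character, exactly like a front-to-back scan; the only difference is where a mismatch gets noticed.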

It doesn't seem implausible that this is the case.  For example, most 
of the filenames I work with begin with "/home/roy/".  Most of the 
strings I use as memcache keys have one of a small number of prefixes.  
Most of the strings I use as logger names have common leading 
substrings.  Things like credit card and telephone numbers tend to have 
much more entropy in the trailing digits.
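
As a back-of-the-envelope illustration, the sketch below counts how many character comparisons each direction needs before it hits a mismatch.  The two filenames and the helper name probes_until_mismatch are invented for the example; it says nothing about real-world distributions, which is exactly the thing that would need measuring.

```python
def probes_until_mismatch(a: str, b: str, from_end: bool = False) -> int:
    """Count comparisons made before a mismatch is found.

    Assumes len(a) == len(b); if the strings are equal, every
    character gets compared and the count equals the length.
    """
    if from_end:
        indices = range(len(a) - 1, -1, -1)
    else:
        indices = range(len(a))
    count = 0
    for i in indices:
        count += 1
        if a[i] != b[i]:
            break
    return count


# Hypothetical filenames sharing a long common prefix.
a = "/home/roy/notes.txt"
b = "/home/roy/notes.bak"

front = probes_until_mismatch(a, b)
back = probes_until_mismatch(a, b, from_end=True)
print(front, back)
```

With this particular pair the backward scan stops at the very first comparison, while the forward scan has to get through the whole shared prefix first.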

On the other hand, hostnames (and thus email addresses) exhibit the 
opposite pattern.

Anyway, it's just a thought.  Has anybody studied this for real-life 
usage patterns?

I'm also not sure how this would work with all the possible UCS/UTF encodings.  
With some of them, you may get the encoding semantics wrong if you don't 
start from the front.
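
One case that looks safe, as far as I can tell: if both strings are already in the same encoding and the same normalization form, a pure equality test on the raw bytes shouldn't care which end it starts from, since nothing needs to be decoded.  The sketch below checks that assumption against Python's own == for a few encodings; bytes_equal_from_end is a made-up helper, and this doesn't cover ordering comparisons or strings that differ only in normalization.

```python
def bytes_equal_from_end(a: bytes, b: bytes) -> bool:
    """Compare two encoded strings byte by byte, last byte first."""
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(reversed(a), reversed(b)))


# Sample pairs, including non-ASCII text, chosen only for illustration.
pairs = [("naïve", "naïve"), ("naïve", "naive"), ("日本語", "日本誤")]

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    for s, t in pairs:
        a, b = s.encode(encoding), t.encode(encoding)
        # The backward scan should agree with the built-in comparison.
        assert bytes_equal_from_end(a, b) == (a == b)
```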
-- 
http://mail.python.org/mailman/listinfo/python-list