Roy Smith <r...@panix.com> writes:

> There's been a bunch of threads lately about string implementations,
> and that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality.  You've already
> done the obvious quick tests (i.e. they're the same length), and
> you're down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the
> strings instead of at the beginning?  If the strings are indeed equal,
> it's the same amount of work starting from either end.  But, if it
> turns out that for real-life situations, the ends of strings have more
> entropy than the beginnings, the odds are you'll discover that they're
> unequal quicker by starting at the end.
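The back-to-front scan is easy to prototype in pure Python.  A minimal
sketch (the helper name eq_from_end is mine, for illustration only --
CPython's built-in == is a memcmp-style sweep in C and will beat any
Python-level loop, so this only shows the access pattern):

```python
def eq_from_end(a, b):
    """Equality test that scans from the end of the strings.

    Does the cheap length check first, as in Roy's setup; the scan
    then walks backwards, so a trailing difference is found early.
    (Hypothetical helper, not how CPython actually compares strings.)
    """
    if len(a) != len(b):
        return False
    for i in range(len(a) - 1, -1, -1):
        if a[i] != b[i]:
            return False
    return True
```

Timing this against the forward loop on your own data would show
whether the trailing-entropy assumption holds for your workload.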
I guess we all have examples with specific entropy distributions.
Going backwards on long strings can be profitable with file paths.  If
that is the case, and if you spend a lot of time comparing strings, why
not store them in reverse?

> It doesn't seem implausible that this is the case.  For example, most
> of the filenames I work with begin with "/home/roy/".  Most of the
> strings I use as memcache keys have one of a small number of
> prefixes.  Most of the strings I use as logger names have common
> leading substrings.

In that case I would split the paths and use some sort of string id, or
use intern() and then compare components with "is" instead of "==".
(Basically, turning the string of chars into a string of
ids/ints/pointers.)

> Things like credit card and telephone numbers tend to have much more
> entropy in the trailing digits.  On the other hand, hostnames (and
> thus email addresses) exhibit the opposite pattern.

Yes, the real world is a mess.

> Anyway, it's just a thought.  Has anybody studied this for real-life
> usage patterns?

I've tried the intern() trick several times with lists of strings, and
was satisfied, but this was in specific cases (with a finite set of
strings).  For "short" strings of chars (up to, say, a hundred
characters), I've never found anything significantly faster than the
default sweep (that was in C/C++, not Python).  For longer and/or more
structured strings/lists, making use of the structure is a good idea,
but I see no general pattern here (as you note).

> I'm also not sure how this works with all the possible UCS/UTF
> encodings.  With some of them, you may get the encoding semantics
> wrong if you don't start from the front.

If you're testing for equality, I can't see how this could matter, even
with variable-length encodings.  If you're comparing different
encodings, then you need different access methods to get at random
characters (but this doesn't affect the algorithm).
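The intern() trick above can be sketched as follows (in Python 3 the
function lives in the sys module; the path-splitting helper
intern_parts is my own name for illustration):

```python
import sys

def intern_parts(path):
    """Split a path and intern each component (hypothetical helper).

    Interned equal strings share a single object, so components can be
    compared with the O(1) identity test "is" instead of a
    character-by-character "==".
    """
    return [sys.intern(part) for part in path.split("/")]

a = intern_parts("/home/roy/projects/notes.txt")
b = intern_parts("/home/roy/projects/notes.txt")

# Identity comparison of interned components: one pointer check each.
same = len(a) == len(b) and all(x is y for x, y in zip(a, b))
```

This pays off when the same components recur many times (a finite set
of strings, as noted above); the interning itself costs a hash lookup,
so it is not a win for strings compared only once.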
If you're using a variable-length encoding, e.g. UTF-8, accessing a
character at a specific position is not possible in constant time.

-- 
Alain.
-- 
http://mail.python.org/mailman/listinfo/python-list
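For equality specifically, though, byte-for-byte comparison of the
encoded forms agrees with comparison of the decoded strings (given the
same sequence of code points), even though a byte offset into UTF-8 is
not a character offset.  A small illustration:

```python
s1 = "héllo"
s2 = "héllo"

b1 = s1.encode("utf-8")
b2 = s2.encode("utf-8")

# Equality of the strings and of their UTF-8 encodings coincide,
# so an equality sweep never needs to decode anything...
assert (s1 == s2) == (b1 == b2)

# ...but byte positions are not character positions: "é" encodes to
# two bytes, so the byte string is longer than the character string.
assert len(b1) == len(s1) + 1
```

This is why the direction of the sweep doesn't matter for equality,
while random character access does require walking from the front (or
an index built in advance).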