Steven Bethard wrote:
> nikie wrote:
> > Steven Bethard wrote:
> >
> >> John Salerno wrote:
> >>> If I want to make a list of four items, e.g. L = ['C', 'A', 'D', 'B'],
> >>> and then figure out if a certain element precedes another element,
> >>> what would be the best way to do that?
> >>>
> >>> Looking at the built-in list functions, I thought I could do
> >>> something like:
> >>>
> >>> if L.index('A') < L.index('D'):
> >>>     # do some stuff
> >>
> >> This is probably a pretty reasonable approach as long as you're not
> >> too worried about efficiency. It scans the list twice though, so if
> >> you start doing this with really big lists, it might be better to
> >> do something like:
> >
> > On average, it'll only have to go through half the list twice, as it
> > can break as soon as it finds the item it searches for. Don't think
> > you can get any more efficient than that.
>
> Sure you can:
>
> a_index = None
> d_index = None
> for i, item in enumerate(L):
>     if item == 'A':
>         a_index = i
>     elif item == 'D':
>         d_index = i
>     if a_index is not None and d_index is not None:
>         break
>
> if a_index < d_index:
>     # do some stuff
>
> That goes through exactly the minimum number of elements. But it's
> probably slower than two .index() calls since .index() is coded in C.
> But timeit and see if you're interested.
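Taking up that suggestion, a timeit comparison might look something
like this (a sketch only; the list size and where 'A' and 'D' sit in
it are arbitrary choices, and the numbers depend on them):

import timeit

setup = """
L = ['C', 'B'] * 5000 + ['A', 'D']   # made-up data: 'A', 'D' at the end

def two_index():
    return L.index('A') < L.index('D')

def explicit_loop():
    a_index = None
    d_index = None
    for i, item in enumerate(L):
        if item == 'A':
            a_index = i
        elif item == 'D':
            d_index = i
        if a_index is not None and d_index is not None:
            break
    return a_index < d_index
"""

print(timeit.Timer("two_index()", setup).timeit(number=1000))
print(timeit.Timer("explicit_loop()", setup).timeit(number=1000))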
On my PC, calling index twice is 5 times faster. But let's just pretend
for a moment that Python code were compiled and optimized as well as C
code. Even then, this function would still not be faster than calling
index twice.

You seem to be under the impression that "looping through a list" takes
a lot of time and "comparing items" takes none. The reverse is the
case: looping through a list means the processor has to do something
like "increment an index" once for every item in the list (for
optimized C code, that is; looping through a generator like the one
returned by enumerate in interpreted code is a lot more complex).
Comparing two items, on the other hand, involves (as lists aren't
statically typed) looking up a cmp method dynamically, calling it
dynamically, and, of course, the string comparison itself, which again
involves two pointer lookups for every character that has to be
compared. So, if you want to know how fast your function will be, you
have to count the number of comparisons. (Of course, we're ignoring the
fact that smaller functions tend to run quicker due to cache,
branch-prediction and optimization effects.)

If you call your function with the list ['C', 'A', 'D', 'B'], it will
compare the first item, 'C', to 'A' and 'D', then the second, 'A', to
'A', and the third, 'D', to 'A' and 'D', so it has to do 5 comparisons,
correct? If you call index to find the first occurrence of 'A' in the
same list, it has to do 2 comparisons (it can break as soon as it finds
the item), and 3 comparisons are needed to find 'D', so it does 5
comparisons, too. Plus, your version has a small overhead for comparing
"a_index" and "d_index" to None (which is faster than a string
comparison, but still takes time, probably more than incrementing an
index while looping through a list).

Things get even worse if 'A' and 'D' aren't neighbors in the list,
because then all the items between 'A' and 'D' have to be compared to
both 'A' and 'D' in the version above, but only to 'D' if you call
index twice. So the function above isn't only less readable, it's also
slower in the average case.

You might, however, be able to tweak the performance of the index
version a little bit if you call it only on small chunks of the list at
a time, using the form of index() that takes start and stop indices, so
each part of the list only has to be moved from memory into the cache
once; something like the sketch below. But I'm not sure the performance
is memory-bound at all (always hard to tell in an interpreted
language).
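Roughly what I have in mind (a sketch, not tuned or profiled; the chunk
size is an arbitrary guess, and it relies on index() raising ValueError
when the item isn't found within the given slice):

def first_positions(L, a, b, chunk=1024):
    # Scan chunk-sized slices with list.index(item, start, stop), so
    # each part of the list is traversed while it is (hopefully) still
    # in the cache. Returns (index of a, index of b), with None for an
    # item that never occurs.
    a_index = None
    b_index = None
    for start in range(0, len(L), chunk):
        stop = min(start + chunk, len(L))
        if a_index is None:
            try:
                a_index = L.index(a, start, stop)
            except ValueError:
                pass
        if b_index is None:
            try:
                b_index = L.index(b, start, stop)
            except ValueError:
                pass
        if a_index is not None and b_index is not None:
            break
    return a_index, b_index

a_index, d_index = first_positions(['C', 'A', 'D', 'B'], 'A', 'D')
print(a_index < d_index)   # True

As a side effect, once 'A' has been found, the remaining chunks are
only scanned for 'D', which also avoids the double comparisons for the
items in between.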
> >> >>> L = ['C', 'A', 'D', 'B']
> >> >>> positions = dict((item, i) for i, item in enumerate(L))
> >> >>> positions
> >> {'A': 1, 'C': 0, 'B': 3, 'D': 2}
> >> >>> positions['A'] < positions['D']
> >> True
> >
> > Isn't this bound to be less efficient? I mean, inserting the items
> > into a dict is probably O(n log n)
>
> No, it's O(N). Dicts are just hash tables. Insertion is O(1). So N
> insertions is O(N).

I've measured that with the timeit module. The time it takes to build a
dict with N entries doesn't seem to be proportional to N; (t-c)/n keeps
increasing. Try this:

import timeit

def time(N):
    return timeit.Timer(
        "dict([('%%010i' %% i, i) for i in range(%i)])" % N
    ).timeit(number=5)

# Estimate the constant per-call overhead c, assuming time(N) = c + k*N.
c = time(1000)*2 - time(2000)

for i in range(1000, 1000000, 1000):
    t = time(i)
    print "%5i -> %f (t/n = %f)" % (i, t, (t-c)/i*1000)

> This route is also dramatically more efficient if you need to make M
> comparisons between items instead of just 1 comparison. In that case,
> using the dict is O(N + M) instead of the O(N * M) of the .index()
> route.

Assuming (as I have said already) that you build the dict once, but
call index again and again for every comparison, which is of course
comparing apples with oranges.
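To spell out the two workloads being compared there, a sketch (the list
contents, the number of queries M, and the query pairs are all made
up): the dict version pays O(N) once and then answers each query with
two O(1) lookups, while the index version re-scans the list for every
query. Whether the dict pays off depends entirely on M; for a single
comparison it just adds the build cost on top.

import timeit

setup = """
L = ['item%06i' % i for i in range(10000)]
# 200 made-up (a, b) query pairs; every queried item is present in L.
queries = [('item%06i' % (7 * i % 10000), 'item%06i' % (13 * i % 10000))
           for i in range(200)]

def with_dict():
    # Build the position table once: O(N)...
    positions = dict((item, i) for i, item in enumerate(L))
    # ...then each of the M queries is a pair of O(1) lookups.
    return [positions[a] < positions[b] for a, b in queries]

def with_index():
    # Every query re-scans the list: O(N) per query, O(N*M) in total.
    return [L.index(a) < L.index(b) for a, b in queries]
"""

print(timeit.Timer("with_dict()", setup).timeit(number=10))
print(timeit.Timer("with_index()", setup).timeit(number=10))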