Steven Bethard <[EMAIL PROTECTED]> wrote:
   ...
> I have a list[1] of objects from which I need to remove duplicates.  I
> have to maintain the list order though, so solutions like set(lst),
> etc. will not work for me.  What are my options?  So far, I can see:
I think the recipe by that subject in the 1st Edition Python Cookbook is
pretty exhaustive wrt the possibilities that were available up to Python
2.2, under various hypotheses: are the items hashable, do you want to
keep the first one of several duplicates or the last one or some one
chosen by arbitrary criteria, etc, etc.

In 2.3/2.4, the addition of `set' simplifies many solutions, and
itertools simplifies others; besides, it appears that you don't care
which duplicate is kept, need use equality only (not an
equivalence-function), and do have hashable items, so most of the ideas
in that recipe aren't needed anyway.

> def filterdups(iterable):
>     seen = set()
>     for item in iterable:
>         if item not in seen:
>             seen.add(item)
>             yield item
>
> Does anyone have a better[2] solution?
>
> STeve
>
> [1] Well, actually it's an iterable of objects, but I can convert it
> to a list if that's helpful.
>
> [2] Yes I know, "better" is ambiguous.  If it helps any, for my
> particular situation, speed is probably more important than memory,
> so I'm leaning towards the second or third implementation.

I doubt you can do _substantially_ better than filterdups, unless some
particularly special conditions apply, e.g., the items are hashable but
hashing them is very time-consuming.  In that case you might avoid
hashing items twice, as filterdups does, by some hack such as:

def filterdups_hackish(iterable):
    seen = set()
    olen = 0
    for item in iterable:
        seen.add(item)
        nlen = len(seen)
        if nlen > olen:
            yield item
            olen = nlen

Here's a toy example of how this might help, given costly hashing:

class examp(object):
    def __init__(self, i):
        self.i = i // 2
    def __hash__(self):
        return self.i**7 % 999999

dups = [examp(x) for x in range(1000)]

def fd1():
    for x in filterdups(dups):
        pass

def fd2():
    for x in filterdups_hackish(dups):
        pass

kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd1()'
10 loops, best of 3: 55.6 msec per loop
kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd2()'
10 loops, best of 3: 33.4 msec per loop

Probably not worth the bother even for this level of hashing costs and
density of duplicates, as you see, but maybe an idea to keep in mind if
the duplicates are few and hashing is really costly: having both the
membership test AND the insertion means you hash the item twice; the
hackish alternative is to insert anyway (hashing but once) and see if
that made a difference (taking the len is always cheap).


Alex

-- 
http://mail.python.org/mailman/listinfo/python-list
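
For anyone wanting to reproduce the comparison, the snippets above can be
collected into one self-contained module; the sketch below assumes the
file is saved as fid.py (matching the timeit session shown) and drives
timeit.Timer from inside the script rather than the command line, so the
exact numbers will naturally differ from those quoted.

# fid.py -- sketch collecting the snippets from the post into one module;
# the file name fid.py and the in-script timing loop are assumptions.

def filterdups(iterable):
    # straightforward order-preserving dedup: hashes each item twice
    # (once for the membership test, once for the insertion)
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

def filterdups_hackish(iterable):
    # insert unconditionally (one hash per item) and detect new items
    # by watching whether the set grew
    seen = set()
    olen = 0
    for item in iterable:
        seen.add(item)
        nlen = len(seen)
        if nlen > olen:
            yield item
            olen = nlen

class examp(object):
    # toy class with a deliberately costly __hash__, as in the post
    def __init__(self, i):
        self.i = i // 2
    def __hash__(self):
        return self.i**7 % 999999

dups = [examp(x) for x in range(1000)]

def fd1():
    for x in filterdups(dups):
        pass

def fd2():
    for x in filterdups_hackish(dups):
        pass

if __name__ == '__main__':
    # rough in-script equivalent of the command-line timeit runs shown
    import timeit
    for fn in ('fd1', 'fd2'):
        t = timeit.Timer('fid.%s()' % fn, setup='import fid')
        best = min(t.repeat(repeat=3, number=10)) / 10
        print('%s: %.1f msec per loop' % (fn, best * 1000))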