I have a 2-dimensional Numeric array with the shape (N,2) and I want to remove all duplicate rows from it. For example, if I start out with

    [[1,2], [1,3], [1,2], [2,3]]

I want to end up with

    [[1,2], [1,3], [2,3]]

(The order of the rows doesn't matter, although the order of the two elements within each row does.) The problem is that I can't find any way of doing this that is efficient with large data sets (in the data set I am using, N > 1,000,000). The usual method of removing duplicates - putting the elements into a dictionary and then reading off the keys - doesn't work directly, because the keys - the rows, which are themselves Numeric arrays - aren't hashable. The best I have been able to do so far is:

    def remove_duplicates(x):
        d = {}
        for (a, b) in x:
            d[(a, b)] = (a, b)
        return array(d.values())

According to the profiler, the loop takes about 7 seconds and the call to array() about 10 seconds with N=1,700,000. Is there a faster way to do this using Numeric?

-Alex Mont
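
For reference, here is the approach above as a self-contained snippet, run on the small example from the post. The driver lines and comments are additions for illustration; only array() from the Numeric module is assumed, and the function body is the one given above.

    from Numeric import array

    def remove_duplicates(x):
        # Tuples are hashable, so use the tuple form of each row as a
        # dictionary key; each distinct row is then kept exactly once.
        d = {}
        for (a, b) in x:
            d[(a, b)] = (a, b)
        return array(d.values())

    x = array([[1, 2], [1, 3], [1, 2], [2, 3]])
    print(remove_duplicates(x))
    # Prints the three distinct rows [1 2], [1 3], [2 3]; the row order
    # may vary, since dictionary ordering is not guaranteed.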