"mclovin" <hanoo...@gmail.com> wrote in message news:c5332c9b-2348-4194-bfa0-d70c77107...@x3g2000yqa.googlegroups.com... > Currently I need to find the most common elements in thousands of > arrays within one large array (arround 2 million instances with ~70k > unique elements) > > so I set up a dictionary to handle the counting so when I am > iterating I up the count on the corrosponding dictionary element. I > then iterate through the dictionary and find the 25 most common > elements. > > the elements are initially held in a array within an array. so I am am > just trying to find the most common elements between all the arrays > contained in one large array. > my current code looks something like this: > d = {} > for arr in my_array: > -----for i in arr: > #elements are numpy integers and thus are not accepted as dictionary > keys > -----------d[int(i)]=d.get(int(i),0)+1 > > then I filter things down. but with my algorithm that only takes about > 1 sec so I dont need to show it here since that isnt the problem. > > > But there has to be something better. I have to do this many many > times and it seems silly to iterate through 2 million things just to > get 25. The element IDs are integers and are currently being held in > numpy arrays in a larger array. this ID is what makes up the key to > the dictionary. > > It currently takes about 5 seconds to accomplish this with my current > algorithm. > > So does anyone know the best solution or algorithm? I think the trick > lies in matrix intersections but I do not know.

Would the following work for you, or am I missing something? For a 5Kx5K
array, this takes about a tenth of a second on my machine. This code doesn't
deal with the sub-array issue.

#####################
import time

import numpy

LOWER = 0
UPPER = 1024
SIZE = 5000
NUM_BEST = 4

# sample data
data = numpy.random.randint(LOWER, UPPER, (SIZE, SIZE)).astype(int)

start = time.time()
# count[i] is the number of times the value i occurs in data
count = numpy.bincount(data.ravel())
# (count, value) pairs for the NUM_BEST most common values, ascending by count
best = sorted(zip(count, range(len(count))))[-NUM_BEST:]
print('time= %s' % (time.time() - start))
print(best)
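
For the sub-array issue, one possible approach is to join the sub-arrays into a single flat array before counting. The sketch below is untested against the original data and assumes the IDs are non-negative integers (which numpy.bincount requires) and that the input is not empty. The names most_common_ids and sample are made up for illustration. It also swaps the full sort for numpy.argpartition, so that only the k best candidates get sorted.

```python
import numpy as np

def most_common_ids(arrays, k=25):
    """Return (id, count) pairs for the k most frequent IDs, most frequent first."""
    # Join the sub-arrays (which may differ in length) into one flat array.
    flat = np.concatenate([np.asarray(a).ravel() for a in arrays])
    # counts[i] is the number of times ID i occurs; IDs must be >= 0.
    counts = np.bincount(flat)
    k = min(k, len(counts))
    # argpartition moves the k largest counts to the end without a full sort.
    top = np.argpartition(counts, -k)[-k:]
    # Sort only those k candidates by descending count.
    top = top[np.argsort(counts[top])[::-1]]
    return [(int(i), int(counts[i])) for i in top]

# Hypothetical sample data: sub-arrays of different lengths.
sample = [np.array([3, 1, 3]), np.array([7, 3, 1, 3]), np.array([7, 7])]
result = most_common_ids(sample, k=2)
print(result)
```

With ~70k unique IDs the counts array stays small, so memory should not be a concern. If the IDs were sparse and very large, numpy.unique with return_counts=True might be a better fit than bincount.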