Albert Strasheim wrote:
> Hello all
>
> I recently started looking at David Cournapeau's PyEM package, specifically
> his implementation of the K-Means algorithm. He implemented part of this
> algorithm in pure Python and also provided a Pyrex alternative that is
> significantly faster (about 10 times with the data I tested with). I tried
> to figure out why this is so.
>
For what it's worth, PyEM has been quietly updated in the past weeks. Most of
the responses I got initially were people reporting errors almost always
related to numpy API changes, so I decided to keep quiet for a while, waiting
for an API freeze on the numpy side. The last versions use vq from
scipy.cluster, for example; the plotting has been much improved; and a fast
ctypes version of the diagonal Gaussian kernel makes it possible to run EM on
hundreds of thousands of frames of several dozen dimensions, with several
dozen Gaussians in the mixture, in a reasonable time.

> The job of this part of the algorithm is pretty simple: from a bunch of
> cluster means (the codebook) find the nearest cluster mean for each data
> point. The first attempt at implementing this algorithm might use two for
> loops, one over the data points and one over the cluster means, computing a
> Euclidean distance at each step. Slightly better is to use one loop and some
> broadcasting.
>
> By randomly trying some code, I arrived at the following one-liner:
>
> N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1)
>
> where
>
> data = N.random.randn(2000, 39)
> nclusters = 64
> code = data[0:nclusters,:]
>
> This code was still slow, so I mangled it a bit so that I could profile it
> using cProfile under Python 2.5 (use profile if you have an older version of
> Python):
>
> def kmean():
>     data = N.random.randn(2000, 39)
>     nclusters = 64
>     code = data[0:nclusters,:]
>     for i in xrange(10):
>         def a(): return data[...,N.newaxis,...]
>         z = a()
>         def b(): return z-code
>         z = b()
>         def c(): return z**2
>         z = c()
>         def d(): return N.sum(z, 2)
>         z = d()
>         def e(): return z.argmin(axis=1)
>         z = e()
>
> if __name__ == '__main__':
>     import cProfile
>     cProfile.run("kmean()")
>
> Profiler output:

I got exactly the same kind of weird behaviour in other places of PyEM:
apparently simple subtractions taking a lot of time compared to other parts,
although they should theoretically be negligible. For example, X - mu, where
X is (n, d) and mu is (d, 1), takes almost as much time as exp(X**2)!
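For the nearest-mean search itself, the "one loop and some broadcasting"
variant you mention would look roughly like the sketch below (untested, and
the function name is mine). Looping over the few cluster means keeps the
temporaries at shape (nframes, ndim), whereas the one-liner has to build a
full (nframes, nclusters, ndim) intermediate:

import numpy as N

def nearest_mean(data, code):
    # data is (nframes, ndim), code is (nclusters, ndim).
    nframes = data.shape[0]
    nclusters = code.shape[0]
    dist = N.empty((nframes, nclusters))
    # One loop over the cluster means; the subtraction broadcasts
    # code[k], of shape (ndim,), against data, of shape (nframes, ndim).
    for k in xrange(nclusters):
        diff = data - code[k]
        dist[:, k] = N.sum(diff * diff, axis=1)
    # Index of the closest cluster mean for each frame.
    return dist.argmin(axis=1)

With data of shape (2000, 39) and 64 clusters, this never holds more than a
(2000, 39) temporary, instead of the (2000, 64, 39) array created by the
broadcasting one-liner.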
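Regarding the subtraction timings, this is roughly how I would measure the
two operations in isolation (only a sketch: the shapes are made up, and mu is
laid out here as a row so that it broadcasts against X):

import timeit

setup = "import numpy as N; X = N.random.randn(100000, 30); mu = N.random.randn(1, 30)"

# Time 10 repetitions of each operation on the same data.
print "X - mu      :", timeit.Timer("X - mu", setup).timeit(10)
print "N.exp(X**2) :", timeit.Timer("N.exp(X ** 2)", setup).timeit(10)

The absolute numbers obviously depend on the machine, but this at least makes
the comparison easy to reproduce.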
It would be great to at least know the origin of this non-intuitive result,

David