Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance
On Thu, Apr 05, 2012 at 01:05:01PM -0700, Abhishek Pratap wrote:
> Also in my case I dont really have a good approximate on value of K in
> K-means.

That's a hard problem, for which I have no answer, sorry :$

G
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
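One common heuristic for the "what K?" question, not mentioned in the thread, is the elbow method: run k-means for several values of K and look for the point where the distortion stops dropping sharply. A minimal sketch using scipy.cluster.vq.kmeans (the data is made-up, three well-separated blobs, purely for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Illustrative data (not from the thread): three well-separated 2-D blobs.
rng = np.random.RandomState(0)
points = np.vstack([
    rng.normal(loc=c, scale=0.1, size=(300, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

# kmeans returns (codebook, mean distortion); the "elbow" is where the
# distortion stops improving much as K grows -- here, at K = 3.
np.random.seed(1)
distortions = {}
for k in (1, 2, 3, 4, 5):
    _, dist = kmeans(points, k)
    distortions[k] = dist
    print(k, dist)
```

The drop from K=1 to K=3 is large, and nearly flat afterwards, which suggests K=3 for this toy data.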
Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance
Also, in my case I don't really have a good approximation for the value of K in K-means.

-A
Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance
Hi Gael,

The MemoryError exception I am getting is from using scikit's DBSCAN implementation. I can check the mini-batch implementation of KMeans.

Best,
-Abhi
Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance
On Wed, Apr 04, 2012 at 04:41:51PM -0700, Abhishek Pratap wrote:
> Thanks Chris. So I guess the question becomes how can I efficiently
> cluster 1 million x,y coordinates.

Did you try scikit-learn's implementation of DBSCAN:
http://scikit-learn.org/stable/modules/clustering.html#dbscan ?
I am not sure that it scales, but it's worth trying.

Alternatively, the best way to cluster massive datasets is to use the mini-batch implementation of KMeans:
http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means

Hope this helps,

Gael
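The mini-batch variant Gael points at fits the model on fixed-size batches of samples, so peak memory is bounded by the batch size rather than by the number of points. A minimal sketch against scikit-learn's MiniBatchKMeans API (the data, batch size, and cluster count are illustrative stand-ins for the thread's ~900k points):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the ~900k (x, y) points in the thread.
rng = np.random.RandomState(0)
points = rng.rand(100_000, 2)

# Processes the data in batches of `batch_size` samples, so peak memory is
# bounded by the batch, not by the full dataset.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=10_000, n_init=3,
                      random_state=0)
labels = mbk.fit_predict(points)
print(labels.shape, mbk.cluster_centers_.shape)
```

For a million 2-D points this runs in seconds and never builds anything quadratic in the number of samples.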
Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance
Thanks Chris. So I guess the question becomes how can I efficiently cluster 1 million x,y coordinates.

-Abhi
Re: [Numpy-discussion] MemoryError : with scipy.spatial.distance
On Wed, Apr 4, 2012 at 4:17 PM, Abhishek Pratap wrote:
> close to a 900K points using DBSCAN algo. My input is a list of ~900k
> tuples each having two points (x,y) coordinates. I am converting them
> to numpy array and passing them to pdist method of
> scipy.spatial.distance for calculating distance between each point.

I think pdist creates an array that is:

sum(range(num_points)) in size.

That's going to be pretty darn big:

404,999,550,000 elements

I think that's about 3 terabytes:

In [41]: sum(range(900000)) / 1024. / 1024 / 1024 / 1024 * 8
Out[41]: 2.946759559563361

(for 64 bit floats)

> I think the error has something to do with the default double dtype
> of numpy array of pdist function.

you *may* be able to get it to use float32 -- but as you can see, that
probably won't help enough!

You'll need a different approach!

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

chris.bar...@noaa.gov
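Chris's back-of-the-envelope can be reproduced directly: pdist allocates a condensed distance vector of m*(m-1)/2 float64 values (8 bytes each), so for ~900k points:

```python
def pdist_nbytes(m, itemsize=8):
    """Bytes pdist's condensed output needs: m*(m-1)/2 values of `itemsize`."""
    return m * (m - 1) // 2 * itemsize

m = 900_000
nbytes = pdist_nbytes(m)
print(nbytes, "bytes =", nbytes / 1024**4, "TiB")  # about 2.95 TiB
```

Even halving `itemsize` to 4 (float32) still leaves ~1.5 TiB, which is why the float32 route doesn't rescue the dense-matrix approach.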
[Numpy-discussion] MemoryError : with scipy.spatial.distance
Hey Guys,

I am new to both python and more so to numpy. I am trying to cluster close to 900K points using the DBSCAN algo. My input is a list of ~900k tuples, each holding a point's (x,y) coordinates. I am converting them to a numpy array and passing it to the pdist method of scipy.spatial.distance for calculating the distance between each pair of points.

Here is some size info on my numpy array:

shape of input array : (828575, 2)
Size : 6872000 bytes

I think the error has something to do with the default double dtype of the numpy array in the pdist function. I would appreciate it if you could help me debug this; I am sure I am overlooking something naive here.

See the traceback below.

MemoryError                               Traceback (most recent call last)
/house/homedirs/a/apratap/Dropbox/dev/ipython/ in ()
     36
     37 print cleaned_senseBam
---> 38 cluster_pet_points_per_chromosome(sense_bamFile)

/house/homedirs/a/apratap/Dropbox/dev/ipython/ in cluster_pet_points_per_chromosome(bamFile)
     30     print 'Size of list points is %d' % sys.getsizeof(points)
     31     print 'Size of numpy array is %d' % sys.getsizeof(points_array)
---> 32     cluster_points_DBSCAN(points_array)
     33     #print points_array
     34

/house/homedirs/a/apratap/Dropbox/dev/ipython/ in cluster_points_DBSCAN(data_numpy_array)
      9 def cluster_points_DBSCAN(data_numpy_array):
     10     #eucledian distance calculation
---> 11     D = distance.pdist(data_numpy_array)
     12     S = distance.squareform(D)
     13     H = 1 - S/np.max(S)

/house/homedirs/a/apratap/playground/software/epd-7.2-2-rh5-x86_64/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
   1155
   1156     m, n = s
-> 1157     dm = np.zeros((m * (m - 1) / 2,), dtype=np.double)
   1158
   1159     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
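What DBSCAN actually needs from those distances is each point's eps-neighborhood, and for low-dimensional data that can be computed without ever materializing the m x m matrix, e.g. with scipy's k-d tree. A sketch of that alternative (the point count and eps value are illustrative, not from the thread):

```python
import numpy as np
from scipy.spatial import cKDTree

# Stand-in for the ~900k (x, y) points; the eps radius is illustrative.
rng = np.random.RandomState(0)
points = rng.rand(50_000, 2)
eps = 0.005

# Build a k-d tree once, then query eps-neighborhoods per point --
# no dense 50k x 50k (let alone 900k x 900k) distance matrix is built.
tree = cKDTree(points)
neighborhood = tree.query_ball_point(points[0], r=eps)
print(len(neighborhood))
```

Memory stays linear in the number of points, which is the property the pdist-based approach above is missing.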