I don't know off-hand. I don't understand the importance of your constraint either.
On Thu, Aug 30, 2012 at 5:21 AM, dexter morgan <dextermorga...@gmail.com> wrote:

> OK, but as I said before, how do I achieve the same result without
> clustering, just linearly? Basically a join of the data set with itself,
>
> calculating the distance as I go.
>
> On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>
>> I don't mean that.
>>
>> I mean that a k-means clustering with pretty large clusters is a useful
>> auxiliary data structure for finding nearest neighbors. The basic outline
>> is that you find the nearest clusters and search those for near neighbors.
>> The first riff is that you use a clever data structure for finding the
>> nearest clusters so that you can do that faster than linear search. The
>> second riff is when you use another clever data structure to search each
>> cluster quickly.
>>
>> There are fancier data structures available as well.
>>
>> On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan <dextermorga...@gmail.com> wrote:
>>
>>> Right, but if I understood your suggestion, you look at the end goal,
>>> which is:
>>> 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
>>>
>>> for example, and you say: here we basically see a cluster, and that
>>> cluster is represented by the point [40.123,-50.432].
>>> Which points does this cluster contain?
>>> [[41.431,-43.32],[...,...],...,[...]]
>>> Meaning that for every point I have in the data set, you create a
>>> cluster.
>>> If you don't mean that, but you do mean to create clusters based on some
>>> random seed points or the like, that would mean that I'll have points
>>> (talking about the "end goal") that won't have enough points in their
>>> list.
>>>
>>> One of the criteria for a clustering is that for any clusters C_i and
>>> C_j (where i != j), C_i intersect C_j is empty.
>>>
>>> And again, how can I accomplish my task without running Mahout or a knn
>>> algorithm, just by calculating the distance between points?
>>> A join of a file with itself.
>>>
>>> Thanks
>>>
>>> On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>>>
>>>> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan <dextermorga...@gmail.com> wrote:
>>>>
>>>>> I understand your solution (I think); I hadn't thought of it in that
>>>>> particular way.
>>>>> I think that if, say, I have 1M data points, running knn with k=1M
>>>>> and n=10 (each point is a cluster that requires up to 10 points) is
>>>>> overkill.
>>>>
>>>> I am not sure I understand you. n = number of points. k = number of
>>>> clusters. For searching 1 million points, I would recommend thousands of
>>>> clusters.
>>>>
>>>>> How can I achieve the same result WITHOUT using Mahout, just running
>>>>> on the data set? I even think it will have the same complexity (O(n^2)).
>>>>
>>>> Running with a good knn package will give you roughly O(n log n)
>>>> complexity.
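
For reference, the two approaches discussed in this thread can be sketched in plain Python without Mahout: the linear self-join (compare each point against every other point) and the outline of finding the nearest clusters and then searching only those. This is only an illustrative sketch under assumptions that are not from the thread: the function names (dist2, brute_force_knn, kmeans, cluster_knn) and the probes parameter are made up for the example, points are treated as plain 2-D coordinates with squared Euclidean distance (not a geographic distance), and the nearest clusters are found by a simple linear scan over the centroids rather than the "clever data structure" mentioned above.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points (an assumption;
    real latitude/longitude data may need a geographic distance instead)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def brute_force_knn(points, i, n):
    """Linear self-join: indices of the n points closest to points[i].
    Doing this for every point costs O(N^2) distance computations."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (dist2(points[i], points[j]), j))
    return others[:n]

def assign(points, centroids):
    """Put every point index into the cluster of its nearest centroid."""
    members = [[] for _ in centroids]
    for idx, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
        members[nearest].append(idx)
    return members

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, members per cluster)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # hypothetical choice: random seed points
    for _ in range(iters):
        members = assign(points, centroids)
        for c, idxs in enumerate(members):
            if idxs:  # leave an empty cluster's centroid where it is
                centroids[c] = (sum(points[j][0] for j in idxs) / len(idxs),
                                sum(points[j][1] for j in idxs) / len(idxs))
    return centroids, assign(points, centroids)

def cluster_knn(points, centroids, members, i, n, probes):
    """Search only the `probes` clusters whose centroids are nearest to
    points[i]. Approximate unless probes equals the number of clusters."""
    order = sorted(range(len(centroids)),
                   key=lambda c: dist2(points[i], centroids[c]))
    candidates = [j for c in order[:probes] for j in members[c] if j != i]
    candidates.sort(key=lambda j: (dist2(points[i], points[j]), j))
    return candidates[:n]

if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.uniform(-90, 90), rng.uniform(-180, 180)) for _ in range(1000)]
    cents, mem = kmeans(pts, k=30)
    print(cluster_knn(pts, cents, mem, i=0, n=10, probes=3))
```

The clusters stay disjoint, but each point's neighbor list is assembled from several nearby clusters, so a point near a cluster boundary can still get a full list. Probing more clusters trades speed for accuracy; probing all of them reduces to the linear scan.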