Here is a good paper on initializing k-means with kd-trees: 
http://www.sciencedirect.com/science/article/pii/S0167865507000165
You can use the FLANN package for large data sets. Make sure to read the FLANN 
documentation 
<http://www.cs.ubc.ca/research/flann/uploads/FLANN/flann_manual-1.8.4.pdf> 
to familiarize yourself with its options.
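
For orientation, here is a minimal sketch of how a FLANN query might look from 
Julia. The names used below (FLANNParameters, flann, nearest) are assumptions, 
not the confirmed wrapper API; check the wrapper's README and the manual above 
for the actual functions and options:

    using FLANN

    X = rand(10, 100_000)                 # 10-dimensional points, one per column
    params = FLANNParameters()            # default index/search settings (assumed constructor)
    index = flann(X, params)              # build an approximate NN index (assumed name)
    idxs, dists = nearest(index, rand(10), 5)   # 5 approximate nearest neighbours (assumed name)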

-- Art


On Monday, July 28, 2014 11:20:08 AM UTC-4, John Myles White wrote:
>
> Art wrapped FLANN, which should make things much easier.
>
>  — John
>
> On Jul 28, 2014, at 8:19 AM, Dahua Lin <lind...@gmail.com> wrote:
>
> We should probably incorporate (approximate) nearest neighbor search into 
> Clustering.jl at some point.
>
> Dahua
>
>
> On Monday, July 28, 2014 10:09:59 AM UTC-5, John Myles White wrote:
>>
>> FWIW, there’s a KD-tree implementation in NearestNeighbors.jl
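>>
>> For example (a minimal usage sketch; this assumes the current 
>> NearestNeighbors.jl API, which may differ from the 2014 interface):
>>
>>     using NearestNeighbors
>>
>>     data = rand(3, 10_000)                     # 3-dimensional points, one per column
>>     tree = KDTree(data)                        # build the kd-tree once
>>     idxs, dists = knn(tree, rand(3), 5, true)  # 5 nearest neighbours, sorted by distance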
>>
>>  — John
>>
>> On Jul 28, 2014, at 7:27 AM, Jacob Quinn <quinn....@gmail.com> wrote:
>>
>> This probably isn't very helpful currently, but I've been meaning to try a 
>> `kd-tree` implementation that allows fast clustering for up to 7-10 
>> dimensions. (There are also AD-trees for categorical data, which offer even 
>> better performance gains over traditional algorithms.)
>>
>>
>> http://www.autonlab.org/autonweb/14669/version/2/part/5/data/moore-veryfast.pdf?branch=main&language=en
>>
>> As a fun fact, Andrew Moore (author of the two algorithms/data structures 
>> mentioned above) started the Google Pittsburgh office after leaving CMU and 
>> he's just agreed to come back to CMU as the new dean of computer science!
>>
>> -Jacob
>>
>>
>>
>> On Mon, Jul 28, 2014 at 9:31 AM, René Donner <li...@donner.at> wrote:
>>
>>> Hi,
>>>
>>> Perhaps Quick-Shift clustering might be interesting as well [1]. It is 
>>> easy to implement, fast, and, in contrast to k-means / k-medoids (which it 
>>> generalizes), has the very appealing property that the initial, hierarchical 
>>> data structure has to be computed only once; you can then investigate 
>>> different settings of the parameter \tau (the splitting criterion) 
>>> extremely fast.
>>>
>>> In many cases it is easier to find a reasonable \tau than to come up 
>>> with the exact number of clusters your data is expected to have.
>>>
>>> Cheers,
>>>
>>> Rene
>>>
>>> [1] http://www.robots.ox.ac.uk/~vedaldi/assets/pubs/vedaldi08quick.pdf
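>>>
>>> A rough sketch of the quick shift idea described above (not the paper's 
>>> code; a naive O(n^2) illustration with a Gaussian density estimate), showing 
>>> that the tree is built once and re-cutting it for a new \tau is cheap:
>>>
>>>     # Build the quick-shift tree: link every point to its nearest
>>>     # neighbour of higher estimated density.
>>>     function quickshift_tree(X; sigma = 1.0)           # X: points as columns
>>>         n = size(X, 2)
>>>         D = [sum(abs2, X[:, i] .- X[:, j]) for i in 1:n, j in 1:n]
>>>         P = vec(sum(exp.(-D ./ (2 * sigma^2)), dims = 2))   # Parzen density
>>>         parent = collect(1:n)
>>>         linkd  = fill(Inf, n)
>>>         for i in 1:n, j in 1:n
>>>             if P[j] > P[i] && D[i, j] < linkd[i]
>>>                 parent[i] = j
>>>                 linkd[i]  = D[i, j]
>>>             end
>>>         end
>>>         return parent, sqrt.(linkd)
>>>     end
>>>
>>>     # Cut links longer than tau; every point inherits the label of its root.
>>>     function quickshift_labels(parent, linkd, tau)
>>>         root(i) = linkd[i] > tau ? i : root(parent[i])
>>>         return [root(i) for i in eachindex(parent)]
>>>     end
>>>
>>>     # parent, linkd = quickshift_tree(X, sigma = 0.5)   # computed once
>>>     # labels = quickshift_labels(parent, linkd, 0.3)    # cheap to redo per tau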
>>>
>>>
>>>
>>>
>>>
>>> On 28.07.2014, at 15:06, Randy Zwitch <randy....@fuqua.duke.edu> wrote:
>>>
>>> > I'm about to undertake a clustering exercise on a lot of data (roughly 
>>> 100MM rows * 12 columns per week, mixed floats/ints, for as many weeks as I 
>>> choose to use). I was going to downsample to 1MM events or so and use the 
>>> Clustering.jl package to get a preliminary idea of how many clusters to 
>>> estimate, since clustering a billion or more events will take a fair amount 
>>> of computation time. I'm familiar with the 'elbow method' of determining k, 
>>> but that seems a bit arbitrary.
>>> >
>>> > Is anyone familiar with either of the techniques described in these 
>>> two papers? There is a blog post (link) stating that the f(K) method is an 
>>> order of magnitude faster because it removes the need for Monte Carlo 
>>> methods. If anyone has practical experience with these, or advice about 
>>> other methods (bonus for providing Julia code!), it would be much 
>>> appreciated.
>>> >
>>> > http://www.stanford.edu/~hastie/Papers/gap.pdf
>>> >
>>> > http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
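>>> >
>>> > For reference, a rough sketch of the f(K) criterion from the second paper 
>>> > (Pham, Dimov & Nguyen 2005), using Clustering.jl for the per-K distortion. 
>>> > Treat it as an approximation of the paper's method, and it assumes the 
>>> > current Clustering.jl API (the totalcost field of the kmeans result):
>>> >
>>> >     using Clustering, Statistics
>>> >
>>> >     # f(K) well below 1 (the paper suggests < 0.85) marks candidate K values
>>> >     function fK(X, Kmax)                  # X: d x n data matrix, Kmax >= 2
>>> >         d = size(X, 1)
>>> >         S = zeros(Kmax)                   # S[K] = within-cluster distortion
>>> >         S[1] = sum(abs2, X .- mean(X, dims = 2))
>>> >         for K in 2:Kmax
>>> >             S[K] = kmeans(X, K).totalcost
>>> >         end
>>> >         f = ones(Kmax)
>>> >         alpha = 0.0
>>> >         for K in 2:Kmax
>>> >             alpha = K == 2 ? 1 - 3 / (4 * d) : alpha + (1 - alpha) / 6
>>> >             f[K] = S[K-1] == 0 ? 1.0 : S[K] / (alpha * S[K-1])
>>> >         end
>>> >         return f
>>> >     end
>>> >
>>> >     # f = fK(X, 10); findall(f .< 0.85) are the candidate cluster counts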
>>> >
>>> >
>>>
>>>
>>
>>
>
