Re: mahout examples

prasenjit mukherjee Sun, 29 Nov 2009 23:07:35 -0800

Thanks for sharing the article. The article focuses mainly on
distance computation between sequences, which will help us in creating
the self-similarity matrix. You can then probably apply any
standard self-similarity based clustering technique (spectral
clustering, k-means, etc.).
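
To make the self-similarity matrix concrete, here is a minimal sketch in
Python. It assumes each sequence has already been reduced to a
fixed-length feature vector (how that is done is left to the article's
method and is not shown). The function names and sample values are
made up for illustration, and the entries are Euclidean distances, so
strictly speaking this is a dissimilarity matrix.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def self_similarity_matrix(vectors):
    """Return the symmetric n x n matrix of pairwise distances."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the upper triangle is computed; the matrix is symmetric.
        for j in range(i + 1, n):
            d = euclidean(vectors[i], vectors[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix

# Hypothetical feature vectors, one per sequence (made-up values).
features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = self_similarity_matrix(features)
```

The resulting matrix could then be handed to whichever clustering
technique works on pairwise distances.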


The approach sounds okay, except that clustering on the self-similarity
matrix requires the full n x n matrix to be computed, which could itself
be pretty huge. But as long as you can distribute the computation over
mapreduce/mahout (which you apparently can), it should be fine.
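
One way the n x n computation might be split up is by blocks of rows,
with each block handled as an independent mapper-style task. The sketch
below is plain Python run sequentially, not Mahout or Hadoop code; the
function names and block size are hypothetical, and it assumes every
task can read all of the feature vectors.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def row_blocks(n, block_size):
    """Yield (start, end) row ranges that together cover rows 0..n-1."""
    for start in range(0, n, block_size):
        yield start, min(start + block_size, n)

def map_block(vectors, start, end):
    """Mapper-style task: emit (i, j, distance) for rows start..end-1, j > i."""
    out = []
    for i in range(start, end):
        for j in range(i + 1, len(vectors)):
            out.append((i, j, euclidean(vectors[i], vectors[j])))
    return out

def pairwise_distances(vectors, block_size):
    """Run every block task (sequentially here) and merge the emitted entries."""
    entries = {}
    for start, end in row_blocks(len(vectors), block_size):
        for i, j, d in map_block(vectors, start, end):
            entries[(i, j)] = d
    return entries

# Hypothetical feature vectors (made-up values).
vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 8.0]]
entries = pairwise_distances(vectors, block_size=2)
```

Only the n(n-1)/2 upper-triangle entries are emitted, and the blocks
share no state, so in principle each could run on a different node.
Earlier row blocks carry more pairs than later ones, so a real job
would probably want to balance the blocks.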

-Prasen

On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <[email protected]> wrote:
> Prasen,
> I've been reviewing techniques and literature on data mining in time
> series and I found another paper that you might be interested in from
> the time series "search" domain that deals with similarity of time
> series data:
>
> http://delab.csd.auth.gr/papers/PCI99am.pdf
>
> Sequences are transformed into a feature vector and Euclidean distances
> between the feature vectors are then calculated. I'm still getting this
> concept (plus other variations) and mahout in general "mapped out". I
> read some on suffix trees and they look very similar to k-grams and
> permutation indexes in Information Retrieval material. I'm still
> digesting this time series problem (and its several sub problems) but I
> thought I'd throw that paper out there and see what you thought.
>
> Josh Patterson
> TVA
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> prasenjit mukherjee
> Sent: Saturday, November 21, 2009 12:21 AM
> To: [email protected]
> Subject: Re: mahout examples
>
> Hi Josh,
>
> I too am working on clustering time-series data, and am basically trying
> to come up with a sequence clustering model. I would like to know how
> you intend to use k-means to achieve that. Are you treating each
> sequence as a point? If so, what would your vector representation
> of a sequence be and, more importantly, which metric (distance
> computation logic) will you be using?
>
> BTW, I am thinking along the lines of STC (suffix-tree based
> clustering).
>
> -Prasen
>
> On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> wrote:
>> I think in terms of clustering time series data, the first step looks
>> to be vectorizing the input cases with possibly the DenseVector class
>> and feeding that to a basic KMeans implementation like
>> KMeansDriver.java. Once we can get the basic kmeans rolling with some
>> known dataset we'll be able to iterate on that and move towards using
>> more complex techniques and other grid timeseries data. Any
>> suggestions or discussion is greatly appreciated,
>>
>> Josh Patterson
>> TVA
>
