Beginner questions on clustering & M/R

Florent Empis Thu, 15 Jul 2010 07:53:08 -0700

Hi,

I want to learn more about clustering techniques. I have skimmed through
Programming Collective Intelligence and Mahout in Action in the past but I
don't have them on hand at the moment... :(
I've seen Isabel Drost's mail about test data on http://mldata.org/about/
I've had an idea of using http://mldata.org/repository/view/stockvalues/ for
a pet project.
My idea is as follows: can we see common behaviour between companies' stock
values?
I would expect to end up with clusters of banking sector shares, utilities
shares, media shares, etc... and maybe some more unexpected clusters, who knows?


My idea is basically:
1°) Transform the dataset from raw values to daily variations, expressed as a
percentage drop/rise (the data is then normalized)
2°) Apply clustering technique(s)
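
To make step 1 concrete, here is a minimal sketch in plain Python. The
function name, the stock names and the prices are all made up for
illustration; nothing here comes from the mldata file itself.

```python
def daily_pct_change(prices):
    """Turn a list of prices into period-over-period percentage changes."""
    changes = []
    for prev, curr in zip(prices, prices[1:]):
        # Assumes prev is non-zero; a zero price would need special handling.
        changes.append((curr - prev) / prev * 100.0)
    return changes

# Made-up prices for two hypothetical stocks (not taken from the dataset).
prices = {
    "STOCK_A": [100.0, 110.0, 99.0],
    "STOCK_B": [20.0, 22.0, 19.8],
}
variations = {name: daily_pct_change(series) for name, series in prices.items()}
```

Both made-up series rise by about 10% and then fall by about 10%, so after
the transformation they look alike even though their price levels differ.
That is the sense in which I call the data normalized.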

The issue may seem silly but, as I understand it, clustering happens in a
space of 2 (or more) dimensions.
I know I have 2 dimensions: variation and time, but I can't wrap my head
around the problem...

I *think* that the K-Means example does exactly what I intend to do in my
second step, is this correct?
However, I can't grasp what the 2-dimensional display represents exactly: what
are the x and y axes?
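
One possible framing, which is only an assumption on my part and not a
description of the K-Means example: treat each stock as a single point whose
coordinates are its successive daily variations, so n periods give n-1
dimensions. A minimal k-means sketch under that assumption, with made-up
stock names and numbers and a deliberately naive initialisation:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def kmeans(points, k, iterations=10):
    # Naive initialisation (an assumption): the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment, centroids

# Made-up daily variations (in %) over 3 days for 4 hypothetical stocks.
stocks = ["BANK_1", "UTIL_1", "BANK_2", "UTIL_2"]
points = [
    [2.0, -1.5, 3.0],
    [0.1, 0.2, -0.1],
    [1.8, -1.2, 2.7],
    [0.0, 0.3, 0.1],
]
assignment, centroids = kmeans(points, k=2)

# Group stock names by cluster label.
clusters = {}
for name, label in zip(stocks, assignment):
    clusters.setdefault(label, []).append(name)
```

With this framing there is no single x or y axis: each day is its own axis,
and a 2-dimensional plot would only be possible for two days of data or
after some projection.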

Additional question: I am fairly new to the M/R paradigm, but let's say I would
like to do step 1 (data normalization) in an M/R fashion. Would the following
be a good idea?
My data is a matrix of k stock values S over n intervals of time.
For the first stock in the file, I write its values at the first and second
periods as:
S1,t & S1,t+1 ...

Map Step: input: ((S1,t ... S1,t+n),... ,(Sk,t ... Sk,t+n) )
output (( (S1,t;S1,t+1),...,(S1,t+n-1;S1,t+n)), ... ,(
(Sk,t;Sk,t+1),...,(Sk,t+n-1;Sk,t+n)) )
Reduce Step:
( (%S1,t+1.....%S1,t+n), ...,(%Sk,t+1.....%Sk,t+n))
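
To check my own understanding, here is the same map and reduce idea simulated
in plain Python, without Hadoop. The function names, the stock ids and the
prices are hypothetical, and the grouping loop is only a stand-in for what
the framework's shuffle phase would do.

```python
from collections import defaultdict

def map_step(stock_id, prices):
    """Emit (stock_id, (t, previous, current)) for each consecutive pair."""
    for t in range(1, len(prices)):
        yield stock_id, (t, prices[t - 1], prices[t])

def reduce_step(stock_id, pairs):
    """Turn the pairs of one stock into percentage variations, ordered by t."""
    ordered = sorted(pairs)  # tuples sort on the period index t first
    # Assumes no previous price is zero.
    return stock_id, [(curr - prev) / prev * 100.0 for _, prev, curr in ordered]

def run_job(rows):
    # Stand-in for the shuffle phase: group the mapper output by key.
    grouped = defaultdict(list)
    for stock_id, prices in rows:
        for key, value in map_step(stock_id, prices):
            grouped[key].append(value)
    return dict(reduce_step(key, values) for key, values in grouped.items())

# Made-up input: one row per stock, one price per period.
rows = [("S1", [100.0, 110.0, 99.0]), ("S2", [50.0, 50.0, 55.0])]
result = run_job(rows)
```

Since each percentage depends only on two neighbouring values of the same
stock, the division could also be done directly in the map step, in which
case the reduce step would only collect the results per stock.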

I apologize for my beginner's questions but.... everyone has to start
somewhere :-)

BR,

Florent Empis
