Re: Reg: Mahout - Data Input

Ted Dunning Mon, 22 Mar 2010 15:05:05 -0700

Keep in mind, however, that your data are not really vectors in the normal
sense.  What you have are time series.

To cluster these effectively, you need to extract some kind of features that
allow similar students to be recognized as similar.  To do that, you need to
decide what similar means for your application.

One simple thought is to simply subtract the unit level from week to week.
Your example of 1,1,2,4,4,5 would be encoded as 0,1,2,0,1 (note that there
is one less difference than there are weeks).  This would let you group
together students who make strong (or weak) improvements in the same weeks.
You could also add features representing 2-, 3- and 4-week improvements so
that you can group together students with similar average performance.
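
As a rough sketch of that encoding (in Python purely for illustration; the
function names lagged_differences and improvement_features are made up here
and are not part of Mahout or any other library), the week-to-week and
multi-week differences could be computed along these lines:

```python
def lagged_differences(levels, lag=1):
    """Return levels[i + lag] - levels[i] for every valid i.

    With lag=1 this is the plain week-to-week improvement.
    """
    return [later - earlier for earlier, later in zip(levels, levels[lag:])]


def improvement_features(levels, lags=(1, 2, 3, 4)):
    """Concatenate the 1-, 2-, 3- and 4-week improvements into one list.

    The choice of lags is an assumption; use whatever matches your
    notion of "similar" students.
    """
    features = []
    for lag in lags:
        features.extend(lagged_differences(levels, lag))
    return features


weekly_levels = [1, 1, 2, 4, 4, 5]
print(lagged_differences(weekly_levels))     # [0, 1, 2, 0, 1]
print(lagged_differences(weekly_levels, 2))  # [1, 3, 2, 1]
print(improvement_features(weekly_levels))
```

This assumes every student has the same number of weeks, so that the
resulting feature lists all have the same length and can be compared
position by position.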

Extracting features is probably going to be by far the hardest step in
analyzing your data.

On another topic, your data is likely to be relatively small in terms of the
number of students being analyzed.  It seems unlikely that you will ever
have data for a million students.  As such, you probably can use many
conventional statistical packages quite effectively, and possibly much more
easily than Mahout.  Mahout is intended to help with scaling data analysis
to a very large level and does not necessarily make analyzing small datasets
easier than conventional software.

I would suggest you take a look at R.

On Mon, Mar 22, 2010 at 10:05 AM, Isabel Drost <[email protected]> wrote:

> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>
> Probably most interesting for you is the small chapter on "Converting
> existing vectors to Mahout's format"
