Keep in mind, however, that your data are not really vectors in the normal sense. What you have are time series.
To cluster these effectively, you need to extract features that allow similar students to be recognized as similar. To do that, you need to decide what "similar" means for your application. One simple idea is to subtract each week's unit level from the following week's. Your example of 1,1,2,4,4,5 would be encoded as 0,1,2,0,1 (note that there is one fewer difference than there are weeks). This would let you group together students who make strong (or weak) improvements in the same weeks. You could also add features representing 2-, 3- and 4-week improvements so that you can group together students with similar average performance. Extracting features is probably going to be by far the hardest step in analyzing your data.

On another topic, your data are likely to be relatively small in terms of the number of students being analyzed. It seems unlikely that you will ever have data for a million students. As such, you can probably use many conventional statistical packages quite effectively, and possibly much more easily than Mahout. Mahout is intended to help with scaling data analysis to a very large level and does not necessarily make analyzing small datasets easier than conventional software. I would suggest you take a look at R.

On Mon, Mar 22, 2010 at 10:05 AM, Isabel Drost <[email protected]> wrote:

> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
>
> Probably most interesting for you is the small chapter on "Converting
> existing vectors to Mahout's format"
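
P.S. To make the week-to-week differencing idea concrete, here is a rough sketch in Python. Python is only my choice for illustration (the same thing is a few lines in R), and the function names and the default spans of 1 to 4 weeks are made up for this example, not part of Mahout or any other library. It assumes each student is represented as a plain list of weekly unit levels.

```python
def k_week_improvements(levels, k):
    """Improvement over every window of k weeks: levels[i + k] - levels[i]."""
    return [levels[i + k] - levels[i] for i in range(len(levels) - k)]


def weekly_differences(levels):
    """Week-to-week differences; one fewer value than there are weeks."""
    return k_week_improvements(levels, 1)


def extract_features(levels, spans=(1, 2, 3, 4)):
    """Concatenate the 1-, 2-, 3- and 4-week improvements into one feature list.

    The choice of spans is an assumption for illustration; pick whatever
    matches your notion of "similar" students.
    """
    features = []
    for k in spans:
        features.extend(k_week_improvements(levels, k))
    return features


if __name__ == "__main__":
    example = [1, 1, 2, 4, 4, 5]
    print(weekly_differences(example))      # [0, 1, 2, 0, 1]
    print(k_week_improvements(example, 2))  # [1, 3, 2, 1]
    print(extract_features(example))
```

Students with the same number of weeks end up with feature lists of the same length, which is what most clustering routines expect as input.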
