@Isabel: Thank you for responding and sharing the documentation.

@Ted: Thank you so much for taking the time to look into my problem. I looked at the R package, and it definitely seems useful for our dataset, which is relatively small.
Still, I would love to try Mahout, as it has a lot of potential for solving many other problems on large data sets. I have been setting up a small Hadoop cluster at our university and am now playing around with various real-world problems.

Coming back to your suggestion: computing the week-to-week progress and comparing it across students is a really good idea. Thank you for suggesting it. To implement that with Mahout, can I use a DenseMatrix to store our data and write it to a sequence file as described at http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html ?

Once that is done, how can I group the students based on their improvement in corresponding weeks? Do I need to write my own classes that use the existing algorithms, or can I use the classes in the SyntheticControl package, since the data is a time series?

Thanks,
Mahi

PS: Please bear with my questions, as they may be trivial for most of you. I am just a beginner and a student who is looking forward to working with Hadoop and its sub-projects.

On Mon, Mar 22, 2010 at 6:04 PM, Ted Dunning <[email protected]> wrote:

> Keep in mind, however, that your data are not really vectors in the normal
> sense. What you have are time series.
>
> To cluster these effectively, you need to extract some kind of features that
> allow similar students to be recognized as similar. To do that, you need to
> decide what similar means for your application.
>
> One simple thought is to simply subtract the unit level from week to week.
> Your example of 1,1,2,4,4,5 would be encoded as 0,1,2,0,1 (note that there
> is one less difference than there are weeks). This would let you group
> together students who make strong (or weak) improvements in the same weeks.
> You could also add features representing 2, 3 and 4 week improvements so
> that you can group together students with similar average performance.
>
> Extracting features is probably going to be much the hardest step in
> analyzing your data.
>
> On another topic, your data is likely to be relatively small in terms of the
> number of students being analyzed. It seems unlikely that you will ever
> have data for a million students. As such, you probably can use many
> conventional statistical packages quite effectively, and possibly much more
> easily than Mahout. Mahout is intended to help with scaling data analysis
> to a very large level and does not necessarily make analyzing small datasets
> easier than conventional software.
>
> I would suggest you take a look at R.
>
> On Mon, Mar 22, 2010 at 10:05 AM, Isabel Drost <[email protected]> wrote:
>
> > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> >
> > Probably most interesting for you is the small chapter on "Converting
> > existing vectors to Mahout's format"
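
For reference, the week-to-week difference encoding described in the quoted reply could be sketched roughly as follows. This is plain Java with no Mahout dependency, and only a sketch: the class and method names (`WeeklyFeatures`, `improvements`, `features`) are hypothetical and not part of Mahout, and the input is assumed to be one student's unit level per week as an `int[]`.

```java
// Hypothetical helper for illustration only; not part of Mahout.
final class WeeklyFeatures {

    private WeeklyFeatures() {}

    // Improvement over k weeks: levels[i + k] - levels[i] for every valid i.
    // With k = 1 this is the week-to-week difference, which has one fewer
    // entry than there are weeks.
    static double[] improvements(int[] levels, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        int n = Math.max(0, levels.length - k);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = levels[i + k] - levels[i];
        }
        return out;
    }

    // Concatenates the 1-week, 2-week, ..., maxK-week improvements
    // into a single feature array for one student.
    static double[] features(int[] levels, int maxK) {
        if (maxK < 1) {
            throw new IllegalArgumentException("maxK must be >= 1");
        }
        double[][] parts = new double[maxK][];
        int total = 0;
        for (int k = 1; k <= maxK; k++) {
            parts[k - 1] = improvements(levels, k);
            total += parts[k - 1].length;
        }
        double[] out = new double[total];
        int pos = 0;
        for (double[] part : parts) {
            System.arraycopy(part, 0, out, pos, part.length);
            pos += part.length;
        }
        return out;
    }
}
```

For the example levels 1,1,2,4,4,5, `improvements(levels, 1)` gives 0,1,2,0,1, matching the encoding in the quoted reply. The resulting `double[]` per student is presumably what would then be converted to a Mahout vector and written to a sequence file, though that step is not shown here.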
