I just posted another MAHOUT-228 patch that contains a working and almost useful version of SGD learning. See https://issues.apache.org/jira/browse/MAHOUT-228 for details and the patch.
As part of this effort, I have included my latest ideas on how to vectorize composite documents that combine numerical, textual and categorical information, especially where the textual and categorical data have an unbounded vocabulary. An important improvement in the current code is that as vectors are created, a trace is kept that allows the resulting vectors to be reverse engineered.

The basic idea is that values are inserted one or more times into a vector at hashed locations that depend on the name of the variable (for numeric values) or on the name of the variable together with the word being inserted (for textual and categorical data). More than one probe is used for textual and categorical data to ameliorate the problem of collisions; I am undecided on the virtues of multiple probes for numerical variables. Each update to the vector leaves a trace in a dictionary so that an arbitrary vector can be mapped back to the original data reasonably well.

Currently, the code is highly CSV-centric since that is the data I am working with first. The place to see this system in action is examples/src/...classifier/sgd/TrainLogistic and the key class is CsvRecordFactory. A sample command line is on the JIRA ticket.

I would love comments and critiques on how this might fit into other areas of our systems. Drew, I would especially like to hear what you think about how this relates to the Avro document work you did. Robin, Jeff, I am curious what you think of these ideas relative to k-means and other clustering techniques. Jake, I am very interested in what you think of this kind of technique relative to your needs for large SVDs. It relates closely, of course, to random projections (it is essentially a random projection).
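To make the idea concrete, here is a minimal sketch of hashed encoding with multiple probes and a trace dictionary. This is illustrative only: the class and method names (HashedEncoder, addNumeric, addText, explain) are invented for this sketch and are not the API in the patch, and the seeded hash here stands in for whatever hash the real code uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: hashed feature encoding with multiple probes and a trace
// dictionary that lets a vector cell be mapped back to the inputs
// that touched it. All names here are hypothetical.
public class HashedEncoder {
    private final int dim;
    private final int probes;
    // trace: vector index -> inputs that contributed to that cell
    private final Map<Integer, List<String>> trace = new TreeMap<>();

    public HashedEncoder(int dim, int probes) {
        this.dim = dim;
        this.probes = probes;
    }

    private int hash(String key, int probe) {
        // simple seeded hash for illustration only
        int h = (key + "#" + probe).hashCode();
        return Math.floorMod(h, dim);
    }

    // Numeric variable: location depends on the variable name alone.
    public void addNumeric(String name, double value, double[] vector) {
        int i = hash(name, 0);
        vector[i] += value;
        trace.computeIfAbsent(i, k -> new ArrayList<>()).add(name + "=" + value);
    }

    // Textual/categorical: location depends on variable name plus the
    // word; several probes soften the effect of collisions.
    public void addText(String name, String word, double[] vector) {
        for (int p = 0; p < probes; p++) {
            int i = hash(name + ":" + word, p);
            vector[i] += 1.0 / probes;
            trace.computeIfAbsent(i, k -> new ArrayList<>()).add(name + ":" + word);
        }
    }

    // Reverse engineering: list the inputs recorded for a vector cell.
    public List<String> explain(int index) {
        return trace.getOrDefault(index, new ArrayList<>());
    }
}
```

Splitting a unit weight across the probes keeps the total mass per token constant regardless of the probe count; the trace map is what makes an otherwise opaque hashed vector decodable back to the original fields.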
