I just posted another MAHOUT-228 patch that contains a working and almost useful version of SGD learning. See https://issues.apache.org/jira/browse/MAHOUT-228 for details and the patch.
As part of this effort, I have included my latest ideas on how to vectorize composite documents that combine numerical, textual and categorical information, especially where the textual and categorical data have an unbounded vocabulary. An important improvement in the current code is that as vectors are created, a trace is kept that allows the resulting vectors to be reverse engineered.

The basic idea is that values are inserted one or more times into a vector at hashed locations that depend on the name of the variable (for numeric values) or on the name of the variable together with the word being inserted (for textual and categorical data). More than one probe is used for textual and categorical data to ameliorate the problem of collisions; I am undecided on the virtues of multiple probes for numerical variables. Each update to the vector leaves a trace in a dictionary so that an arbitrary vector can be mapped back to the original data reasonably well.

Currently, the code is highly CSV-centric since that is the data I am working with first. The place to see this system in action is examples/src/...classifier/sgd/TrainLogistic and the key class is CsvRecordFactory. A sample command line is on the JIRA ticket.

I would love comments and critiques on how this might fit into other areas of our systems. Drew, I would especially like to hear what you think about how this relates to the Avro document work you did. Robin, Jeff, I am curious what you think of these ideas relative to k-means and other clustering techniques. Jake, I am very interested in what you think of this kind of technique relative to your needs for large SVDs. It relates closely, of course, to random projections (it is essentially a random projection).
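To make the idea concrete, here is a minimal sketch of hashed encoding with multiple probes and a trace dictionary. This is illustrative only: the class and method names (HashedEncoder, addNumeric, addText, explain) are invented for this sketch and are not the API in the patch, and the seeded hash here stands in for whatever hash the real code uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: hashed feature encoding with multiple probes and a trace
// dictionary that lets a vector cell be mapped back to the inputs
// that touched it. All names here are hypothetical.
public class HashedEncoder {
    private final int dim;
    private final int probes;
    // trace: vector index -> inputs that contributed to that cell
    private final Map<Integer, List<String>> trace = new TreeMap<>();

    public HashedEncoder(int dim, int probes) {
        this.dim = dim;
        this.probes = probes;
    }

    private int hash(String key, int probe) {
        // simple seeded hash for illustration only
        int h = (key + "#" + probe).hashCode();
        return Math.floorMod(h, dim);
    }

    // Numeric variable: location depends on the variable name alone.
    public void addNumeric(String name, double value, double[] vector) {
        int i = hash(name, 0);
        vector[i] += value;
        trace.computeIfAbsent(i, k -> new ArrayList<>()).add(name + "=" + value);
    }

    // Textual/categorical: location depends on variable name plus the
    // word; several probes soften the effect of collisions.
    public void addText(String name, String word, double[] vector) {
        for (int p = 0; p < probes; p++) {
            int i = hash(name + ":" + word, p);
            vector[i] += 1.0 / probes;
            trace.computeIfAbsent(i, k -> new ArrayList<>()).add(name + ":" + word);
        }
    }

    // Reverse engineering: list the inputs recorded for a vector cell.
    public List<String> explain(int index) {
        return trace.getOrDefault(index, new ArrayList<>());
    }
}
```

Splitting a unit weight across the probes keeps the total mass per token constant regardless of the probe count; the trace map is what makes an otherwise opaque hashed vector decodable back to the original fields.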
