[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719884#action_12719884 ]
Ted Dunning commented on MAHOUT-65:
-----------------------------------

bq. For text analysis, we are seeing sparse term vectors with about 50k term cardinality, and only about 1k term size. If such a sparse vector is serialized with Gson, each instance will currently (in 65c anyway) include the 50k element bindings map that names each of the vector indices. This is not, IMHO, the kind of semantics we want. Nor do I think we want the input vectors to contain this redundant overhead.

Generally, these vectors are part of a matrix, which should have this mapping available at the matrix level rather than at the vector (row) level. If you do that, the cost of storing the label map is amortized over many rows and becomes irrelevant.

The reason that it is very important to store this map at the matrix level is so that matrix multiplications can be made efficient. If I am multiplying a matrix with column labels by a vector with element labels, I want the iteration to proceed by multiplying elements with the same label. This can be done by putting a permutation between the two operands, by remapping one of them to use the other's label map, by sharing a label map across all elements of interest, or by iterating over labels instead of indexes.

My general preference is to have the code magically notice if the label map is shared (so that iteration over indexes is safe) and to iterate over labels if not.

> Add Element Labels to Vectors and Matrices
> ------------------------------------------
>
> Key: MAHOUT-65
> URL: https://issues.apache.org/jira/browse/MAHOUT-65
> Project: Mahout
> Issue Type: New Feature
> Components: Matrix
> Affects Versions: 0.1
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Attachments: MAHOUT-65.patch, MAHOUT-65b.patch, MAHOUT-65c.patch
>
>
> Many applications can benefit by accessing elements in vectors and matrices
> using String labels in addition to numeric indices. Investigate adding such a
> capability.
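To make the preference stated in the comment concrete (iterate by index when the label map is shared, fall back to iterating by label otherwise), here is a rough sketch. It is not Mahout code and is not taken from any of the attached patches: the class LabeledDot, the nested LabeledVector type, and the dot() method are all hypothetical names, and "shared" is assumed to mean that both vectors hold the very same map instance.

```java
import java.util.HashMap;
import java.util.Map;

public class LabeledDot {

    /** Hypothetical sparse vector: non-zero values keyed by index, plus a label -> index map. */
    static final class LabeledVector {
        // Label map; in the scheme described above this would be owned by the matrix
        // and shared by all of its rows rather than copied into each vector.
        final Map<String, Integer> labels;
        // Sparse storage: only non-zero elements are present.
        final Map<Integer, Double> values = new HashMap<>();

        LabeledVector(Map<String, Integer> labels) {
            this.labels = labels;
        }

        LabeledVector set(String label, double value) {
            Integer index = labels.get(label);
            if (index == null) {
                throw new IllegalArgumentException("unknown label: " + label);
            }
            values.put(index, value);
            return this;
        }

        double get(String label) {
            Integer index = labels.get(label);
            return index == null ? 0.0 : values.getOrDefault(index, 0.0);
        }
    }

    /** Dot product that pairs up elements carrying the same label. */
    static double dot(LabeledVector a, LabeledVector b) {
        double sum = 0.0;
        if (a.labels == b.labels) {
            // Same map instance: equal indices imply equal labels, so index iteration is safe.
            for (Map.Entry<Integer, Double> e : a.values.entrySet()) {
                sum += e.getValue() * b.values.getOrDefault(e.getKey(), 0.0);
            }
        } else {
            // Different maps: indices may disagree, so match elements by label instead.
            for (Map.Entry<String, Integer> e : a.labels.entrySet()) {
                double av = a.values.getOrDefault(e.getValue(), 0.0);
                if (av != 0.0) {
                    sum += av * b.get(e.getKey());
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> shared = new HashMap<>();
        shared.put("apple", 0);
        shared.put("pear", 1);

        // Same labels, opposite index assignment.
        Map<String, Integer> other = new HashMap<>();
        other.put("pear", 0);
        other.put("apple", 1);

        LabeledVector x = new LabeledVector(shared).set("apple", 2.0).set("pear", 3.0);
        LabeledVector y = new LabeledVector(shared).set("apple", 4.0).set("pear", 1.0);
        LabeledVector z = new LabeledVector(other).set("apple", 4.0).set("pear", 1.0);

        System.out.println(dot(x, y)); // shared map, index path
        System.out.println(dot(x, z)); // separate maps, label path
    }
}
```

Both calls give the same result because y and z hold the same labelled values; pairing x and z by raw index would instead multiply apple by pear. In this sketch the label path walks the whole label map of the first operand, which is the overhead that sharing one map at the matrix level avoids.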
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.