[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719863#action_12719863 ]
Jeff Eastman commented on MAHOUT-65:
------------------------------------

Here's an issue that needs some further discussion:

For text analysis, we are seeing sparse term vectors with about 50k term cardinality, and only about 1k term size. If such a sparse vector is serialized with Gson, each instance will currently (in 65c anyway) include the 50k element bindings map that names each of the vector indices. This is not, IMHO, the kind of semantics we want. Nor do I think we want the input vectors to contain this redundant overhead.

I would propose to make the bindings map be transient, so that Gson will not output the binding maps when it serializes Vectors. Since the map is sharable - the API allows it to be set in one method, presumably processors of such term vectors would be ok with managing a shared instance independently from the vector points themselves.

I also can imagine allowing a binding map to be named as an optional argument to e.g. Canopy clustering which, if supplied, would then be associated with each input point by the code as it is read in. But, Canopy does not need the map and its values would not be output either. So why bother?

Associating a binding map - or, in general, any similar meta-information - with each vector instance only makes sense to me if we have a system-wide policy for managing it efficiently.

> Add Element Labels to Vectors and Matrices
> ------------------------------------------
>
>                 Key: MAHOUT-65
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-65
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Matrix
>   Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-65.patch, MAHOUT-65b.patch, MAHOUT-65c.patch
>
>
> Many applications can benefit by accessing elements in vectors and matrices
> using String labels in addition to numeric indices. Investigate adding such a
> capability.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
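
For reference, the transient-bindings proposal in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: `LabeledSparseVector` and `TransientBindingsSketch` are hypothetical names, not Mahout's Vector API, and the sketch uses standard Java serialization rather than Gson so that it runs without extra dependencies. Gson, by default, likewise skips fields marked `transient`, so the effect on the output should be comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a labeled sparse vector; not Mahout's Vector API.
class LabeledSparseVector implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int cardinality;
    // Only the non-zero elements are stored (index -> value).
    private final HashMap<Integer, Double> values = new HashMap<>();
    // label -> index. Marked transient, so it is not written with the vector.
    private transient Map<String, Integer> bindings;

    LabeledSparseVector(int cardinality) { this.cardinality = cardinality; }

    void setBindings(Map<String, Integer> bindings) { this.bindings = bindings; }
    Map<String, Integer> getBindings() { return bindings; }
    int cardinality() { return cardinality; }

    void set(int index, double value) { values.put(index, value); }

    // Look up an element by label through the (shared) bindings map.
    double get(String label) {
        if (bindings == null) throw new IllegalStateException("no bindings attached");
        Integer index = bindings.get(label);
        if (index == null) throw new IllegalArgumentException("unknown label: " + label);
        return values.getOrDefault(index, 0.0);
    }
}

class TransientBindingsSketch {
    // Serialize the vector; the transient bindings map is left out.
    static byte[] serialize(LabeledSparseVector vector) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(vector);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a vector back and attach the caller-managed shared bindings (may be null).
    static LabeledSparseVector deserialize(byte[] data, Map<String, Integer> sharedBindings) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            LabeledSparseVector vector = (LabeledSparseVector) in.readObject();
            vector.setBindings(sharedBindings);
            return vector;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> shared = new HashMap<>();
        shared.put("mahout", 7);

        LabeledSparseVector v = new LabeledSparseVector(50_000);
        v.setBindings(shared);
        v.set(7, 2.5);

        byte[] data = serialize(v);
        LabeledSparseVector bare = deserialize(data, null);
        System.out.println("bindings after read: " + bare.getBindings());

        LabeledSparseVector copy = deserialize(data, shared);
        System.out.println("mahout -> " + copy.get("mahout"));
    }
}
```

In this sketch the processor owns the single shared map and re-attaches it as each vector is read in, which is the arrangement the comment suggests for consumers that actually need the labels.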