[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719863#action_12719863
 ] 

Jeff Eastman commented on MAHOUT-65:
------------------------------------

Here's an issue that needs some further discussion:

For text analysis, we are seeing sparse term vectors with a cardinality of 
about 50k terms but a size of only about 1k terms actually present. If such a 
sparse vector is serialized with Gson, each instance will currently (in 
MAHOUT-65c.patch, anyway) include the 50k-element bindings map that names each 
of the vector indices. This is not, IMHO, the kind of semantics we want, nor do 
I think we want the input vectors to carry this redundant overhead.

I would propose making the bindings map transient, so that Gson will not 
output it when it serializes Vectors. Since the map is sharable (the API allows 
it to be set in one method), processors of such term vectors would presumably 
be OK with managing a shared instance independently of the vector points 
themselves. I can also imagine allowing a bindings map to be named as an 
optional argument to, e.g., Canopy clustering; if supplied, the code would 
associate it with each input point as it is read in. But Canopy does not need 
the map, and its values would not be output either. So why bother?
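
To make the proposal concrete, here is a minimal sketch of a vector whose 
bindings map is transient and is re-attached from a shared instance after 
reading. Everything in it is an assumption for illustration: the class 
LabelledSparseVector and its methods are hypothetical and are not the Vector 
API from the patches. It uses JDK object serialization rather than Gson so that 
it runs without extra dependencies; Gson's default configuration likewise 
excludes transient fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class TransientBindingsSketch {

    /** Hypothetical stand-in for a labelled sparse vector (not Mahout's API). */
    static class LabelledSparseVector implements Serializable {
        private static final long serialVersionUID = 1L;

        private final int cardinality;
        private final Map<Integer, Double> values = new HashMap<>();

        // transient: left out of the serialized form, so every vector can
        // point at one shared label map without writing it out each time.
        private transient Map<String, Integer> bindings;

        LabelledSparseVector(int cardinality) {
            this.cardinality = cardinality;
        }

        int cardinality() { return cardinality; }

        int size() { return values.size(); }

        void set(int index, double value) { values.put(index, value); }

        double get(int index) { return values.getOrDefault(index, 0.0); }

        // The whole map is set in one call, so callers can share one instance.
        void setBindings(Map<String, Integer> bindings) { this.bindings = bindings; }

        Map<String, Integer> getBindings() { return bindings; }

        double get(String label) {
            if (bindings == null || !bindings.containsKey(label)) {
                throw new IllegalArgumentException("No binding for label: " + label);
            }
            return get(bindings.get(label));
        }
    }

    static byte[] serialize(LabelledSparseVector v) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(v);
        }
        return bytes.toByteArray();
    }

    static LabelledSparseVector deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (LabelledSparseVector) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // One shared 50k-entry label map, managed apart from the vectors.
        Map<String, Integer> shared = new HashMap<>();
        for (int i = 0; i < 50_000; i++) {
            shared.put("term" + i, i);
        }

        // A sparse vector with 1k of the 50k positions set.
        LabelledSparseVector v = new LabelledSparseVector(50_000);
        for (int i = 0; i < 1_000; i++) {
            v.set(i * 50, 1.0);
        }
        v.setBindings(shared);

        LabelledSparseVector copy = deserialize(serialize(v));
        System.out.println("bindings survived serialization: " + (copy.getBindings() != null));

        // The reader re-associates the shared map with each point it reads.
        copy.setBindings(shared);
        System.out.println("value at term50: " + copy.get("term50"));
    }
}
```

With this shape, the cost of the labels is paid once per job rather than once 
per vector, and code that never looks up a label (Canopy, for instance) can 
skip the setBindings call entirely.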

Associating a bindings map (or, in general, any similar meta-information) with 
each vector instance only makes sense to me if we have a system-wide policy for 
managing it efficiently.


> Add Element Labels to Vectors and Matrices
> ------------------------------------------
>
>                 Key: MAHOUT-65
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-65
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Matrix
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-65.patch, MAHOUT-65b.patch, MAHOUT-65c.patch
>
>
> Many applications can benefit by accessing elements in vectors and matrices 
> using String labels in addition to numeric indices. Investigate adding such a 
> capability.
