[ 
https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719884#action_12719884
 ] 

Ted Dunning commented on MAHOUT-65:
-----------------------------------

bq. For text analysis, we are seeing sparse term vectors with about 50k term 
cardinality, and only about 1k term size. If such a sparse vector is serialized 
with Gson, each instance will currently (in 65c anyway) include the 50k element 
bindings map that names each of the vector indices. This is not, IMHO, the kind 
of semantics we want. Nor do I think we want the input vectors to contain this 
redundant overhead.

Generally, these vectors are part of a matrix which should have this mapping 
available at the matrix level rather than at the vector (row) level.  If you do 
that, the cost of storing the label map is amortized over many rows and it 
becomes irrelevant.

The reason that it is very important to store this map at the matrix level is 
so that matrix multiplications can be made efficient.  If I am multiplying a 
matrix with column labels by a vector with element labels, I want the iteration 
to proceed by multiplying elements with the same label.  This can be done by 
putting a permutation between the two operands, by remapping one of them to 
use the other's label map, by sharing a label map across all elements of 
interest, or by iterating over labels instead of indexes.
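
To make the last option concrete, here is a minimal sketch of a dot product that iterates over labels rather than indexes. It is not taken from any of the MAHOUT-65 patches; the class name LabelDotSketch, the method dotByLabel, and the use of a plain Map<String, Integer> as the label map are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout API: each operand is a dense array plus
// its own label-to-index map.
class LabelDotSketch {

    // Multiplies elements that carry the same label, whatever index each
    // operand stores them at. Labels missing from either side contribute 0.
    static double dotByLabel(double[] a, Map<String, Integer> aLabels,
                             double[] b, Map<String, Integer> bLabels) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : aLabels.entrySet()) {
            Integer j = bLabels.get(e.getKey());
            if (j != null) {
                sum += a[e.getValue()] * b[j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> rowLabels = new HashMap<>();
        rowLabels.put("x", 0);
        rowLabels.put("y", 1);

        // Same labels, stored in a different order, plus one extra label.
        Map<String, Integer> vecLabels = new HashMap<>();
        vecLabels.put("y", 0);
        vecLabels.put("x", 1);
        vecLabels.put("z", 2);

        double[] row = {2.0, 3.0};
        double[] vec = {5.0, 7.0, 11.0};
        System.out.println(dotByLabel(row, rowLabels, vec, vecLabels));
    }
}
```

The map lookup per element is the price of not sharing a label map, which is why the other three options are worth having.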

My general preference is to have the code magically notice if the label map is 
shared (so that iteration over index is safe) and to iterate over labels if not.
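
A rough sketch of what that check might look like follows. It assumes "shared" means the two operands hold the very same map object, which is a cheap test but a conservative one: two separate maps with identical contents would take the slower path. The class SharedMapSketch and its method names are hypothetical and do not come from the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout API.
class SharedMapSketch {

    static double dot(double[] a, Map<String, Integer> aLabels,
                      double[] b, Map<String, Integer> bLabels) {
        if (aLabels == bLabels) {
            // Same map instance: index i names the same label on both
            // sides, so plain index iteration is safe.
            double sum = 0.0;
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                sum += a[i] * b[i];
            }
            return sum;
        }
        // Different maps: fall back to matching elements by label.
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : aLabels.entrySet()) {
            Integer j = bLabels.get(e.getKey());
            if (j != null) {
                sum += a[e.getValue()] * b[j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> shared = new HashMap<>();
        shared.put("x", 0);
        shared.put("y", 1);

        Map<String, Integer> swapped = new HashMap<>();
        swapped.put("y", 0);
        swapped.put("x", 1);

        double[] a = {1.0, 2.0};
        double[] b = {3.0, 4.0};
        System.out.println(dot(a, shared, b, shared));   // index path
        System.out.println(dot(a, shared, b, swapped));  // label path
    }
}
```

With the map held at the matrix level, every row of one matrix passes the identity check against every other row, so the fast path is the common case.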



> Add Element Labels to Vectors and Matrices
> ------------------------------------------
>
>                 Key: MAHOUT-65
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-65
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Matrix
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>         Attachments: MAHOUT-65.patch, MAHOUT-65b.patch, MAHOUT-65c.patch
>
>
> Many applications can benefit by accessing elements in vectors and matrices 
> using String labels in addition to numeric indices. Investigate adding such a 
> capability.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
