Hello All,

I am looking for an efficient way to represent vectors that exist in an 
infinite-dimensional space.  Specifically, I am working with large amounts 
of text data and will be receiving a lot of data that contains previously 
unseen words.  Each text represents a vector in the space of all possible 
strings, and each distinct word in the text represents a dimension.  As 
such, these vectors are extremely sparse.  Currently we handle this by using 
a dictionary to represent each text as a bag-of-words 
<http://en.wikipedia.org/wiki/Bag-of-words_model> vector.  If a word does 
not exist in the vector, we return zero.  This allows us to perform 
computations like so:

["the"=>3,"and"=>2,"is"=>4] + ["this"=>5,"was"=>1,"where"=>6] = 
["where"=>6,"the"=>3,"is"=>4,"this"=>5,"was"=>1,"and"=>2]

euclidean(["the"=>3,"and"=>2,"is"=>4], ["this"=>5,"was"=>1,"where"=>6]) = 
9.539392014169456
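
To make the two operations above concrete, here is a minimal sketch in 
Python, assuming plain dicts stand in for the sparse vectors.  The function 
names add and euclidean are illustrative only; this is not our actual 
implementation, just the behaviour we rely on (a missing word counts as zero).

```python
import math


def add(a, b):
    """Sum two sparse word-count vectors stored as dicts."""
    total = dict(a)  # copy so the inputs are left untouched
    for word, count in b.items():
        # A word missing from `total` is treated as having count zero.
        total[word] = total.get(word, 0) + count
    return total


def euclidean(a, b):
    """Euclidean distance between two sparse word-count vectors."""
    # Only words present in at least one vector can contribute.
    words = a.keys() | b.keys()
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))


x = {"the": 3, "and": 2, "is": 4}
y = {"this": 5, "was": 1, "where": 6}

print(add(x, y))
print(euclidean(x, y))
```

Since the two example texts share no words, the distance is simply the 
square root of the sum of all squared counts, sqrt(91).  Both operations 
touch only the words that are actually stored, so the cost depends on the 
size of the texts and not on the size of the vocabulary.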

Is a dictionary the proper associative structure, or should we use a 
different data structure such as a Judy array or a trie?

-MT

