It's hard to say which underlying data structure you want here, but it may well help to keep the keys lexicographically ordered; you could try a SortedDict, as provided by the DataStructures package <https://github.com/JuliaLang/DataStructures.jl>. You may also want to define a WordVector type that wraps an arbitrary Associative{UTF8String,Int} structure, returns zero for any missing key, and has methods for the usual vector operations and norms. The TextAnalysis package <https://github.com/johnmyleswhite/TextAnalysis.jl> is also worth a look.
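To make the WordVector idea concrete, here is a minimal sketch, not a definitive implementation. It is written against current Julia, where Associative is called AbstractDict and UTF8String is plain String. The type layout, the convenience constructor and the euclidean function are all my own assumptions for illustration; none of them come from DataStructures or TextAnalysis. It only uses Base, but a SortedDict{String,Int} should be able to go in the counts field as well.

```julia
# Hypothetical WordVector: wraps any AbstractDict{String,Int} and treats
# words that are not stored as having a count of zero.
struct WordVector{D<:AbstractDict{String,Int}}
    counts::D
end

# Convenience constructor (an assumption): build from "word" => count pairs.
WordVector(pairs::Pair{String,Int}...) = WordVector(Dict{String,Int}(pairs...))

# Indexing never throws on an unseen word; it returns zero instead.
Base.getindex(v::WordVector, word::AbstractString) = get(v.counts, word, 0)

# Vector addition: sum the counts word by word.
function Base.:+(a::WordVector, b::WordVector)
    out = Dict{String,Int}(a.counts)
    for (word, n) in b.counts
        out[word] = get(out, word, 0) + n
    end
    return WordVector(out)
end

# Euclidean distance over the union of the words stored in either vector.
function euclidean(a::WordVector, b::WordVector)
    total = 0
    for word in union(Set(keys(a.counts)), keys(b.counts))
        total += (a[word] - b[word])^2
    end
    return sqrt(total)
end

# The two texts from the original message.
a = WordVector("the" => 3, "and" => 2, "is" => 4)
b = WordVector("this" => 5, "was" => 1, "where" => 6)
c = a + b
println(c["where"], " ", c["unseen"])
println(euclidean(a, b))
```

Keeping the zero default inside getindex means every other operation can be written as if the vectors were dense, and only the words actually stored are ever visited.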
On Mon, Apr 13, 2015 at 4:41 PM, Mark Tabor <mtab...@slu.edu> wrote:
> Hello All,
>
> I am looking for an efficient way to represent vectors that exist in an
> infinite dimensional space. Specifically I am working with large amounts
> of text data and will be receiving a lot of data that contains previously
> unseen words. Each text represents a vector that exists in the space of
> all possible strings and each word in the text represents a dimension. As
> such these vectors are extremely sparse. Currently we handle this by using
> a dictionary to represent each text as a bag of words
> <http://en.wikipedia.org/wiki/Bag-of-words_model> vector. If a word does
> not exist in the vector we return zero. This allows us to perform
> computations like so:
>
> ["the"=>3,"and"=>2,"is"=>4] + ["this"=>5,"was"=>1,"where"=>6] =
> ["where"=>6,"the"=>3,"is"=>4,"this"=>5,"was"=>1,"and"=>2]
>
> euclidean(["the"=>3,"and"=>2,"is"=>4], ["this"=>5,"was"=>1,"where"=>6]) =
> 9.539392014169456
>
> Is a dictionary the proper associative structure, or should we use a
> different data structure like a JudyArray or Trie?
>
> -MT