It is hard to say which underlying data structure you want here, but it may
well help to keep the keys lexicographically ordered; you may want to try
the SortedDict provided by the DataStructures
<https://github.com/JuliaLang/DataStructures.jl> package. You may also want
to define a WordVector type that wraps an arbitrary
Associative{UTF8String,Int} structure, returns a default value of zero for
missing words, and has methods for the usual vector operations and norms.
The TextAnalysis <https://github.com/johnmyleswhite/TextAnalysis.jl> package
is also worth a look.
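
Here is a rough sketch of what such a WordVector wrapper might look like.
It assumes current Julia syntax (AbstractDict and String in place of
Associative and UTF8String). The type, its counts field, the pair-based
constructor, and the euclidean function are all illustrative names of my
own, not taken from any package:

```julia
# Hypothetical wrapper around any dictionary mapping words to counts.
struct WordVector{D<:AbstractDict{String,Int}}
    counts::D
end

# Convenience constructor from pairs, backed by a plain Dict (an assumption;
# any AbstractDict{String,Int}, such as a sorted one, could be wrapped instead).
WordVector(pairs::Pair{String,Int}...) = WordVector(Dict{String,Int}(pairs...))

# Words that are absent from the text have a count of zero.
Base.getindex(v::WordVector, word::AbstractString) = get(v.counts, word, 0)

# Elementwise sum: counts for words present in both texts are added.
function Base.:+(a::WordVector, b::WordVector)
    out = Dict{String,Int}(a.counts)
    for (word, n) in b.counts
        out[word] = get(out, word, 0) + n
    end
    return WordVector(out)
end

# Euclidean distance taken over the union of the words in both texts.
function euclidean(a::WordVector, b::WordVector)
    words = union(Set(keys(a.counts)), keys(b.counts))
    total = 0.0
    for word in words
        total += (a[word] - b[word])^2
    end
    return sqrt(total)
end

a = WordVector("the" => 3, "and" => 2, "is" => 4)
b = WordVector("this" => 5, "was" => 1, "where" => 6)

println((a + b)["the"])    # count of "the" in the summed vector
println(a["unseen"])       # a missing word falls back to zero
println(euclidean(a, b))   # sqrt(91), the distance quoted in your example
```

Because the type is parameterised on the dictionary it wraps, the choice
of backing structure can be changed later without touching the vector
operations.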

On Mon, Apr 13, 2015 at 4:41 PM, Mark Tabor <mtab...@slu.edu> wrote:

> Hello All,
>
> I am looking for an efficient way to represent vectors that exist in an
> infinite dimensional space.  Specifically I am working with large amounts
> of text data and will be receiving a lot of data that contains previously
> unseen words.  Each text represents a vector that exists in the space of
> all possible strings and each word in the text represents a dimension.  As
> such these vectors are extremely sparse.  Currently we handle this by using
> a dictionary to represent each text as a bag of words
> <http://en.wikipedia.org/wiki/Bag-of-words_model> vector.  If a word does
> not exist in the vector we return zero.  This allows us to perform
> computations like so:
>
> ["the"=>3,"and"=>2,"is"=>4] + ["this"=>5,"was"=>1,"where"=>6] =
> ["where"=>6,"the"=>3,"is"=>4,"this"=>5,"was"=>1,"and"=>2]
>
> euclidean(["the"=>3,"and"=>2,"is"=>4], ["this"=>5,"was"=>1,"where"=>6]) =
> 9.539392014169456
>
> Is a dictionary the proper associative structure, or should we use a
> different data structure like a JudyArray or Trie?
>
> -MT
>
>
>
