I'm trying to apply Spark to an NLP problem I'm working on. I have nearly 
4 million tweets and I have converted them into word vectors. The vectors are 
pretty sparse, because each message has only a few dozen words while the 
vocabulary has tens of thousands of words.
These vectors have to be loaded every time my program processes the data. I 
stack them into a 50k (vocabulary size) x 4M (message count) sparse matrix with 
scipy.sparse and persist it on disk, for two reasons: 1) it only takes about 
400 MB of disk space, and 2) loading and parsing it is really fast (I convert 
it to a csr_matrix and index one row per message).
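For reference, this is roughly how I build and persist it (a sketch only; 
vectorize() stands in for my actual tweet -> word-vector step):

    import numpy as np
    from scipy import sparse

    # vectorize() is a placeholder for my real featurizer.
    def save_matrix(messages, vocab_size, path="tweets.npz"):
        rows = [vectorize(m, vocab_size) for m in messages]   # one sparse row per message
        matrix = sparse.vstack(rows).tocsr()
        np.savez(path, data=matrix.data, indices=matrix.indices,
                 indptr=matrix.indptr, shape=matrix.shape)    # ~400 MB on disk in my case

    def load_matrix(path="tweets.npz"):
        f = np.load(path)
        return sparse.csr_matrix((f["data"], f["indices"], f["indptr"]),
                                 shape=tuple(f["shape"]))     # matrix[i] -> message i's vector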
This works well on my local machine with plain Python and scipy/numpy. 
However, Spark does not seem to support scipy.sparse directly. I still use a 
csr_matrix, and I can extract a specific row and convert it to a numpy array 
efficiently. But when I try to parallelize the matrix, Spark fails with: sparse 
matrix length is ambiguous; use getnnz() or shape[0].
A csr_matrix does not support len(), so Spark cannot partition it.
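A minimal sketch of what fails (using the load_matrix() helper above):

    from pyspark import SparkContext

    sc = SparkContext()
    matrix = load_matrix("tweets.npz")
    rdd = sc.parallelize(matrix)   # fails: parallelize calls len() on its input, and
                                   # len(csr_matrix) raises "sparse matrix length is
                                   # ambiguous; use getnnz() or shape[0]"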
For now I use the matrix as a broadcast variable (it is relatively small for 
memory) and parallelize xrange(0, matrix.shape[0]), indexing into the broadcast 
matrix inside my map function.
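Concretely, the workaround looks like this (sc and matrix as above):

    # Broadcast the whole csr_matrix, parallelize only the row indices.
    bc = sc.broadcast(matrix)
    indices = sc.parallelize(xrange(0, matrix.shape[0]))
    vectors = indices.map(lambda i: bc.value[i].toarray().ravel())  # dense vector for message i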
Is there a better solution?
Thanks.                                           
