I'm trying to apply Spark to an NLP problem I'm working on. I have nearly
4 million tweets and I have converted their text into word vectors. The
vectors are very sparse, because each message contains only a few dozen words
while the vocabulary has tens of thousands of words.
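Roughly, the vectorization step looks like this (illustrated here with
scikit-learn's CountVectorizer on toy data; the real input is the 4M tweets):

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-in for the ~4M tweets (hypothetical data):
    tweets = [
        "spark does not like scipy sparse matrices",
        "sparse matrices keep the disk footprint small",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(tweets)  # scipy CSR matrix of token counts
    print(X.shape, X.nnz)  # very sparse: a few words per row vs. a huge vocabulary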
These vectors have to be loaded every time my program processes the data. I
stack them into a 50k (vocabulary size) x 4M (message count) sparse matrix with
scipy.sparse and persist it on disk, for two reasons: 1) it only takes 400 MB
of disk space, and 2) loading and parsing it is really fast (I convert it to a
csr_matrix and index it by row to get each message's vector).
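The persistence step is essentially this (a minimal sketch with a stand-in
matrix; the file name is made up, and save_npz/load_npz assume scipy >= 0.19):

    import scipy.sparse

    # Stand-in for the real stacked matrix:
    X = scipy.sparse.random(4, 10, density=0.3, format="csr")

    # save_npz keeps the CSR structure on disk, so the file stays compact
    # (~400 MB for my data) and reloads quickly.
    scipy.sparse.save_npz("tweet_vectors.npz", X)

    X = scipy.sparse.load_npz("tweet_vectors.npz")
    row = X.getrow(0).toarray().ravel()  # one row as a dense numpy array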
This works well on my local machine with plain Python and scipy/numpy.
However, Spark does not seem to support scipy.sparse directly. I can still
extract a specific row from the csr_matrix and convert it to a numpy array
efficiently, but when I try to parallelize the matrix, Spark raises:
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0].
Since csr_matrix does not support len(), Spark cannot partition it.
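Here is a minimal reproduction (with a tiny stand-in matrix; the real one is
the 50k x 4M matrix above):

    from pyspark import SparkContext
    from scipy.sparse import csr_matrix
    import numpy as np

    sc = SparkContext("local", "sparse-test")
    mat = csr_matrix(np.eye(4))  # tiny stand-in for the real matrix

    # This is the call that fails, since it ends up calling len() on the matrix:
    sc.parallelize(mat)
    # TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]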
For now I use this matrix as a broadcast variable (it is small enough to fit
in memory) and parallelize an xrange(0, matrix.shape[0]) list, indexing into
the broadcast matrix inside the map function.
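Concretely, the workaround looks like this (a sketch with a stand-in matrix;
fetch_row is just an illustrative name):

    from pyspark import SparkContext
    from scipy.sparse import csr_matrix
    import numpy as np

    sc = SparkContext("local", "broadcast-workaround")
    mat = csr_matrix(np.eye(4))  # stand-in for the real matrix

    bc = sc.broadcast(mat)  # ship the whole CSR matrix to every executor once

    def fetch_row(i):
        # Look up one row in the broadcast matrix and densify it.
        return bc.value.getrow(i).toarray().ravel()

    indices = sc.parallelize(range(mat.shape[0]))  # xrange on Python 2
    dense_rows = indices.map(fetch_row)
    print(dense_rows.first())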
Is there a better solution?
Thanks.