MLlib supports sparse vectors: it recognizes both its own SparseVector type and a SciPy csc_matrix with a single column. You can create an RDD of sparse vectors for your data and save/load them to/from Parquet using DataFrames. Sparse matrix support will be added in 1.4. -Xiangrui
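For example (a minimal sketch, assuming a live `sc`/`sqlContext` and the 1.4-style DataFrame reader/writer; the vocabulary size and Parquet path are placeholders):

    from pyspark.mllib.linalg import Vectors

    # One SparseVector per message: (size, indices, values).
    rows = [
        Vectors.sparse(50000, [0, 7, 42], [1.0, 2.0, 1.0]),
        Vectors.sparse(50000, [3, 9], [1.0, 3.0]),
    ]
    rdd = sc.parallelize(rows)

    # MLlib vectors can live in a DataFrame column, so Parquet save/load is direct.
    df = sqlContext.createDataFrame(rdd.map(lambda v: (v,)), ["features"])
    df.write.parquet("/path/to/tweet_vectors.parquet")

    # Each row of the loaded DataFrame carries its SparseVector back.
    loaded = sqlContext.read.parquet("/path/to/tweet_vectors.parquet")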
On Mon, Apr 6, 2015 at 7:58 AM, SecondDatke <lovejay-lovemu...@outlook.com> wrote:
> I'm trying to apply Spark to an NLP problem I'm working on. I have
> nearly 4 million tweets, and I have converted their text into word
> vectors. The data is quite sparse: each message has only dozens of
> words, while the vocabulary has tens of thousands of words.
>
> These vectors must be loaded each time my program handles the data. I
> stack them into a 50k (vocabulary size) x 4M (message count) sparse
> matrix with scipy.sparse and persist it on disk for two reasons: 1) it
> costs only about 400 MB of disk space, and 2) loading and parsing it is
> really fast (I convert it to a csr_matrix and index each row for the
> messages).
>
> This works well on my local machine with plain Python and scipy/numpy.
> However, it seems Spark does not support scipy.sparse directly. Again,
> I used a csr_matrix, and I can extract a specific row and convert it to
> a numpy array efficiently. But when I parallelized it, Spark raised an
> error: sparse matrix length is ambiguous; use getnnz() or shape[0].
>
> csr_matrix does not support len(), so Spark cannot partition it.
>
> For now I use this matrix as a broadcast variable (it is relatively
> small for the available memory) and parallelize an
> xrange(0, matrix.shape[0]) list to index the matrix in the map
> function.
>
> Is there a better solution?
>
> Thanks.
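The broadcast-variable workaround described above, as a minimal sketch (assuming a live Python 2 SparkContext `sc`; the matrix is a toy stand-in for the real message matrix, and `process_row` is a hypothetical per-message function):

    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy stand-in: one message per row.
    mat = csr_matrix(np.eye(4))

    # Ship the whole matrix to each executor once instead of partitioning it.
    bmat = sc.broadcast(mat)

    def process_row(i):
        # Extract one message's vector from the broadcast matrix.
        row = bmat.value.getrow(i).toarray().ravel()
        return float(row.sum())  # placeholder computation

    # Partition the row indices instead of the matrix; csr_matrix has no
    # len(), but an xrange does.
    results = sc.parallelize(xrange(mat.shape[0])).map(process_row).collect()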