MLlib supports sparse vectors: it recognizes both its own SparseVector type and a SciPy csc_matrix with a single column. You can create an RDD of sparse vectors for your data and save/load them to/from Parquet using DataFrames. Sparse matrix support will be added in 1.4. -Xiangrui
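For example (a minimal sketch, assuming a live `sc`/`sqlContext` and the 1.4-style DataFrame reader/writer; the vocabulary size and Parquet path are placeholders):

    from pyspark.mllib.linalg import Vectors

    # One SparseVector per message: (size, indices, values).
    rows = [
        Vectors.sparse(50000, [0, 7, 42], [1.0, 2.0, 1.0]),
        Vectors.sparse(50000, [3, 9], [1.0, 3.0]),
    ]
    rdd = sc.parallelize(rows)

    # MLlib vectors can live in a DataFrame column, so Parquet save/load is direct.
    df = sqlContext.createDataFrame(rdd.map(lambda v: (v,)), ["features"])
    df.write.parquet("/path/to/tweet_vectors.parquet")

    # Each row of the loaded DataFrame carries its SparseVector back.
    loaded = sqlContext.read.parquet("/path/to/tweet_vectors.parquet")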
On Mon, Apr 6, 2015 at 7:58 AM, SecondDatke <lovejay-lovemu...@outlook.com> wrote:
> I'm trying to apply Spark to an NLP problem I'm working on. I have
> nearly 4 million tweets, and I have converted their text into word
> vectors. The data is quite sparse: each message has only dozens of
> words, while the vocabulary has tens of thousands of words.
>
> These vectors must be loaded each time my program handles the data. I
> stack them into a 50k (vocabulary size) x 4M (message count) sparse
> matrix with scipy.sparse and persist it on disk for two reasons: 1) it
> costs only about 400 MB of disk space, and 2) loading and parsing it is
> really fast (I convert it to a csr_matrix and index each row for the
> messages).
>
> This works well on my local machine with plain Python and scipy/numpy.
> However, it seems Spark does not support scipy.sparse directly. Again,
> I used a csr_matrix, and I can extract a specific row and convert it to
> a numpy array efficiently. But when I parallelized it, Spark raised an
> error: sparse matrix length is ambiguous; use getnnz() or shape[0].
>
> csr_matrix does not support len(), so Spark cannot partition it.
>
> For now I use this matrix as a broadcast variable (it is relatively
> small for the available memory) and parallelize an
> xrange(0, matrix.shape[0]) list to index the matrix in the map
> function.
>
> Is there a better solution?
>
> Thanks.
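The broadcast-variable workaround described above, as a minimal sketch (assuming a live Python 2 SparkContext `sc`; the matrix is a toy stand-in for the real message matrix, and `process_row` is a hypothetical per-message function):

    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy stand-in: one message per row.
    mat = csr_matrix(np.eye(4))

    # Ship the whole matrix to each executor once instead of partitioning it.
    bmat = sc.broadcast(mat)

    def process_row(i):
        # Extract one message's vector from the broadcast matrix.
        row = bmat.value.getrow(i).toarray().ravel()
        return float(row.sum())  # placeholder computation

    # Partition the row indices instead of the matrix; csr_matrix has no
    # len(), but an xrange does.
    results = sc.parallelize(xrange(mat.shape[0])).map(process_row).collect()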