Re: SVD on larger than taller matrix

2014-09-18 Thread Li Pu
The main bottleneck of the current SVD implementation is the memory of the
driver node. It requires at least 5*n*k doubles in driver memory, because
all right singular vectors are stored on the driver and some working memory
is needed on top of that. So it is bounded by the smaller dimension of your
matrix and k. For the worker nodes, the memory requirement should be much
smaller, as long as you can distribute your sparse matrix across the
workers' memory. If possible, ask for more memory on the driver node while
keeping the worker nodes' memory small.

Meanwhile we are working on removing this limitation by implementing
distributed QR and Lanczos in Spark.
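
To put rough numbers on that, here is a back-of-the-envelope sketch (plain
Scala; 8 bytes per double, JVM overhead ignored) of what the 5*n*k figure
implies for the driver:

// Rough estimate of the driver memory implied by the 5*n*k-doubles figure above.
def svdDriverMemoryGB(n: Long, k: Int): Double =
  5.0 * n * k * 8 / (1024.0 * 1024 * 1024)   // 8 bytes per double

// For the matrix discussed in this thread: n ~= 3.2 million columns, k = 200
println(svdDriverMemoryGB(3231961L, 200))    // ~24 GB, before any JVM overhead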

On Thu, Sep 18, 2014 at 1:26 PM, Xiangrui Meng  wrote:

> Did you cache `features`? Without caching it is slow because we need
> O(k) iterations. The storage requirement on the driver is about 2 * n
> * k = 2 * 3 million * 200 ~= 9GB, not considering any overhead.
> Computing U is also an expensive task in your case. We should use some
> randomized SVD implementation for your data, but this is not available
> now. I would recommend setting driver-memory 25g, caching `features`,
> and using a smaller k. -Xiangrui
>
> On Thu, Sep 18, 2014 at 1:02 PM, Glitch  wrote:
> > I have a sparse matrix of about 2 million+ rows and 3 million+ columns in
> > svm format*. As I understand it, running SVD on such a matrix shouldn't be
> > a problem since version 1.1.
> >
> > I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was
> > able to compute the SVD for 20 singular values, but it fails with a Java
> > Heap Size error for 200 singular values. I'm currently trying 100.
> >
> > So my question is this: what kind of cluster do you need to perform this
> > task?
> > As I don't have any measurable experience with Spark, I can't say whether
> > this is normal: my test for 100 singular values has been running for over
> > an hour.
> >
> > I'm using this dataset
> > http://archive.ics.uci.edu/ml/datasets/URL+Reputation
> >
> > I'm using the spark-shell with --executor-memory 15G --driver-memory 15G
> >
> >
> > And the few lines of codes are
> > import org.apache.spark.mllib.linalg.distributed.RowMatrix
> > import org.apache.spark.mllib.util.MLUtils
> > val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
> > val features = data.map(line => line.features)
> > val mat = new RowMatrix(features)
> > val svd = mat.computeSVD(200, computeU = true)
> >
> >
> > svm format:  :value
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/SVD-on-larger-than-taller-matrix-tp14611.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.


-- 
Li
@vrilleup


Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Li Pu
@Miles, eigen-decomposition of an asymmetric matrix doesn't always give
real-valued solutions, and it lacks the nice properties that a symmetric
matrix has. Usually you want to symmetrize your asymmetric matrix in some
way, e.g. see
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_ZhouHS05.pdf
But as Sean mentioned, you can always compute the largest eigenpair with the
power method or some variation of it such as PageRank, which is already
implemented in GraphX.
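
For illustration only, here is a minimal power-iteration sketch over plain
Spark RDDs (this is not the GraphX or MLlib API; the helper below is
hypothetical, and it assumes the dominant eigenvalue is real and simple, as
it is for non-negative matrices by Perron-Frobenius):

import org.apache.spark.SparkContext._   // for reduceByKey on pre-1.3 Spark
import org.apache.spark.rdd.RDD

// Largest eigenpair of a square matrix given as (row, col, value) entries.
def powerIteration(entries: RDD[(Long, Long, Double)], n: Long,
                   iters: Int = 50): (Double, Map[Long, Double]) = {
  val sc = entries.sparkContext
  var v: Map[Long, Double] =
    (0L until n).map(i => i -> 1.0 / math.sqrt(n.toDouble)).toMap
  var lambda = 0.0
  for (_ <- 1 to iters) {
    val bv = sc.broadcast(v)
    // w = A * v, computed as a distributed sum over the matrix entries
    val w = entries
      .map { case (i, j, a) => (i, a * bv.value.getOrElse(j, 0.0)) }
      .reduceByKey(_ + _)
      .collectAsMap()
    lambda = w.map { case (i, wi) => v.getOrElse(i, 0.0) * wi }.sum // Rayleigh quotient v^T A v
    val norm = math.sqrt(w.values.map(x => x * x).sum)
    v = (0L until n).map(i => i -> w.getOrElse(i, 0.0) / norm).toMap
  }
  (lambda, v)
}

// Usage sketch: val (lambdaMax, eigvec) = powerIteration(entriesRdd, n = 5000)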


On Fri, Aug 8, 2014 at 2:50 AM, Sean Owen  wrote:

> The SVD does not in general give you eigenvalues of its input.
>
> Are you just trying to access the U and V matrices? They are also
> returned in the API. But they are not the eigenvectors of M, as you
> note.
>
> I don't think MLlib has anything to help with the general eigenvector
> problem.
> Maybe you can implement a sort of power iteration algorithm using
> GraphX to find the largest eigenvector?
>
> On Fri, Aug 8, 2014 at 4:07 AM, Chunnan Yao  wrote:
> > Hi there, what you've suggested is all meaningful. But to make myself
> > clearer, my essential problems are:
> > 1. My matrix is asymmetric, and it is a probabilistic adjacency matrix
> > whose entries (a_ij) represent the likelihood that user j will broadcast
> > the information generated by user i. Apparently, a_ij and a_ji are
> > different, because "I love you" doesn't necessarily mean you love me
> > (what a sad story~). All entries are real.
> > 2. I know I can get eigenvalues through SVD. My problem is that I can't
> > get the corresponding eigenvectors, which requires solving equations, and
> > I also need the eigenvectors in my calculation. In my simulation of this
> > paper, I only need the biggest eigenvalues and corresponding eigenvectors.
> > The paper posted by Shivaram Venkataraman is also concerned with symmetric
> > matrices. Could anyone help me out?
>


-- 
Li
@vrilleup


Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Li Pu
@Miles, the latest SVD implementation in MLlib is partially distributed.
The matrix-vector multiplications are computed across the workers, but the
right singular vectors are all stored on the driver. If your symmetric
matrix is n x n and you want the first k eigenvalues, you will need to fit
n x k doubles in the driver's memory. Behind the scenes, it calls ARPACK to
compute the eigen-decomposition of A^T A. You can look at the source code
for the details.
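
As a side note on the A^T A route: if A = U Sigma V^T, then
A^T A = V Sigma^2 V^T, so the eigenvalues ARPACK works with are the squared
singular values. A minimal spark-shell sketch (so `sc` is assumed available,
and the tiny matrix is made up) of reading the singular values back out of
computeSVD:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Tiny example matrix, just to show the call and the sigma/eigenvalue relation.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 0.0),
  Vectors.dense(0.0, 1.0, 3.0),
  Vectors.dense(2.0, 0.0, 1.0)))
val mat = new RowMatrix(rows)

val svd = mat.computeSVD(2, computeU = false)
val sigma = svd.s.toArray                    // singular values of A
val gramEig = sigma.map(s => s * s)          // eigenvalues of A^T A = V * Sigma^2 * V^T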

@Sean, the SVD++ implementation in GraphX is not the canonical definition
of SVD. It doesn't have the orthogonality that the SVD guarantees. But we
might want to use GraphX as the underlying matrix representation for
mllib.SVD to address the problem of skewed entry distributions.


On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks 
wrote:

> Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
> SVD (
> http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
> which is in MLlib (Spark 1.0), with a distributed sparse SVD coming in Spark
> 1.1 (https://issues.apache.org/jira/browse/SPARK-1782). If your data is
> sparse (which it often is in social networks), you may have better luck
> with this.
>
> I haven't tried the GraphX implementation, but those algorithms are often
> well-suited for power-law distributed graphs as you might see in social
> networks.
>
> FWIW, I believe you need to square the elements of the sigma matrix from
> the SVD to get the eigenvalues.
>
>
>
>
> On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen  wrote:
>
>> (-incubator, +user)
>>
>> If your matrix is symmetric (and real I presume), and if my linear
>> algebra isn't too rusty, then its SVD is its eigendecomposition. The
>> SingularValueDecomposition object you get back has U and V, both of
>> which have columns that are the eigenvectors.
>>
>> There are a few SVDs in the Spark code. The one in mllib is not
>> distributed (right?) and is probably not an efficient means of
>> computing eigenvectors if you really just want a decomposition of a
>> symmetric matrix.
>>
>> The one I see in graphx is distributed? I haven't used it though.
>> Maybe it could be part of a solution.
>>
>>
>>
>> On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan  wrote:
>> > Our lab needs to do some simulation on online social networks. We need
>> > to handle a 5000*5000 adjacency matrix, namely, to get its largest
>> > eigenvalue and corresponding eigenvector. Matlab can be used, but it is
>> > time-consuming. Is Spark effective for linear algebra calculations and
>> > transformations? Later we will have a 500*500 matrix to process. It
>> > seems urgent that we should find some distributed computation platform.
>> >
>> > I see SVD has been implemented, and I can get eigenvalues of a matrix
>> > through this API. But when I want to get both eigenvalues and
>> > eigenvectors, or at least the biggest eigenvalue and the corresponding
>> > eigenvector, it seems that current Spark doesn't have such an API. Is it
>> > possible for me to write eigenvalue decomposition from scratch? What
>> > should I do? Thanks a lot!
>> >
>> >
>> > Miles Yao
>> >
>> >
>> > View this message in context: How can I implement eigenvalue
>> > decomposition in Spark?
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>


-- 
Li
@vrilleup


Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Li Pu
I like the idea of using Scala to drive the workflow. Spark already comes
with a scheduler, so why not write a plugin to schedule other types of tasks
(copying files, sending email, etc.)? Scala could handle any logic required
by the pipeline, and passing objects (including RDDs) between tasks is also
easier. I don't know whether this is an overuse of the Spark scheduler, but
it sounds like a good tool. The only issue would be releasing resources that
are not used at intermediate steps.
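
As a toy illustration of what a Scala-driven pipeline could look like (all
task names below are hypothetical placeholders, not an existing library):

import scala.util.{Try, Success, Failure}

// Placeholder tasks; real ones would submit Spark jobs, copy files, etc.
def runJobA(): Try[Unit] = Try { /* Spark job A */ }
def runJobB(): Try[Unit] = Try { /* Spark job B */ }
def runScriptC(): Try[Unit] = Try { import sys.process._; "echo step C".!; () }
def runD(): Try[Unit] = Try { /* step D */ }
def runE(): Try[Unit] = Try { /* fallback step E */ }
def sendEmail(msg: String): Unit = println(s"EMAIL: $msg") // stand-in mailer

runJobA() match {
  case Failure(e) =>
    sendEmail(s"Job A failed: ${e.getMessage}")             // "send email F"
  case Success(_) =>
    val result = runJobB()
      .flatMap(_ => runScriptC())
      .flatMap(_ => runD().recoverWith { case _ => runE() }) // if D fails, do E
    result match {
      case Failure(e) => sendEmail(s"Pipeline failed: ${e.getMessage}")
      case Success(_) => println("Pipeline finished.")
    }
}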

On Fri, Jul 11, 2014 at 12:05 PM, Wei Tan  wrote:

> Just curious: how about using scala to drive the workflow? I guess if you
> use other tools (oozie, etc) you lose the advantage of reading from RDD --
> you have to read from HDFS.
>
> Best regards,
> Wei
>
> -
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>
>
>
> From:"k.tham" 
> To:u...@spark.incubator.apache.org,
> Date:07/10/2014 01:20 PM
> Subject:Recommended pipeline automation tool? Oozie?
> --
>
>
>
> I'm just wondering what's the general recommendation for data pipeline
> automation.
>
> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
> and
> if D fails, do E, and if Job A fails, send email F, etc...
>
> It looks like Oozie might be the best choice. But I'd like some
> advice/suggestions.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


-- 
Li
@vrilleup


Re: running SparkALS

2014-04-28 Thread Li Pu
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1

One thing which is undocumented: the integers representing users and
items have to be positive. Otherwise it throws exceptions.
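
For reference, a minimal MLlib ALS sketch with positive integer user and
item IDs (spark-shell style, so `sc` is assumed available; the tiny ratings
set is made up):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Note that all user and product IDs below are positive integers.
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0),
  Rating(1, 2, 1.0),
  Rating(2, 1, 4.0),
  Rating(2, 3, 3.0)))

val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 10, /* lambda = */ 0.01)
println(model.predict(2, 2))   // predicted rating for user 2, item 2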

Li

On 28 avr. 2014, at 10:30, Diana Carroll  wrote:

> Hi everyone.  I'm trying to run some of the Spark example code, and most of 
> it appears to be undocumented (unless I'm missing something).  Can someone 
> help me out?
>
> I'm particularly interested in running SparkALS, which wants parameters:
> M U F iter slices
>
> What are these variables? They appear to be integers and the default values
> are 100, 500 and 10 respectively, but beyond that... huh?
>
> Thanks!
>
> Diana