Re: SVD on larger than taller matrix

2014-09-18 Thread Li Pu
The main bottleneck of the current SVD implementation is the memory of the driver node. It requires at least 5*n*k doubles in driver memory, because all right singular vectors are stored on the driver and some working memory is needed on top of that. So it is bounded by the smaller dimension of your matrix
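To get a feel for the 5*n*k figure, here is a rough back-of-the-envelope sketch. It assumes n is the length of each right singular vector, k is the number of singular values requested, and a double takes 8 bytes; the function name `min_driver_bytes` is made up for illustration and is not part of MLlib.

```python
BYTES_PER_DOUBLE = 8  # size of an IEEE 754 double


def min_driver_bytes(n: int, k: int, factor: int = 5) -> int:
    """Lower bound on driver memory for factor * n * k doubles.

    `factor` defaults to the 5 quoted in the message; it is an
    estimate from the thread, not a value read from the library.
    """
    return factor * n * k * BYTES_PER_DOUBLE


if __name__ == "__main__":
    # Hypothetical sizes: n = 1,000,000 and k = 20.
    needed = min_driver_bytes(1_000_000, 20)
    print(f"{needed / 1e9:.1f} GB")
```

With those hypothetical sizes the estimate is already close to a gigabyte, so the driver heap needs to be sized with n and k in mind.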

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Li Pu
@Miles, eigen-decomposition of an asymmetric matrix doesn't always give real-valued solutions, and it lacks the nice properties that a symmetric matrix has. Usually you want to symmetrize your asymmetric matrix in some way, e.g. see http://machinelearning.wustl.edu/mlpapers/paper_files/icml200
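For illustration, two common ways to symmetrize a matrix are sketched below: averaging with the transpose, (A + A^T) / 2, and forming the Gram matrix A^T A. These are generic textbook choices picked as an assumption; they are not necessarily the method in the linked paper, and which one is appropriate depends on what the matrix represents.

```python
def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def symmetrize_average(a):
    """Return (A + A^T) / 2 for a square matrix A."""
    at = transpose(a)
    n = len(a)
    return [[(a[i][j] + at[i][j]) / 2.0 for j in range(n)] for i in range(n)]


def gram(a):
    """Return A^T A, which is symmetric for any (even non-square) A."""
    cols = transpose(a)  # cols[i] is column i of A
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


if __name__ == "__main__":
    a = [[1.0, 4.0],
         [2.0, 3.0]]
    print(symmetrize_average(a))
    print(gram(a))
```

Both results have real eigenvalues because they are symmetric, but they answer different questions: the eigenvalues of A^T A are the squared singular values of A, while those of the average have no such direct relation to A.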

Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Li Pu
@Miles, the latest SVD implementation in MLlib is partially distributed. Matrix-vector multiplication is distributed across all workers, but the right singular vectors are all stored in the driver. If your symmetric matrix is n x n and you want the first k eigenvalues, you will need to fit n x k double
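The split described here (matrix-vector products on the workers, vectors kept on the driver) can be sketched in plain Python. This is only an illustration of the pattern, not MLlib's code: power iteration is swapped in as the simplest iterative eigen-solver, it finds just the largest eigenvalue, and lists of rows stand in for partitions of a distributed matrix. All function names are hypothetical.

```python
import math


def partial_matvec(rows, v):
    """Work for one 'worker': dot each of its rows with the vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]


def distributed_matvec(partitions, v):
    """'Driver' side: collect per-partition results into the full A @ v."""
    out = []
    for rows in partitions:  # in Spark this loop would run in parallel
        out.extend(partial_matvec(rows, v))
    return out


def top_eigenpair(partitions, n, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric n x n matrix."""
    v = [1.0 / math.sqrt(n)] * n  # unit-norm start vector, held by the driver
    lam = 0.0
    for _ in range(iters):
        w = distributed_matvec(partitions, v)
        lam = sum(a * b for a, b in zip(v, w))  # Rayleigh quotient, since |v| = 1
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return lam, v


if __name__ == "__main__":
    # A 2 x 2 symmetric matrix split into two single-row partitions.
    partitions = [[[2.0, 0.0]], [[0.0, 1.0]]]
    lam, v = top_eigenpair(partitions, n=2)
    print(lam, v)
```

Only the length-n vectors ever live on the driver in this sketch, which is why the driver's memory, rather than the cluster's, limits how large n and k can be.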

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Li Pu
I like the idea of using Scala to drive the workflow. Spark already comes with a scheduler, so why not write a plugin to schedule other types of tasks (copy file, send email, etc.)? Scala could handle any logic required by the pipeline. Passing objects (including RDDs) between tasks is also easier.

Re: running SparkALS

2014-04-28 Thread Li Pu
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1 One thing that is undocumented: the integers representing users and items have to be positive; otherwise it throws exceptions. Li On 28 Apr. 2014, at 10:30, Diana Carroll wrote: > Hi everyone. I'm trying to run som
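If the raw user or item IDs can be zero or negative, one workaround is to remap them to positive integers before building the ratings. The sketch below is plain Python with hypothetical helper names and (user, item, rating) tuples as an assumed input shape; it does not call any Spark API.

```python
def build_positive_index(raw_ids):
    """Map each distinct raw ID to a positive integer, starting at 1."""
    index = {}
    for raw in raw_ids:
        if raw not in index:
            index[raw] = len(index) + 1
    return index


def remap_ratings(ratings):
    """Rewrite (user, item, rating) tuples so that all IDs are positive.

    `ratings` must be a list (it is traversed more than once).
    Returns the remapped ratings plus both lookup tables, so that
    predictions can be translated back to the original IDs.
    """
    user_index = build_positive_index(u for u, _, _ in ratings)
    item_index = build_positive_index(i for _, i, _ in ratings)
    remapped = [(user_index[u], item_index[i], r) for u, i, r in ratings]
    return remapped, user_index, item_index


if __name__ == "__main__":
    raw = [(-5, 0, 4.0), (7, -2, 1.0), (-5, -2, 3.0)]
    remapped, users, items = remap_ratings(raw)
    print(remapped)
```

Keeping the two lookup tables around matters, because the model's output will refer to the remapped IDs rather than the original ones.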