The main bottleneck of the current SVD implementation is driver-node
memory: it requires at least 5*n*k doubles on the driver, because all
right singular vectors are stored there, plus some working memory.
So it is bounded by the smaller dimension of your matrix.
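As a rough sketch of that bound (the 5*n*k-doubles figure is from the message above; the concrete n and k below are made-up example sizes, not values from the thread):

```python
# Rough driver-memory estimate for the SVD setup described above.
# The factor 5*n*k doubles comes from the message; n and k here
# are hypothetical example sizes.
def svd_driver_memory_bytes(n, k, bytes_per_double=8):
    """At least 5 * n * k doubles must fit in driver memory."""
    return 5 * n * k * bytes_per_double

# Example: a matrix with smaller dimension n = 1,000,000, first k = 100
# singular values.
n, k = 1_000_000, 100
gb = svd_driver_memory_bytes(n, k) / 1e9
print(f"~{gb:.0f} GB of driver memory")  # ~4 GB
```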
@Miles, eigen-decomposition of an asymmetric matrix doesn't always give
real-valued solutions, and it lacks the nice properties that a
symmetric matrix has. Usually you want to symmetrize your asymmetric
matrix in some way, e.g. see
http://machinelearning.wustl.edu/mlpapers/paper_files/icml200
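One common symmetrization (an illustration of the idea, not necessarily the scheme used in the linked paper) is B = (A + A^T) / 2, which is always symmetric and therefore has real eigenvalues:

```python
# Symmetrize an asymmetric matrix as B = (A + A^T) / 2.
# This is one standard choice; the linked paper may use a different scheme.
def symmetrize(a):
    n = len(a)
    return [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]

A = [[0.0, 2.0],
     [4.0, 0.0]]
B = symmetrize(A)
print(B)  # [[0.0, 3.0], [3.0, 0.0]] -- symmetric, so its eigenvalues are real
```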
@Miles, the latest SVD implementation in MLlib is partially distributed.
Matrix-vector multiplication is computed across all workers, but the right
singular vectors are all stored on the driver. If your symmetric matrix is
n x n and you want the first k eigenvalues, you will need to fit n x k
doubles in driver memory.
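The split described above can be sketched as follows; the row-partitioning and function names are illustrative, not MLlib's actual internals:

```python
# Sketch of the "partially distributed" pattern described above: workers
# each hold a block of rows and compute partial A*v products, while the
# driver keeps the n x k basis vectors. Names are illustrative only.
def worker_matvec(row_block, v):
    """One worker multiplies its block of rows by the broadcast vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in row_block]

def distributed_matvec(row_blocks, v):
    """Driver concatenates the workers' partial results into A*v."""
    result = []
    for block in row_blocks:  # in Spark this would be a map over an RDD
        result.extend(worker_matvec(block, v))
    return result

# A 4 x 4 diagonal matrix split across two "workers".
blocks = [[[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]],
          [[0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 4.0]]]
v = [1.0, 1.0, 1.0, 1.0]
print(distributed_matvec(blocks, v))  # [1.0, 2.0, 3.0, 4.0]
```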
I like the idea of using Scala to drive the workflow. Spark already comes
with a scheduler; why not write a plugin to schedule other types of tasks
(copy a file, send an email, etc.)? Scala could handle any logic required by
the pipeline, and passing objects (including RDDs) between tasks is also easier.
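A minimal sketch of the idea, with plain functions standing in for tasks and objects flowing between them directly (this only illustrates the pattern; it is not Spark's scheduler API):

```python
# Minimal sketch of a code-driven workflow: tasks are plain functions,
# and each task's output object is passed straight to the next task
# instead of going through files. Not Spark's scheduler API.
def run_pipeline(tasks, initial):
    """Run tasks in order, feeding each task's result to the next."""
    result = initial
    for task in tasks:
        result = task(result)
    return result

def load(path):      return ["alice", "bob"]              # stand-in for an RDD
def transform(data): return [name.upper() for name in data]
def report(data):    return f"processed {len(data)} records"

print(run_pipeline([load, transform, report], "hdfs://input"))
# processed 2 records
```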
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1
One thing that is undocumented: the integers representing users and
items have to be positive; otherwise it throws exceptions.
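If your raw IDs are not positive integers already, one workaround (a sketch; the mapping scheme below is my own, not from the Spark docs) is to remap them before training:

```python
# Remap arbitrary user/item keys to positive integers before feeding ALS,
# since (as noted above) non-positive IDs cause exceptions.
# The mapping scheme here is illustrative, not from the Spark docs.
def build_id_map(keys):
    """Assign each distinct key a positive integer, starting at 1."""
    return {key: i for i, key in enumerate(sorted(set(keys)), start=1)}

raw_users = ["u-42", "u-7", "u-42", "guest"]
user_ids = build_id_map(raw_users)
print(user_ids)  # {'guest': 1, 'u-42': 2, 'u-7': 3}
```

Keep the map around so you can translate ALS's output back to the original keys.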
Li
On 28 avr. 2014, at 10:30, Diana Carroll wrote:
> Hi everyone. I'm trying to run som