Sab, I'm not sure what you require for the similarity metric or your use
case, but you can also look at spark-rowsimilarity or spark-itemsimilarity
(column-wise) here:
http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Thanks Debasish, Reza and Pat. In my case, I am doing an SVD and then doing
the similarities computation, so a rowSimilarities() would be a good fit;
looking forward to it.
In the meantime I will try to see if I can further limit the number of
similarities computed in some other fashion.
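A rough sketch of the SVD half of that pipeline in MLlib (k = 20 and the
`mat` handle are illustrative, not from the thread):

  import org.apache.spark.mllib.linalg.Matrices
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // mat is the input RowMatrix (hypothetical handle)
  val svd = mat.computeSVD(20, computeU = true)
  // Scaling the rows of U by the singular values gives the reduced-dimension
  // row embeddings that a rowSimilarities() would then compare.
  val projected: RowMatrix = svd.U.multiply(Matrices.diag(svd.s))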
Hi Sab,
The current method is optimized for having many rows and few columns. In
your case it is exactly the opposite. We are working on your case, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823
Your case is very common, so I will put some time into building it.
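Until that lands, one workaround is to transpose so that the small dimension
becomes the columns and then reuse columnSimilarities(). A minimal sketch,
assuming `mat` is the wide RowMatrix:

  import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

  // Build the transpose by hand: swap the row and column indices.
  val entries = mat.rows.zipWithIndex().flatMap { case (vec, i) =>
    vec.toArray.zipWithIndex.map { case (v, j) => MatrixEntry(j, i, v) }
  }
  val transposed = new CoordinateMatrix(entries).toRowMatrix()
  val rowSims = transposed.columnSimilarities()  // similarities between the original rows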
I am trying to compute column similarities on a 30 x 1000 RowMatrix of
DenseVectors. The size of the input RDD is 3.1 MB and it's all in one
partition. I am running on a single node with 15G, giving the driver 1G
and the executor 9G. This is on single-node Hadoop. In the first attempt
the
Sorry, I actually meant a 30 x 10000 matrix (missed a 0).
Regards
Sab
Hi Sab,
In this dense case, the output will contain 10K x 10K entries, i.e. 100
million doubles, which doesn't fit in 1GB with overheads (100 million
doubles is already ~800 MB raw, before any JVM object overhead).
For a dense matrix, columnSimilarities() scales quadratically in the number
of columns, so you need more memory across the cluster.
Reza
Hi Reza
I see that ((int, int), double) pairs are generated for any combination
that meets the criteria controlled by the threshold. But assuming even a
simple 1 x 10K matrix, that means I would need at least 12GB of memory per
executor for the flatMap, just for these pairs and excluding any other
overhead (10K columns can generate up to 10K x 10K = 100 million pairs, and
a boxed ((Int, Int), Double) tuple easily costs on the order of 120 bytes
on the JVM).
Column-based similarities work well if the number of columns is modest
(10K, 100K; we actually scaled it to 1.5M columns, but that really
stress-tests the shuffle and requires tuning the shuffle parameters)... You
can either use DIMSUM sampling or come up with your own threshold based on
your application.
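That sampling is exposed through the threshold overload on RowMatrix; a
minimal sketch (the 0.1 threshold is illustrative):

  // mat: a RowMatrix, as above. Entries whose cosine similarity is, with
  // high probability, below the threshold are sampled away, which shrinks
  // the shuffle considerably.
  val approxSims = mat.columnSimilarities(0.1)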
Hi Sabarish,
This works fine for me with smaller settings than those (30 x 1000 dense
matrix, 1GB driver, 1GB executor):
bin/spark-shell --driver-memory 1G --executor-memory 1G
Then running the following finished without trouble and in a few seconds.
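(The snippet itself didn't survive the quoting; a stand-in that reproduces
the same 30 x 1000 dense case, with random data and illustrative names,
could be:)

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // 30 rows x 1000 columns of random dense data, standing in for the lost snippet
  val rows = sc.parallelize(Seq.fill(30)(
    Vectors.dense(Array.fill(1000)(scala.util.Random.nextDouble()))))
  val mat = new RowMatrix(rows)
  val sims = mat.columnSimilarities()
  println(sims.entries.count())  // force the computation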
Are you sure your driver is actually getting the RAM?