Ayman - it's really a question of recommending users to products vs. products
to users. There will only be a difference if you're not doing all-to-all.
For example, if you're only producing the top N recommendations, then you
may recommend only the top N products per user, or the top N users per
product, which would give different results.
Oops - code should be:
val a = rdd.zipWithIndex().filter(s => 1 <= s._2 && s._2 <= 100)
On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin ilgan...@gmail.com wrote:
You can also select pieces of your RDD by first doing a zipWithIndex and
then doing a filter operation on the second element of the RDD.
For example, to select the first 100 elements:
val a = rdd.zipWithIndex().filter(s => 1 <= s && s <= 100)
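For anyone doing this from PySpark (most of the thread is Python), a rough
equivalent sketch would be:

# PySpark sketch: keep elements whose index is in [1, 100],
# then drop the index again.
a = (rdd.zipWithIndex()
        .filter(lambda s: 1 <= s[1] <= 100)
        .map(lambda s: s[0]))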
On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat wrote:
Thanks Ilya
Is there an advantage to, say, partitioning by users/products when you train?
Here are two alternatives I have
# Partition by user or product
from pyspark.mllib.recommendation import ALS, Rating

tot = (newrdd
    .map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), float(l[4]))))
    .partitionBy(50)
    .cache())
ratings = tot.values()
model = ALS.train(ratings, rank, numIterations)  # rank/numIterations are placeholders; the original line is cut off
How do you partition by product in Python?
The only API I see is partitionBy(50).
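For what it's worth, partitionBy() always hashes on the key of a pair RDD,
so partitioning by product is just a matter of keying on the product id
before calling it. A sketch, assuming the same field layout as the snippet
above (l[1] = user, l[2] = product, l[4] = rating):

from pyspark.mllib.recommendation import Rating

# Key on the product id instead of the user id before partitionBy;
# partitionBy hashes whatever the key is, so this partitions by product.
tot_by_product = (newrdd
    .map(lambda l: (int(l[2]), Rating(int(l[1]), int(l[2]), float(l[4]))))
    .partitionBy(50)
    .cache())
ratings = tot_by_product.values()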
On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com wrote:
Also, in my experiments it's much faster to do blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:
https://issues.apache.org/jira/browse/SPARK-4823
Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share?
Best
Ayman
On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan
sabarish.sasidha...@manthan.com wrote:
Nick is right. I too have implemented this way and it works just fine. In my
case, there can be even more products.
Also not sure how threading helps here, because Spark puts a partition on
each core. On each core there may be multiple threads if you are using
Intel hyper-threading, but I would let Spark handle the threading.
On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com
wrote:
We added SPARK-3066 for this. In 1.4 you should get code that does the
calculation via BLAS dgemm.
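For reference, the bulk-recommendation API that came out of SPARK-3066 looks
roughly like this. One hedge: the Scala methods shipped in 1.4, while the
PySpark wrappers below only appeared in a later release, so treat this as a
sketch:

# Batch top-10 recommendations in both directions, backed internally by
# blocked BLAS (dgemm); model is a trained MatrixFactorizationModel.
top_products_per_user = model.recommendProductsForUsers(10)
top_users_per_product = model.recommendUsersForProducts(10)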
On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
ayman.fara...@yahoo.com.invalid wrote:
Thanks Sabarish and Nick
Would you happen to have some code snippets that you can share?
Best
Ayman
On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com wrote:
Also, in my experiments it's much faster to do blocked BLAS through cartesian
rather than doing sc.union. Here are the details on the experiments:
https://issues.apache.org/jira/browse/SPARK-4823
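A hand-wavy numpy sketch of the cartesian-of-blocks idea; the block size and
the helper below are assumptions for illustration, not the actual SPARK-4823
code:

import numpy as np

BLOCK = 10000  # factor rows per block (assumed; tune for memory)

def to_blocks(features):
    # Group (id, factor) pairs into coarse blocks keyed by id range.
    return (features.map(lambda kv: (kv[0] // BLOCK, kv))
                    .groupByKey()
                    .mapValues(list))

def multiply(pair):
    (_, users), (_, prods) = pair
    uids, uvecs = zip(*users)
    pids, pvecs = zip(*prods)
    # One BLAS-backed dgemm per block pair instead of many tiny products.
    scores = np.array(uvecs).dot(np.array(pvecs).T)
    return (uids, pids, scores)

user_blocks = to_blocks(model.userFeatures())
prod_blocks = to_blocks(model.productFeatures())
block_scores = user_blocks.cartesian(prod_blocks).map(multiply)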
On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com
wrote:
Also not sure how threading helps here, because Spark puts a partition on
each core.
Thanks all for the help.
It turned out that using the numpy matrix multiplication made a huge
difference in performance. I suspect that numpy already uses BLAS-optimized
code.
Here is the Python code:
# This is where I load and directly test the predictions
myModel =
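The snippet is cut off above; a minimal sketch of loading a saved model and
scoring with a numpy multiply, where the path and names are assumptions:

import numpy as np
from pyspark.mllib.recommendation import MatrixFactorizationModel

myModel = MatrixFactorizationModel.load(sc, "hdfs:///models/als")  # hypothetical path

# Pull the product factors into one dense matrix on the driver, then score
# each user with a single numpy (BLAS-backed) matrix-vector multiply.
pids, pvecs = zip(*myModel.productFeatures().collect())
P = np.array(pvecs)                      # |products| x rank

def top_n(user_vec, n=10):
    scores = P.dot(np.array(user_vec))   # one gemv instead of a Python loop
    best = np.argsort(scores)[::-1][:n]
    return [(pids[i], float(scores[i])) for i in best]

# For large product sets, broadcast pids/P instead of closing over them.
recs = myModel.userFeatures().mapValues(top_n)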
Yup, numpy calls into BLAS for matrix multiply.
Sent from my iPad
On 18 Jun 2015, at 8:54 PM, Ayman Farahat ayman.fara...@yahoo.com wrote:
Thanks all for the help.
It turned out that using the numpy matrix multiplication made a huge
difference in performance. I suspect that numpy already uses BLAS-optimized
code.
Nick is right. I too have implemented this way and it works just fine. In
my case, there can be even more products. You simply broadcast blocks of
products to userFeatures.mapPartitions() and BLAS multiply in there to get
recommendations. In my case 10K products form one block. Note that you
would
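A sketch of that pattern: broadcast one block of product factors, then do a
single BLAS multiply per partition of userFeatures(). The block size, names,
and top-N step are assumptions:

import numpy as np

# One block of product factors (say the first 10K product ids), broadcast once.
block = model.productFeatures().filter(lambda kv: kv[0] < 10000).collect()
pids = [p for p, _ in block]
P = np.array([v for _, v in block])          # |block| x rank
P_bc = sc.broadcast((pids, P))

def score_partition(users):
    pids, P = P_bc.value
    users = list(users)
    if not users:
        return
    U = np.array([v for _, v in users])      # |partition users| x rank
    S = U.dot(P.T)                           # one BLAS gemm per partition
    for i, (uid, _) in enumerate(users):
        top = np.argsort(S[i])[::-1][:10]
        yield (uid, [(pids[j], float(S[i, j])) for j in top])

recs = model.userFeatures().mapPartitions(score_partition)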
I actually talk about this exact thing in a blog post here:
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
Keep in mind, you're actually doing a ton of math. Even with proper caching
and use of broadcast variables, this will be expensive.
One issue is that you broadcast the product vectors and then do a dot product
one-by-one with the user vector.
You should try forming a matrix of the item vectors and doing the dot product
as a matrix-vector multiply, which will make things a lot faster.
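In numpy terms, the difference is roughly this (shapes and names are
illustrative):

import numpy as np

rank, n_items = 50, 100000
user_vec = np.random.rand(rank)
item_vecs = [np.random.rand(rank) for _ in range(n_items)]

# Slow: one tiny dot product per item, dominated by Python loop overhead.
scores_slow = [v.dot(user_vec) for v in item_vecs]

# Fast: stack the item vectors once, then a single BLAS
# matrix-vector multiply.
item_matrix = np.vstack(item_vecs)           # n_items x rank
scores_fast = item_matrix.dot(user_vec)      # one gemv call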
Another optimisation that is