Thanks Ilya.
Is there an advantage to, say, partitioning by users/products when you train?
Here are the two alternatives I have:

# Alternative 1: partition by user (or product) id before training
tot = newrdd.map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), float(l[4])))).partitionBy(50).cache()
ratings = tot.values()
model = ALS.train(ratings, rank, numIterations)

# Alternative 2: use zipWithIndex, then partition by the index
tot = newrdd.map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), float(l[4]))))
bob = tot.zipWithIndex().map(lambda x: (x[1], x[0])).partitionBy(30)
ratings = bob.values()
model = ALS.train(ratings, rank, numIterations)


On Jun 28, 2015, at 8:24 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:

> You can also select pieces of your RDD by first doing a zipWithIndex and then 
> doing a filter operation on the second element of the RDD. 
> 
> For example to select the first 100 elements :
> 
> val a = rdd.zipWithIndex().filter { case (_, idx) => idx < 100 }
> On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat 
> <ayman.fara...@yahoo.com.invalid> wrote:
> How do you partition by product in Python?
> The only API I see is partitionBy(50).
> 
> On Jun 18, 2015, at 8:42 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> 
>> Also, in my experiments it's much faster to do blocked BLAS through cartesian 
>> rather than sc.union. Here are the details on the experiments:
>> 
>> https://issues.apache.org/jira/browse/SPARK-4823
>> 
>> On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das <debasish.da...@gmail.com> 
>> wrote:
>> Also, I am not sure how threading helps here, because Spark assigns a partition to 
>> each core. Each core may run multiple hardware threads if you are using 
>> Intel hyper-threading, but I would let Spark handle the threading.  
>> 
>> On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das <debasish.da...@gmail.com> 
>> wrote:
>> We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS 
>> dgemm based calculation.
>> 
>> On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat 
>> <ayman.fara...@yahoo.com.invalid> wrote:
>> Thanks Sabarish and Nick.
>> Would you happen to have some code snippets that you can share?
>> Best,
>> Ayman
>> 
>> On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan 
>> <sabarish.sasidha...@manthan.com> wrote:
>> 
>>> Nick is right. I too have implemented it this way and it works just fine. In 
>>> my case, there can be even more products. You simply broadcast blocks of 
>>> products to userFeatures.mapPartitions() and BLAS-multiply in there to get 
>>> recommendations. In my case, 10K products form one block. Note that you 
>>> would then have to union your recommendations. And if there are lots of product 
>>> blocks, you might also want to checkpoint once every few blocks.
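A minimal NumPy sketch of the blocked approach Sab describes (shapes and names are illustrative; in Spark, product_block would come from a broadcast variable and the multiply would run inside userFeatures.mapPartitions()):

```python
import numpy as np

rank = 10
user_factors = np.random.rand(1000, rank)    # user vectors in one partition
product_block = np.random.rand(100, rank)    # one broadcast block of products

# one level-3 BLAS call scores every user against every product in the block
scores = user_factors.dot(product_block.T)   # shape (1000, 100)
```

Each product block yields one such scores matrix; the per-block results are then unioned, as Sab notes.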
>>> 
>>> Regards
>>> Sab
>>> 
>>> On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath <nick.pentre...@gmail.com> 
>>> wrote:
>>> One issue is that you broadcast the product vectors and then do a dot 
>>> product one-by-one with the user vector.
>>> 
>>> You should try forming a matrix of the item vectors and doing the dot 
>>> product as a matrix-vector multiply which will make things a lot faster.
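The difference Nick points out, sketched with NumPy (sizes are illustrative):

```python
import numpy as np

rank = 10
user_vec = np.random.rand(rank)
item_vecs = [np.random.rand(rank) for _ in range(30000)]

# slow: one dot product per item, 30,000 tiny BLAS calls
slow_scores = [np.dot(user_vec, v) for v in item_vecs]

# fast: stack the item vectors once, then a single matrix-vector multiply
item_matrix = np.vstack(item_vecs)           # shape (30000, rank)
fast_scores = item_matrix.dot(user_vec)      # same numbers, one BLAS call
```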
>>> 
>>> Another optimisation that is available in 1.4 is a recommendProducts 
>>> method that blockifies the factors to make use of level-3 BLAS (i.e. 
>>> matrix-matrix multiply). I am not sure if this is available in the Python 
>>> API yet. 
>>> 
>>> But you can do a version yourself by using mapPartitions over user factors, 
>>> blocking the factors into sub-matrices and doing matrix multiply with item 
>>> factor matrix to get scores on a block-by-block basis.
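A rough NumPy sketch of that do-it-yourself version (block size and shapes are illustrative; in Spark each user sub-matrix would be built inside mapPartitions over the user factors):

```python
import numpy as np

rank = 10
user_factors = np.random.rand(5000, rank)    # all user factor vectors
item_matrix = np.random.rand(300, rank)      # item factor matrix
block_size = 1000

# score one sub-matrix of users at a time with a matrix-matrix multiply
blocks = [user_factors[i:i + block_size].dot(item_matrix.T)
          for i in range(0, user_factors.shape[0], block_size)]
all_scores = np.vstack(blocks)               # shape (5000, 300)
```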
>>> 
>>> Also, as Ilya says, more parallelism can help. I don't think LSH is really 
>>> necessary with only 30,000 items.
>>> 
>>> 
>>> 
>>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya 
>>> <ilya.gane...@capitalone.com> wrote:
>>> 
>>> I actually talk about this exact thing in a blog post here: 
>>> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
>>>  Keep in mind, you're actually doing a ton of math. Even with proper 
>>> caching and use of broadcast variables this will take a while depending on 
>>> the size of your cluster. To get real results you may want to look into 
>>> locality-sensitive hashing to limit your search space, and definitely look 
>>> into spinning up multiple threads to process your product features in 
>>> parallel to increase resource utilization on the cluster.
>>> 
>>> 
>>> 
>>> Thank you,
>>> Ilya Ganelin
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: afarahat [ayman.fara...@yahoo.com]
>>> Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
>>> To: user@spark.apache.org
>>> Subject: Matrix Multiplication and mllib.recommendation
>>> 
>>> Hello;
>>> I am trying to get predictions after running the ALS model.
>>> The model works fine. For the prediction/recommendation step, I have about
>>> 30,000 products and 90 million users.
>>> When I try to predict all of them, it fails.
>>> I have been trying to formulate the problem as a matrix multiplication, where
>>> I first get the product features, broadcast them, and then do a dot product.
>>> It's still very slow. Any reason why?
>>> Here is a code sample:
>>> 
>>> def doMultiply(x):
>>>     # dot the user vector against every broadcast product vector
>>>     a = []
>>>     for i in range(len(pf.value)):
>>>         a.append(numpy.dot(x, pf.value[i][1]))
>>>     return a
>>> 
>>> 
>>> myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
>>> #I need to select which products to broadcast but lets try all
>>> m1 = myModel.productFeatures().sample(False, 0.001)
>>> pf = sc.broadcast(m1.collect())
>>> uf = myModel.userFeatures()
>>> f1 = uf.map(lambda x : (x[0], doMultiply(x[1])))
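For reference, the per-row loop in doMultiply can be collapsed into a single matrix-vector multiply by stacking the broadcast product vectors once. A sketch with dummy data standing in for the broadcast value (pf_value here is a hypothetical stand-in for pf.value):

```python
import numpy

rank = 10
# stand-in for pf.value: a list of (productId, factor_vector) pairs
pf_value = [(i, numpy.random.rand(rank)) for i in range(1000)]
x = numpy.random.rand(rank)                  # one user's factor vector

# stack the product vectors once, then one matrix-vector multiply
prod_matrix = numpy.array([p[1] for p in pf_value])
scores = prod_matrix.dot(x)                  # replaces the Python-level loop
```

In the Spark version, the stacking could be done once on the driver before broadcasting, so every executor gets the matrix rather than a list of vectors.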
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> 
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>> 
>>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
>>> India ICT)
>>> +++
>> 
>> 
>> 
>> 
> 
