Ayman - it's really a question of recommending users to products vs. products to users. There will only be a difference if you're not doing all-to-all scoring. For example, if you're only producing the top N recommendations, then you may take the top N products per user or the top N users per product, which would give different results.
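For example, a rough, untested sketch of the two directions (assuming Spark 1.4+, where these methods are exposed in the Python API; the IDs are placeholders):

    from pyspark.mllib.recommendation import MatrixFactorizationModel

    model = MatrixFactorizationModel.load(sc, "FlurryModelPath")

    # Top N products for one user vs. top N users for one product;
    # the two result sets are generally different.
    topProductsForUser = model.recommendProducts(someUserId, 10)
    topUsersForProduct = model.recommendUsers(someProductId, 10)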
On Sun, Jun 28, 2015 at 8:34 AM Ayman Farahat <ayman.fara...@yahoo.com> wrote:

> Thanks Ilya.
> Is there an advantage to, say, partitioning by users/products when you
> train? Here are the two alternatives I have:
>
> # Partition by user or product
> tot = newrdd.map(lambda l:
>     (l[1], Rating(int(l[1]), int(l[2]), l[4]))).partitionBy(50).cache()
> ratings = tot.values()
> model = ALS.train(ratings, rank, numIterations)
>
> # Use zipWithIndex
> tot = newrdd.map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), l[4])))
> bob = tot.zipWithIndex().map(lambda x: (x[1], x[0])).partitionBy(30)
> ratings = bob.values()
> model = ALS.train(ratings, rank, numIterations)
>
> On Jun 28, 2015, at 8:24 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>
> You can also select pieces of your RDD by first doing a zipWithIndex and
> then doing a filter operation on the second element of the pair.
>
> For example, to select the first 100 elements:
>
> val a = rdd.zipWithIndex().filter { case (_, i) => i < 100 }.map(_._1)
>
> On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat <
> ayman.fara...@yahoo.com.invalid> wrote:
>
>> How do you partition by product in Python? The only API is
>> partitionBy(50).
>>
>> On Jun 18, 2015, at 8:42 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>> Also, in my experiments it's much faster to do blocked BLAS through
>> cartesian rather than doing sc.union. Here are the details on the
>> experiments:
>>
>> https://issues.apache.org/jira/browse/SPARK-4823
>>
>> On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Also, I'm not sure how threading helps here, because Spark assigns a
>>> partition to each core. There may be multiple hardware threads per
>>> core if you are using Intel hyperthreading, but I would let Spark
>>> handle the threading.
>>>
>>> On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> We added SPARK-3066 for this. In 1.4 you should get the code to do
>>>> BLAS dgemm-based calculation.
>>>>
>>>> On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat <
>>>> ayman.fara...@yahoo.com.invalid> wrote:
>>>>
>>>>> Thanks Sabarish and Nick.
>>>>> Would you happen to have some code snippets that you can share?
>>>>> Best,
>>>>> Ayman
>>>>>
>>>>> On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan <
>>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>>
>>>>> Nick is right. I too have implemented it this way and it works just
>>>>> fine. In my case there can be even more products. You simply
>>>>> broadcast blocks of products to userFeatures.mapPartitions() and
>>>>> BLAS-multiply in there to get recommendations. In my case, 10K
>>>>> products form one block. Note that you would then have to union
>>>>> your recommendations, and if there are lots of product blocks you
>>>>> might also want to checkpoint once every few blocks.
>>>>>
>>>>> Regards,
>>>>> Sab
>>>>>
>>>>> On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> One issue is that you broadcast the product vectors and then do a
>>>>>> dot product one-by-one with the user vector.
>>>>>>
>>>>>> You should try forming a matrix of the item vectors and doing the
>>>>>> dot product as a matrix-vector multiply, which will make things a
>>>>>> lot faster.
>>>>>>
>>>>>> Another optimisation that is available in 1.4 is a
>>>>>> recommendProducts method that blockifies the factors to make use
>>>>>> of level-3 BLAS (i.e. matrix-matrix multiply). I am not sure if
>>>>>> this is available in the Python API yet.
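>>>>>>
>>>>>> A rough, untested sketch of that matrix-vector version (reusing
>>>>>> the myModel and sc names from the original post quoted further
>>>>>> below; the i-th score lines up with pids[i]):
>>>>>>
>>>>>> import numpy
>>>>>>
>>>>>> # Collect and broadcast the item factors once, as a single
>>>>>> # (numProducts x rank) matrix plus an aligned list of product ids.
>>>>>> pf = myModel.productFeatures().collect()
>>>>>> pids = [p[0] for p in pf]
>>>>>> pmatB = sc.broadcast(numpy.array([p[1] for p in pf]))
>>>>>>
>>>>>> # One matrix-vector multiply per user instead of numProducts
>>>>>> # separate dot products.
>>>>>> scores = myModel.userFeatures().map(
>>>>>>     lambda u: (u[0], numpy.dot(pmatB.value, numpy.array(u[1]))))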
>>>>>>
>>>>>> But you can do a version yourself by using mapPartitions over the
>>>>>> user factors, blocking the factors into sub-matrices, and doing a
>>>>>> matrix multiply with the item factor matrix to get scores on a
>>>>>> block-by-block basis.
>>>>>>
>>>>>> Also, as Ilya says, more parallelism can help. I don't think it's
>>>>>> really necessary to do LSH with only 30,000 items.
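>>>>>>
>>>>>> A rough, untested sketch of that blocked mapPartitions version
>>>>>> (again reusing myModel and sc; scoreBlock and topN are
>>>>>> illustrative names):
>>>>>>
>>>>>> import numpy
>>>>>>
>>>>>> pf = myModel.productFeatures().collect()
>>>>>> pidsB = sc.broadcast([p[0] for p in pf])
>>>>>> pmatB = sc.broadcast(numpy.array([p[1] for p in pf]))
>>>>>>
>>>>>> def scoreBlock(userIter, topN=10):
>>>>>>     users = list(userIter)
>>>>>>     if not users:
>>>>>>         return
>>>>>>     umat = numpy.array([u[1] for u in users])  # blockSize x rank
>>>>>>     # Level-3 BLAS: one matrix-matrix multiply per user block.
>>>>>>     scores = umat.dot(pmatB.value.T)  # blockSize x numProducts
>>>>>>     for i, (uid, _) in enumerate(users):
>>>>>>         top = numpy.argsort(scores[i])[::-1][:topN]
>>>>>>         yield (uid, [(pidsB.value[j], float(scores[i][j]))
>>>>>>                      for j in top])
>>>>>>
>>>>>> recs = myModel.userFeatures().mapPartitions(scoreBlock)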
>>>>>>
>>>>>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya <
>>>>>> ilya.gane...@capitalone.com> wrote:
>>>>>>
>>>>>>> I actually talk about this exact thing in a blog post here:
>>>>>>> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
>>>>>>> Keep in mind, you're actually doing a ton of math. Even with
>>>>>>> proper caching and use of broadcast variables this will take a
>>>>>>> while, depending on the size of your cluster. To get real results
>>>>>>> you may want to look into locality-sensitive hashing to limit
>>>>>>> your search space, and definitely look into spinning up multiple
>>>>>>> threads to process your product features in parallel to increase
>>>>>>> resource utilization on the cluster.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Ilya Ganelin
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> *From:* afarahat [ayman.fara...@yahoo.com]
>>>>>>> *Sent:* Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
>>>>>>> *To:* user@spark.apache.org
>>>>>>> *Subject:* Matrix Multiplication and mllib.recommendation
>>>>>>>
>>>>>>> Hello,
>>>>>>> I am trying to get predictions after running the ALS model. The
>>>>>>> model works fine. For the prediction/recommendation, I have about
>>>>>>> 30,000 products and 90 million users. When I try to predict all
>>>>>>> of them, it fails.
>>>>>>> I have been trying to formulate the problem as a matrix
>>>>>>> multiplication, where I first get the product features, broadcast
>>>>>>> them, and then do a dot product. It's still very slow. Any reason
>>>>>>> why? Here is a sample code:
>>>>>>>
>>>>>>> def doMultiply(x):
>>>>>>>     a = []
>>>>>>>     # dot-product x against every broadcast product factor
>>>>>>>     mylen = len(pf.value)
>>>>>>>     for i in range(mylen):
>>>>>>>         myprod = numpy.dot(x, pf.value[i][1])
>>>>>>>         a.append(myprod)
>>>>>>>     return a
>>>>>>>
>>>>>>> myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
>>>>>>> # I need to select which products to broadcast, but let's try all
>>>>>>> m1 = myModel.productFeatures().sample(False, 0.001)
>>>>>>> pf = sc.broadcast(m1.collect())
>>>>>>> uf = myModel.userFeatures()
>>>>>>> f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))