Thanks Ilya. Is there an advantage to, say, partitioning by users/products when you train? Here are the two alternatives I have:
# Partition by user or product
tot = newrdd.map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), l[4]))).partitionBy(50).cache()
ratings = tot.values()
model = ALS.train(ratings, rank, numIterations)

# Use zipWithIndex
tot = newrdd.map(lambda l: (l[1], Rating(int(l[1]), int(l[2]), l[4])))
bob = tot.zipWithIndex().map(lambda x: (x[1], x[0])).partitionBy(30)
ratings = bob.values()
model = ALS.train(ratings, rank, numIterations)

On Jun 28, 2015, at 8:24 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:

> You can also select pieces of your RDD by first doing a zipWithIndex and then
> doing a filter operation on the second element of the RDD.
>
> For example, to select the first 100 elements:
>
> val a = rdd.zipWithIndex().filter(s => s._2 < 100)
>
> On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat
> <ayman.fara...@yahoo.com.invalid> wrote:
> How do you partition by product in Python?
> The only API is partitionBy(50).
>
> On Jun 18, 2015, at 8:42 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>
>> Also, in my experiments it's much faster to do blocked BLAS through cartesian
>> rather than doing sc.union. Here are the details on the experiments:
>>
>> https://issues.apache.org/jira/browse/SPARK-4823
>>
>> On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>> Also, I'm not sure how threading helps here, because Spark assigns a partition
>> to each core. Each core may run multiple hardware threads if you are using
>> Intel hyper-threading, but I would let Spark handle the threading.
>>
>> On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>> We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
>> dgemm-based calculation.
>>
>> On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat
>> <ayman.fara...@yahoo.com.invalid> wrote:
>> Thanks Sabarish and Nick.
>> Would you happen to have some code snippets that you can share?
>> Best
>> Ayman
>>
>> On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan
>> <sabarish.sasidha...@manthan.com> wrote:
>>
>>> Nick is right. I too have implemented it this way and it works just fine. In
>>> my case, there can be even more products. You simply broadcast blocks of
>>> products to userFeatures.mapPartitions() and BLAS-multiply in there to get
>>> recommendations. In my case, 10K products form one block. Note that you
>>> would then have to union your recommendations. And if there are lots of
>>> product blocks, you might also want to checkpoint once every few blocks.
>>>
>>> Regards
>>> Sab
>>>
>>> On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>> One issue is that you broadcast the product vectors and then do a dot
>>> product one by one with the user vector.
>>>
>>> You should try forming a matrix of the item vectors and doing the dot
>>> product as a matrix-vector multiply, which will make things a lot faster.
>>>
>>> Another optimisation that is available in 1.4 is a recommendProducts
>>> method that blockifies the factors to make use of level-3 BLAS (i.e.
>>> matrix-matrix multiply). I am not sure if this is available in the Python
>>> API yet.
>>>
>>> But you can do a version yourself by using mapPartitions over user factors,
>>> blocking the factors into sub-matrices and doing a matrix multiply with the
>>> item factor matrix to get scores on a block-by-block basis.
>>>
>>> Also, as Ilya says, more parallelism can help. I don't think it's
>>> necessary to do LSH with 30,000 items.
>>>
>>> —
>>> Sent from Mailbox
>>>
>>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya
>>> <ilya.gane...@capitalone.com> wrote:
>>>
>>> I actually talk about this exact thing in a blog post here:
>>> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/.
>>> Keep in mind, you're actually doing a ton of math.
>>> Even with proper caching and use of broadcast variables, this will take a
>>> while depending on the size of your cluster. To get real results you may
>>> want to look into locality-sensitive hashing to limit your search space,
>>> and definitely look into spinning up multiple threads to process your
>>> product features in parallel to increase resource utilization on the
>>> cluster.
>>>
>>> Thank you,
>>> Ilya Ganelin
>>>
>>> -----Original Message-----
>>> From: afarahat [ayman.fara...@yahoo.com]
>>> Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
>>> To: user@spark.apache.org
>>> Subject: Matrix Multiplication and mllib.recommendation
>>>
>>> Hello;
>>> I am trying to get predictions after running the ALS model.
>>> The model works fine. In the prediction/recommendation step, I have about
>>> 30,000 products and 90 million users.
>>> When I try predictAll, it fails.
>>> I have been trying to formulate the problem as a matrix multiplication,
>>> where I first get the product features, broadcast them, and then do a dot
>>> product. It's still very slow. Any idea why?
>>> Here is a sample of the code:
>>>
>>> def doMultiply(x):
>>>     a = []
>>>     # multiply the user vector by each broadcast product factor
>>>     mylen = len(pf.value)
>>>     for i in range(mylen):
>>>         myprod = numpy.dot(x, pf.value[i][1])
>>>         a.append(myprod)
>>>     return a
>>>
>>> myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
>>> # I need to select which products to broadcast, but let's try all
>>> m1 = myModel.productFeatures().sample(False, 0.001)
>>> pf = sc.broadcast(m1.collect())
>>> uf = myModel.userFeatures()
>>> f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
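[Editor's note: a minimal, self-contained sketch of the matrix-vector approach Nick suggests, in plain numpy. The `do_multiply_block` helper and the toy factors are made up for illustration; they are not code from the thread.]

```python
import numpy as np

def do_multiply_block(user_vec, product_block):
    # product_block: list of (product_id, factor_array) pairs, the same
    # shape as pf.value in the snippet above. Stacking the factors into
    # one matrix replaces the per-product Python loop with a single
    # matrix-vector multiply.
    ids = [p[0] for p in product_block]
    mat = np.array([p[1] for p in product_block])  # (n_products, rank)
    scores = mat.dot(np.asarray(user_vec))         # (n_products,)
    return [(i, float(s)) for i, s in zip(ids, scores)]

# toy rank-2 factors to check the shapes
block = [(101, [1.0, 0.0]), (102, [0.0, 2.0])]
print(do_multiply_block([3.0, 4.0], block))  # [(101, 3.0), (102, 8.0)]
```

In the broadcast-based snippet above, this would stand in for `doMultiply`, with the stacking done once per partition rather than once per user for best effect.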
>>>
>>> --
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>>
>>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan India ICT)
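[Editor's note: for completeness, a hedged sketch of the blockified scoring that Sabarish and Nick describe, written as a plain Python generator so it could be handed to `userFeatures().mapPartitions(...)`. The function name, block size, and toy data are illustrative assumptions, not code from the thread.]

```python
import numpy as np

def score_partition(user_rows, item_mat, block_size=2):
    # user_rows: iterable of (user_id, factor_array), as one partition of
    # userFeatures() would yield; item_mat: (n_items, rank) ndarray built
    # from the broadcast product factors. Users are stacked into
    # sub-matrices so each block is scored with one matrix-matrix
    # multiply (level-3 BLAS) instead of one dot product per user.
    rows = list(user_rows)
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        ids = [u[0] for u in block]
        U = np.array([u[1] for u in block])   # (block_size, rank)
        S = U.dot(item_mat.T)                 # (block_size, n_items)
        for uid, scores in zip(ids, S):
            yield uid, scores.tolist()

# toy check: 3 users, 3 items, rank 2
items = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
users = [(1, [2.0, 3.0]), (2, [1.0, 1.0]), (3, [0.0, 5.0])]
scored = dict(score_partition(users, items))
print(scored[1])  # [2.0, 3.0, 5.0]
```

On a real cluster this would be wired up roughly as `uf.mapPartitions(lambda part: score_partition(part, pf_mat.value))`, where `pf_mat` is a broadcast of the stacked product-factor matrix (again an assumed variable name, mirroring the `pf`/`uf` snippet above).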