You can also select pieces of your RDD by first doing a zipWithIndex and then filtering on the index, which is the second element of each resulting pair. For example, to select the first 100 elements:

val a = rdd.zipWithIndex().filter { case (_, idx) => idx < 100 }

On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat <ayman.fara...@yahoo.com.invalid> wrote:

> How do you partition by product in Python? The only API is partitionBy(50).
>
> On Jun 18, 2015, at 8:42 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>
> Also, in my experiments it's much faster to do blocked BLAS through cartesian rather than doing sc.union. Here are the details on the experiments:
>
> https://issues.apache.org/jira/browse/SPARK-4823
>
> On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>
>> Also, I'm not sure how threading helps here, because Spark assigns a partition to each core. A core may run multiple threads if you are using Intel hyper-threading, but I would let Spark handle the threading.
>>
>> On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> We added SPARK-3066 for this. In 1.4 you should get the code to do a BLAS dgemm-based calculation.
>>>
>>> On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat <ayman.fara...@yahoo.com.invalid> wrote:
>>>
>>>> Thanks Sabarish and Nick. Would you happen to have some code snippets that you can share?
>>>> Best,
>>>> Ayman
>>>>
>>>> On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>> Nick is right. I too have implemented it this way and it works just fine. In my case there can be even more products. You simply broadcast blocks of products to userFeatures.mapPartitions() and BLAS-multiply in there to get recommendations. In my case, 10K products form one block. Note that you would then have to union your recommendations, and if there are lots of product blocks, you might also want to checkpoint once every few blocks.
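Sab's blocked approach can be sketched locally with plain NumPy. Everything below is a toy stand-in, not Spark code: `user_block` plays the role of one partition of userFeatures, `product_factors` the full product-factor set, and `block_size` the per-block product count (Sab uses ~10K); the names and sizes are hypothetical.

```python
import numpy as np

# Toy data: 2 users and 6 products with rank-2 factors.
user_block = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
product_factors = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                            [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
block_size = 3  # stand-in for a 10K-product block

partial = []
for start in range(0, len(product_factors), block_size):
    block = product_factors[start:start + block_size]
    # Level-3 BLAS: (users x rank) @ (rank x block) in one call per block.
    partial.append(user_block @ block.T)

# "Union" the per-block results back into one score matrix.
all_scores = np.hstack(partial)
```

In Spark the loop body would run inside mapPartitions against each broadcast product block, and the union step would be an RDD union of the per-block recommendations rather than an hstack.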
>>>>
>>>> Regards,
>>>> Sab
>>>>
>>>> On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>>
>>>>> One issue is that you broadcast the product vectors and then do a dot product one-by-one with the user vector.
>>>>>
>>>>> You should try forming a matrix of the item vectors and doing the dot product as a matrix-vector multiply, which will make things a lot faster.
>>>>>
>>>>> Another optimisation that is available in 1.4 is a recommendProducts method that blockifies the factors to make use of level 3 BLAS (i.e. matrix-matrix multiply). I am not sure if this is available in the Python API yet.
>>>>>
>>>>> But you can do a version yourself by using mapPartitions over the user factors, blocking the factors into sub-matrices and doing a matrix multiply with the item factor matrix to get scores on a block-by-block basis.
>>>>>
>>>>> Also, as Ilya says, more parallelism can help. I don't think it's necessary to do LSH with only 30,000 items.
>>>>>
>>>>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:
>>>>>
>>>>>> I actually talk about this exact thing in a blog post here:
>>>>>> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
>>>>>> Keep in mind, you're actually doing a ton of math. Even with proper caching and use of broadcast variables, this will take a while depending on the size of your cluster. To get real results you may want to look into locality-sensitive hashing to limit your search space, and definitely look into spinning up multiple threads to process your product features in parallel to increase resource utilization on the cluster.
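Nick's first point (a single matrix-vector multiply instead of one dot product per item) can be illustrated with a small NumPy sketch; the vectors below are made-up toy data:

```python
import numpy as np

# A single user vector and the full stack of item vectors (toy rank-2 data).
user = np.array([1.0, 2.0])
item_matrix = np.array([[3.0, 0.0],
                        [0.0, 4.0],
                        [1.0, 1.0]])

# What the broadcast-and-loop version computes: one np.dot call per item.
scores_loop = np.array([np.dot(user, item) for item in item_matrix])

# The same scores as one matrix-vector multiply (a single BLAS gemv call).
scores_gemv = item_matrix @ user
```

Both give identical scores; the second replaces N Python-level dot calls with one library call, which is where the speedup comes from.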
>>>>>>
>>>>>> Thank you,
>>>>>> Ilya Ganelin
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: afarahat [ayman.fara...@yahoo.com]
>>>>>> Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
>>>>>> To: user@spark.apache.org
>>>>>> Subject: Matrix Multiplication and mllib.recommendation
>>>>>>
>>>>>> Hello,
>>>>>> I am trying to get predictions after running the ALS model. The model works fine. For the prediction/recommendation, I have about 30,000 products and 90 million users. When I try predictAll it fails. I have been trying to formulate the problem as a matrix multiplication, where I first get the product features, broadcast them, and then do a dot product. It's still very slow. Any reason why? Here is a sample code:
>>>>>>
>>>>>> def doMultiply(x):
>>>>>>     a = []
>>>>>>     # dot the user vector with each broadcast product vector
>>>>>>     mylen = len(pf.value)
>>>>>>     for i in range(mylen):
>>>>>>         myprod = numpy.dot(x, pf.value[i][1])
>>>>>>         a.append(myprod)
>>>>>>     return a
>>>>>>
>>>>>> myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
>>>>>> # I need to select which products to broadcast, but let's try all
>>>>>> m1 = myModel.productFeatures().sample(False, 0.001)
>>>>>> pf = sc.broadcast(m1.collect())
>>>>>> uf = myModel.userFeatures()
>>>>>> f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))
>>>>>>
>>>>>> --
>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
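Applying Nick's advice to the doMultiply above: build the item matrix from the broadcast value once, so each user costs a single matrix-vector multiply rather than a Python loop. The sketch below simulates pf.value locally; the product IDs and factor values are made-up toy data.

```python
import numpy as np

# Hypothetical stand-in for pf.value: the list of (productId, factors)
# pairs that productFeatures().collect() would return.
pf_value = [(101, [3.0, 0.0]),
            (102, [0.0, 4.0]),
            (103, [1.0, 1.0])]

# Build the item matrix once, up front...
item_ids = [pid for pid, _ in pf_value]
item_matrix = np.array([vec for _, vec in pf_value])

def do_multiply(user_vector):
    # ...so each user is one matrix-vector multiply, not a per-item loop.
    return item_matrix @ np.asarray(user_vector)

scores = do_multiply([1.0, 2.0])  # one score per product, in item_ids order
```

In the real job, item_ids and item_matrix would be computed once per partition inside mapPartitions over userFeatures, using the broadcast pf.value.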
>>>>
>>>> --
>>>> Architect - Big Data
>>>> Ph: +91 99805 99458
>>>>
>>>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan India ICT)