Oops - code should be :

val a = rdd.zipWithIndex().filter(s => s._2 < 100)
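
For PySpark users, the same zipWithIndex-then-filter idea can be sketched in plain Python. This is only a local illustration: `zip_with_index` and `select_index_range` are hypothetical helper names, the list stands in for an RDD, and indices are assumed to start at 0, as they do with zipWithIndex.

```python
def zip_with_index(items):
    """Local stand-in for RDD.zipWithIndex(): (element, index) pairs, 0-based."""
    return [(x, i) for i, x in enumerate(items)]


def select_index_range(items, start, stop):
    """Keep the elements whose index i satisfies start <= i < stop."""
    return [x for x, i in zip_with_index(items) if start <= i < stop]


data = list(range(1000, 2000))  # hypothetical sample data
first_100 = select_index_range(data, 0, 100)

# On a real RDD the equivalent call chain would be (not executed here):
#   rdd.zipWithIndex().filter(lambda s: 0 <= s[1] < 100).map(lambda s: s[0])
```

Python accepts chained comparisons such as `0 <= i < 100`; Scala does not, so the Scala version needs a single comparison or two joined with `&&`.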

On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin <ilgan...@gmail.com> wrote:

> You can also select pieces of your RDD by first doing a zipWithIndex and
> then filtering on the index, which is the second element of each pair.
>
> For example to select the first 100 elements :
>
> Val a = rdd.zipWithIndex().filter(s => 1 < s < 100)
>
> On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat
> <ayman.fara...@yahoo.com.invalid> wrote:
>
>> How do you partition by product in Python?
>> The only API I can find is partitionBy(50).
>>
>> On Jun 18, 2015, at 8:42 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>> Also in my experiments, it's much faster to do blocked BLAS through
>> cartesian than through sc.union. Here are the details on the
>> experiments:
>>
>> https://issues.apache.org/jira/browse/SPARK-4823
>>
>> On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Also, I'm not sure how threading helps here, because Spark assigns a
>>> partition to each core. Each core may run multiple threads if you are using
>>> Intel Hyper-Threading, but I would let Spark handle the threading.
>>>
>>> On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS
>>>> dgemm-based calculations.
>>>>
>>>> On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat <
>>>> ayman.fara...@yahoo.com.invalid> wrote:
>>>>
>>>>> Thanks Sabarish and Nick
>>>>> Would you happen to have some code snippets that you can share?
>>>>> Best
>>>>> Ayman
>>>>>
>>>>> On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan <
>>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>>
>>>>> Nick is right. I too have implemented it this way and it works just fine.
>>>>> In my case, there can be even more products. You simply broadcast blocks
>>>>> of products to userFeatures.mapPartitions() and BLAS multiply in there to
>>>>> get recommendations. In my case 10K products form one block. Note that you
>>>>> would then have to union your recommendations. And if there are lots of
>>>>> product blocks, you might also want to checkpoint once every few unions.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>>
>>>>> On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> One issue is that you broadcast the product vectors and then do a dot
>>>>>> product one-by-one with the user vector.
>>>>>>
>>>>>> You should try forming a matrix of the item vectors and doing the dot
>>>>>> product as a matrix-vector multiply which will make things a lot faster.
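
A minimal NumPy sketch of the matrix-vector idea described above. The `product_features` list is made-up data standing in for the collected (productId, features) pairs, and `score_user` is a hypothetical helper name, not part of any Spark API.

```python
import numpy as np

# Hypothetical stand-in for the collected (productId, features) pairs.
product_features = [(10, [1.0, 0.0]), (11, [0.5, 0.5]), (12, [0.0, 2.0])]

# Stack the item vectors once into a (num_items, rank) matrix.
product_ids = [pid for pid, _ in product_features]
item_matrix = np.array([f for _, f in product_features], dtype=np.float64)


def score_user(user_vector):
    """Score one user against every item with a single matrix-vector product."""
    return item_matrix.dot(np.asarray(user_vector, dtype=np.float64))


scores = score_user([2.0, 1.0])  # scores[j] belongs to product_ids[j]
```

The matrix is built once rather than per user, so each user costs one call into NumPy instead of one `numpy.dot` per product.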
>>>>>>
>>>>>> Another optimisation that is available in 1.4 is a
>>>>>> recommendProductsForUsers method that blockifies the factors to make use
>>>>>> of level 3 BLAS (i.e. matrix-matrix multiply). I am not sure if this is
>>>>>> available in the Python API yet.
>>>>>>
>>>>>> But you can do a version yourself by using mapPartitions over user
>>>>>> factors, blocking the factors into sub-matrices and doing matrix multiply
>>>>>> with item factor matrix to get scores on a block-by-block basis.
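
The do-it-yourself version described above might look roughly like the following. This is a sketch under assumptions, not the MLlib implementation: `score_partition` and `_score_block` are hypothetical names, the factors are made-up, and the item matrix is assumed small enough to broadcast whole.

```python
import numpy as np


def score_partition(user_rows, item_matrix, block_size=1024):
    """Generator usable with mapPartitions: consumes (user_id, factors) rows
    and yields (user_id, scores), multiplying block_size users at a time."""
    ids, vecs = [], []
    for user_id, factors in user_rows:
        ids.append(user_id)
        vecs.append(factors)
        if len(ids) == block_size:
            yield from _score_block(ids, vecs, item_matrix)
            ids, vecs = [], []
    if ids:  # leftover partial block
        yield from _score_block(ids, vecs, item_matrix)


def _score_block(ids, vecs, item_matrix):
    user_block = np.asarray(vecs, dtype=np.float64)  # (block, rank)
    scores = user_block.dot(item_matrix.T)           # (block, num_items)
    return zip(ids, scores)


# Local check with made-up factors (rank 2, 3 items, 5 users).
item_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
users = [(u, [float(u), 1.0]) for u in range(5)]
results = dict(score_partition(iter(users), item_matrix, block_size=2))

# On a cluster, the intended wiring would be roughly (not executed here):
#   bc = sc.broadcast(item_matrix)
#   uf.mapPartitions(lambda rows: score_partition(rows, bc.value))
```

With tens of thousands of items per user, yielding full score rows is expensive; in practice one would probably reduce each row to its top few items before yielding.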
>>>>>>
>>>>>> Also as Ilya says more parallelism can help. I don't think it's so
>>>>>> necessary to do LSH with 30,000 items.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya <
>>>>>> ilya.gane...@capitalone.com> wrote:
>>>>>>
>>>>>>> I actually talk about this exact thing in a blog post here:
>>>>>>> http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
>>>>>>> Keep in mind, you're actually doing a ton of math. Even with proper
>>>>>>> caching and use of broadcast variables this will take a while, depending
>>>>>>> on the size of your cluster. To get real results you may want to look into
>>>>>>> locality-sensitive hashing to limit your search space, and definitely look
>>>>>>> into spinning up multiple threads to process your product features in
>>>>>>> parallel to increase resource utilization on the cluster.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Ilya Ganelin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> *From: *afarahat [ayman.fara...@yahoo.com]
>>>>>>> *Sent: *Wednesday, June 17, 2015 11:16 PM Eastern Standard Time
>>>>>>> *To: *user@spark.apache.org
>>>>>>> *Subject: *Matrix Multiplication and mllib.recommendation
>>>>>>>
>>>>>>> Hello;
>>>>>>> I am trying to get predictions after running the ALS model.
>>>>>>> The model works fine. In the prediction/recommendation step, I have
>>>>>>> about 30,000 products and 90 million users.
>>>>>>> When I try predictAll, it fails.
>>>>>>> I have been trying to formulate the problem as a matrix multiplication,
>>>>>>> where I first get the product features, broadcast them, and then do a dot
>>>>>>> product.
>>>>>>> It's still very slow. Any reason why?
>>>>>>> Here is some sample code:
>>>>>>>
>>>>>>> import numpy
>>>>>>> from pyspark.mllib.recommendation import MatrixFactorizationModel
>>>>>>>
>>>>>>> def doMultiply(x):
>>>>>>>     # Dot the user vector x with each broadcast product vector in turn.
>>>>>>>     a = []
>>>>>>>     mylen = len(pf.value)
>>>>>>>     for i in range(mylen):
>>>>>>>         # pf.value[i] is a (productId, features) pair
>>>>>>>         myprod = numpy.dot(x, pf.value[i][1])
>>>>>>>         a.append(myprod)
>>>>>>>     return a
>>>>>>>
>>>>>>>
>>>>>>> # sc is the active SparkContext
>>>>>>> myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
>>>>>>> # I need to select which products to broadcast; for now, use a 0.1% sample
>>>>>>> m1 = myModel.productFeatures().sample(False, 0.001)
>>>>>>> pf = sc.broadcast(m1.collect())
>>>>>>> uf = myModel.userFeatures()
>>>>>>> f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com <http://nabble.com/>.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
