Please let me know whether using any of these built-in functions helps or not.
Regards,
Gourav
On Thu, Apr 12, 2018 at 3:25 AM, surender kumar <skiit...@yahoo.co.uk.invalid>
wrote:
Thanks Matteo, this should work!
-Surender
On Thursday, 12 April, 2018, 1:13:38 PM IST, Matteo Cossu <elco...@gmail.com> wrote:
- pair each item with an index, so you have an RDD of (itemID, index)
- for each user, draw a list of random indices (e.g. with random.randint()), so you have a new
RDD (userID, [sample_indices])
- flatten all the lists in the previously created RDD and join them back with
the (itemID, index) RDD, using index as the join attribute
You can do the same thing with DataFrames using UDFs.
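The steps above can be sketched in plain Python (no Spark needed to run it). In PySpark the analogous operations would be zipWithIndex(), a map over users, flatMap(), and join(); the names `items`, `users`, and `k` below are illustrative assumptions, not from the original message:

```python
import random

# Plain-Python sketch of the index-join approach from this thread.
items = [0.5, 1.25, 3.75, 2.0, 9.5]   # stand-in for the 1M-item float list
users = ["u1", "u2", "u3"]            # stand-in for the 1M users
k = 2                                 # how many items to sample per user

# Step 1: pair each item with an index (like rdd.zipWithIndex()).
indexed = {i: item for i, item in enumerate(items)}

# Step 2: for each user, draw k random indices -> (userID, [sample_indices]).
rng = random.Random(42)
user_indices = [(u, [rng.randint(0, len(items) - 1) for _ in range(k)])
                for u in users]

# Step 3: "join" the index lists back to the items on the index attribute.
user_samples = [(u, [indexed[i] for i in idxs]) for u, idxs in user_indices]
```

The point of the join is that the full item list never has to be broadcast: in the RDD version the data stays partitioned on the executors and only the (index, userID) pairs are shuffled.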
On 11 April 2018 at 23:01, surender kumar wrote:
AM IST, Matteo Cossu <elco...@gmail.com> wrote:
Why broadcast this list then? You should use an RDD or a DataFrame. For
example, RDD has a sample() method that returns a random sample of its elements.
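For reference, RDD.sample(withReplacement=False, fraction) keeps each element independently with probability `fraction`, so the returned sample size is only approximately fraction * count. A minimal plain-Python sketch of that Bernoulli behaviour (the function name and arguments here are illustrative, not PySpark API):

```python
import random

def bernoulli_sample(data, fraction, seed=None):
    """Keep each element independently with probability `fraction`.

    Mirrors the semantics of RDD.sample(withReplacement=False, fraction):
    sampling is per-element, order is preserved, and the sample size is
    approximate rather than exact.
    """
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

# Roughly 10% of the elements, in their original order.
sampled = bernoulli_sample(list(range(1000)), 0.1, seed=7)
```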
On 11 April 2018 at 22:34, surender kumar <skiit...@yahoo.co.uk.invalid> wrote:
I'm using PySpark. I have a list of 1 million items (all float values) and 1
million users. For each user I want to randomly sample some items from the item
list. Broadcasting the item list results in an OutOfMemory error on the driver,
even after raising driver memory to 10G. I tried to persist this