Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

surender kumar Wed, 11 Apr 2018 14:02:32 -0700

Right, this is what I did when I said I tried to persist the array and create 
an RDD out of it to sample from. But how do I do that for each user? You have 
one RDD of users on one hand and an RDD of items on the other. How do I go on 
from here? Am I missing something trivial?
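
One possible way to connect the two RDDs, as a minimal sketch and not a tested 
solution for this cluster: draw k random item positions per user on the 
executors, then join those positions against the items RDD keyed by position. 
The names sample_indices, sample_items_per_user, users_rdd, items_rdd, k and 
seed are hypothetical; the sketch assumes users_rdd holds user ids and 
items_rdd holds the float items. Only standard RDD methods (count, 
zipWithIndex, flatMap, join, groupByKey, mapValues) are used.

```python
import random


def sample_indices(user_id, n_items, k, seed=0):
    """Return k distinct item positions in [0, n_items) for one user.

    Seeding with the user id makes the draw reproducible per user.
    """
    rng = random.Random(f"{seed}-{user_id}")
    return rng.sample(range(n_items), k)


def sample_items_per_user(users_rdd, items_rdd, k, seed=0):
    """Return an RDD of (user_id, [item, ...]) with k sampled items each.

    Assumption: users_rdd is an RDD of user ids, items_rdd an RDD of floats.
    """
    n_items = items_rdd.count()

    # zipWithIndex yields (item, position); flip it to (position, item).
    indexed_items = items_rdd.zipWithIndex().map(lambda p: (p[1], p[0]))

    # One (position, user_id) record per sampled position.
    wanted = users_rdd.flatMap(
        lambda u: [(i, u) for i in sample_indices(u, n_items, k, seed)]
    )

    return (
        wanted.join(indexed_items)   # (position, (user_id, item))
        .map(lambda kv: kv[1])       # (user_id, item)
        .groupByKey()
        .mapValues(list)
    )
```

With this layout the item list is never collected on the driver or 
broadcast; the cost is a shuffle of (number of users x k) small records for 
the join.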


    On Thursday, 12 April, 2018, 2:10:51 AM IST, Matteo Cossu 
<elco...@gmail.com> wrote:  
 
 Why broadcast this list, then? You should use an RDD or DataFrame. For 
example, RDD has a method sample() that returns a random sample from it.

On 11 April 2018 at 22:34, surender kumar <skiit...@yahoo.co.uk.invalid> wrote:

I'm using PySpark. I have a list of 1 million items (all float values) and 1 
million users. For each user I want to randomly sample some items from the item 
list. Broadcasting the item list results in an OutOfMemory error on the driver; 
I tried setting driver memory up to 10G. I tried to persist this array on disk, 
but I'm not able to figure out a way to read it on the workers.
Any suggestions would be appreciated.
