I'm using PySpark. I have a list of 1 million items (all float values) and 1 million users, and for each user I want to randomly sample some items from the item list. Broadcasting the item list results in an OutOfMemory error on the driver, even after raising driver memory to 10G. I also tried persisting the array to disk, but I can't figure out how to read it back on the workers. Any suggestions would be appreciated.
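One way to sketch this (not from the thread, just an illustration under assumptions): give each user a deterministic, seeded sample of item *indices* so every worker can reproduce its samples without the driver shipping a large Python structure, and broadcast the item values as a compact NumPy array (1M float64 values is only about 8 MB, far smaller than a Python list of floats). The file paths, `k`, and helper names below are hypothetical.

```python
import random

def sample_item_indices(user_id, n_items, k, seed=0):
    """Deterministically sample k distinct item indices for one user.

    Seeding the RNG from (seed, user_id) means any worker can
    reproduce the same sample independently, so the driver never has
    to compute or ship per-user samples.
    """
    rng = random.Random(hash((seed, user_id)))
    return rng.sample(range(n_items), k)

if __name__ == "__main__":
    # Hypothetical Spark driver code; assumes items are stored one
    # float per line in "items.txt" on storage the workers can reach.
    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # A NumPy float64 array is ~8 bytes per item, so 1M items is
    # roughly 8 MB -- usually small enough to broadcast safely.
    items = np.loadtxt("items.txt")
    b_items = sc.broadcast(items)

    n_items, k = len(items), 10
    users = sc.range(1_000_000)
    sampled = users.map(
        lambda u: (u, [float(b_items.value[i])
                       for i in sample_item_indices(u, n_items, k)])
    )
    sampled.saveAsTextFile("sampled_items")  # hypothetical output path
```

If even the compact array is too large to broadcast, the same seeded-index trick still works: have each partition load the item array itself (e.g. via `numpy.load` inside `mapPartitions`) so the data never flows through the driver.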
- Broadcasting huge array or persisting on HDFS to read on e... surender kumar
- Re: Broadcasting huge array or persisting on HDFS to read on e... Matteo Cossu
- Re: Broadcasting huge array or persisting on HDFS to read on e... surender kumar
- Re: Broadcasting huge array or persisting on HDFS to read on e... Matteo Cossu
- Re: Broadcasting huge array or persisting on HDFS to read on e... surender kumar
- Re: Broadcasting huge array or persisting on HDFS to read on e... Gourav Sengupta
- Re: Broadcasting huge array or persisting on HDFS to read on e... surender kumar