You have to use `myBroadcastVariable.value` to access the broadcast value inside your lambda; see https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
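A minimal sketch of the fix, runnable without a Spark cluster: the membership test must go through `.value`, not the `Broadcast` object itself. The tiny `_FakeBroadcast` class below is only a local stand-in for illustration; in the real script, `sc.broadcast(...)` returns a `pyspark.broadcast.Broadcast` whose payload is likewise exposed as `.value`, and the list comprehension stands in for `rdd.map(...)`.

```python
class _FakeBroadcast:
    """Stand-in for pyspark.broadcast.Broadcast: exposes the payload as .value."""
    def __init__(self, value):
        self.value = value

# Toy data in place of set(maleUsers.map(lambda x: x[1]).collect())
Uid_male_set = {101, 103}
Uid_male_setbc = _FakeBroadcast(Uid_male_set)   # stands in for sc.broadcast(Uid_male_set)

# Toy rows shaped like the uid_iid_iscore records: (uid, item_ids, item_scores)
rows = [(101, [1, 2], [0.9, 0.8]),
        (102, [3],    [0.5])]

# The corrected map function: note `.value` before the `in` test.
# Without it, `x[0] in Uid_male_setbc` raises
# TypeError: argument of type 'Broadcast' is not iterable.
flag = lambda x: (x[0], list(zip(x[1], x[2])), x[0] in Uid_male_setbc.value)

result = [flag(x) for x in rows]   # uid_iid_iscore.map(flag) in Spark
# result[0] → (101, [(1, 0.9), (2, 0.8)], True)
# result[1] → (102, [(3, 0.5)], False)
```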
On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania <ved...@retentionscience.com> wrote:

> Hi All,
>
> I am trying to broadcast a set in a PySpark script.
>
> I create the set like this:
>
>     Uid_male_set = set(maleUsers.map(lambda x: x[1]).collect())
>
> Then execute this line:
>
>     uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
>         lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_set))
>
> An error occurred while calling o104.collectPartitions.
>
>     org.apache.spark.SparkException: Job aborted due to stage failure:
>     Serialized task 1131:0 was 23503247 bytes which exceeds
>     spark.akka.frameSize (10485760 bytes). Consider using broadcast
>     variables for large values.
>
> So I tried broadcasting it:
>
>     Uid_male_setbc = sc.broadcast(Uid_male_set)
>
>     >>> Uid_male_setbc
>     <pyspark.broadcast.Broadcast object at 0x1ba2ed0>
>
> Then I execute this line:
>
>     uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
>         lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_setbc))
>
>     File "<stdin>", line 1, in <lambda>
>     TypeError: argument of type 'Broadcast' is not iterable
>     [duplicate 1]
>
> So I am stuck either way: the script runs fine locally on a smaller
> dataset, but throws these errors at scale. Could anyone point out how to
> correct this, or where I am going wrong?
>
> Thanks
>
> *Vedant Dhandhania*
> *Retention Science*
> call: 805.574.0873
> visit: Site <http://www.retentionscience.com/> | like: Facebook
> <http://www.facebook.com/RetentionScience> | follow: Twitter
> <http://twitter.com/RetentionSci>