Hi All, I am trying to broadcast a set in a PySpark script.
I create the set like this:

    Uid_male_set = set(maleUsers.map(lambda x: x[1]).collect())

Then I execute this line:

    uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
        lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_set))

which fails with:

    An error occurred while calling o104.collectPartitions.
    : org.apache.spark.SparkException: Job aborted due to stage failure:
    Serialized task 1131:0 was 23503247 bytes which exceeds
    spark.akka.frameSize (10485760 bytes). Consider using broadcast
    variables for large values.

So I tried broadcasting it:

    Uid_male_setbc = sc.broadcast(Uid_male_set)
    >>> Uid_male_setbc
    <pyspark.broadcast.Broadcast object at 0x1ba2ed0>

Then I execute this line:

    uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
        lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_setbc))

which fails with:

    File "<stdin>", line 1, in <lambda>
    TypeError: argument of type 'Broadcast' is not iterable

So I am stuck either way. The script runs well locally on a smaller dataset, but throws these errors at scale. Could anyone point out how to correct this, or where I am going wrong?

Thanks,

*Vedant Dhandhania*
*Retention Science*
call: 805.574.0873
visit: Site <http://www.retentionscience.com/> | like: Facebook <http://www.facebook.com/RetentionScience> | follow: Twitter <http://twitter.com/RetentionSci>
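[Editor's note] The second error comes from testing membership against the `Broadcast` wrapper itself rather than its payload: PySpark's `sc.broadcast(...)` returns a `Broadcast` object, and the wrapped data must be accessed through its `.value` attribute, so the lambda should use `x[0] in Uid_male_setbc.value`. The sketch below illustrates this without needing a Spark cluster, using a minimal stand-in class (not the real `pyspark.broadcast.Broadcast`) that mimics the relevant behavior:

```python
# Stand-in for pyspark.broadcast.Broadcast: a plain wrapper whose
# payload lives in the .value attribute, just like the real class.
class Broadcast:
    def __init__(self, value):
        self.value = value  # the broadcast data

Uid_male_set = {101, 202, 303}          # toy user-id set
Uid_male_setbc = Broadcast(Uid_male_set)

# Testing membership on the wrapper raises the error from the post,
# because the wrapper object is not a container:
try:
    101 in Uid_male_setbc
    raised_type_error = False
except TypeError:  # "argument of type 'Broadcast' is not iterable"
    raised_type_error = True

# The correct membership test goes through .value:
flag = 101 in Uid_male_setbc.value      # True

# So in the original script the map should read (sketch):
#   uid_iid_iscore.map(lambda x: (x[0], zip(x[1], x[2]),
#                                 x[0] in Uid_male_setbc.value))
```

This also explains why the first attempt hit the `spark.akka.frameSize` limit: referencing the large plain set in the lambda ships a copy of it with every serialized task, whereas a broadcast variable is sent to each executor once.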