Hi All, I am trying to broadcast a set in a PySpark script.
I create the set like this:

    Uid_male_set = set(maleUsers.map(lambda x: x[1]).collect())

Then I execute this line:

    uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
        lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_set))

which fails with:

    An error occurred while calling o104.collectPartitions.
    : org.apache.spark.SparkException: Job aborted due to stage failure:
    Serialized task 1131:0 was 23503247 bytes which exceeds
    spark.akka.frameSize (10485760 bytes). Consider using broadcast
    variables for large values.

So I tried broadcasting it:

    Uid_male_setbc = sc.broadcast(Uid_male_set)
    >>> Uid_male_setbc
    <pyspark.broadcast.Broadcast object at 0x1ba2ed0>

Then I execute this line:

    uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
        lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_setbc))

which fails with:

    File "<stdin>", line 1, in <lambda>
    TypeError: argument of type 'Broadcast' is not iterable

So I am stuck either way. The script runs well locally on a smaller dataset, but throws these errors at scale. Could anyone point out how to correct this, or where I am going wrong?

Thanks,

*Vedant Dhandhania*
*Retention Science*
call: 805.574.0873
visit: Site <http://www.retentionscience.com/> | like: Facebook <http://www.facebook.com/RetentionScience> | follow: Twitter <http://twitter.com/RetentionSci>
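[Editor's note] The second error comes from testing membership against the `Broadcast` wrapper itself rather than its payload: PySpark's `sc.broadcast(...)` returns a `Broadcast` object, and the wrapped data must be accessed through its `.value` attribute, so the lambda should use `x[0] in Uid_male_setbc.value`. The sketch below illustrates this without needing a Spark cluster, using a minimal stand-in class (not the real `pyspark.broadcast.Broadcast`) that mimics the relevant behavior:

```python
# Stand-in for pyspark.broadcast.Broadcast: a plain wrapper whose
# payload lives in the .value attribute, just like the real class.
class Broadcast:
    def __init__(self, value):
        self.value = value  # the broadcast data

Uid_male_set = {101, 202, 303}          # toy user-id set
Uid_male_setbc = Broadcast(Uid_male_set)

# Testing membership on the wrapper raises the error from the post,
# because the wrapper object is not a container:
try:
    101 in Uid_male_setbc
    raised_type_error = False
except TypeError:  # "argument of type 'Broadcast' is not iterable"
    raised_type_error = True

# The correct membership test goes through .value:
flag = 101 in Uid_male_setbc.value      # True

# So in the original script the map should read (sketch):
#   uid_iid_iscore.map(lambda x: (x[0], zip(x[1], x[2]),
#                                 x[0] in Uid_male_setbc.value))
```

This also explains why the first attempt hit the `spark.akka.frameSize` limit: referencing the large plain set in the lambda ships a copy of it with every serialized task, whereas a broadcast variable is sent to each executor once.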