You have to use `myBroadcastVariable.value` to access the broadcasted
value; see
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
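For example, using the names from your snippet (an untested sketch of the fix, not your exact script):

Uid_male_set = set(maleUsers.map(lambda x: x[1]).collect())
Uid_male_setbc = sc.broadcast(Uid_male_set)

# Inside the closure, read the broadcast contents via .value rather than
# the Broadcast wrapper object itself:
uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(
    lambda x: (x[0], zip(x[1], x[2]), x[0] in Uid_male_setbc.value))

The membership test then runs against the set the broadcast wraps; testing `x[0] in Uid_male_setbc` checks the Broadcast object itself, which is what raises the "argument of type 'Broadcast' is not iterable" TypeError you saw.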


On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania <
ved...@retentionscience.com> wrote:

> Hi All,
>
> I am trying to broadcast a set in a PySpark script.
>
> I create the set like this:
>
> Uid_male_set = set(maleUsers.map(lambda x:x[1]).collect())
>
>
> Then execute this line:
>
>
> uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(lambda
> x:(x[0],zip(x[1],x[2]),x[0] in Uid_male_set))
>
>
>  An error occurred while calling o104.collectPartitions.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Serialized task 1131:0 was 23503247 bytes which exceeds
> spark.akka.frameSize (10485760 bytes). Consider using broadcast variables
> for large values.
>
>
>
> So I tried broadcasting it:
>
> Uid_male_setbc = sc.broadcast(Uid_male_set)
>
>
> >>> Uid_male_setbc
>
> <pyspark.broadcast.Broadcast object at 0x1ba2ed0>
>
>
> Then I execute this line:
>
>
> uid_iid_iscore_tuple_GenderFlag = uid_iid_iscore.map(lambda
> x:(x[0],zip(x[1],x[2]),x[0] in Uid_male_setbc))
>
>   File "<stdin>", line 1, in <lambda>
>
> TypeError: argument of type 'Broadcast' is not iterable
>
>  [duplicate 1]
>
>
> So I am stuck either way: the script runs fine locally on a smaller
> dataset, but throws this error at scale. Could anyone point out how to
> correct this, or where I am going wrong?
>
> Thanks
>
>
> *Vedant Dhandhania*
>
> *Retention Science*
>
> call: 805.574.0873
>
> visit: Site <http://www.retentionscience.com/> | like: Facebook
> <http://www.facebook.com/RetentionScience> | follow: Twitter
> <http://twitter.com/RetentionSci>
>
