I was wondering if anyone could help with this question.

On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, <dhruba.w...@gmail.com>
wrote:

> Hi,
>
> I have a question about passing a dictionary from the driver to the executors
> in Spark on YARN. The dictionary is needed inside a UDF. I am using PySpark.
>
> As I understand it, this can be done in two ways:
>
> 1. Broadcast the variable and then use it in the UDFs (a sketch of this
> variant follows the code below).
>
> 2. Pass the dictionary into the UDF itself, something like this:
>
>   from pyspark.sql.functions import udf
>
>   def udf1(col1, d):  # renamed 'dict' -> 'd' to avoid shadowing the built-in
>       return d.get(col1)  # example body: look up the column value
>
>   def udf1_fn(d):
>       return udf(lambda col_data: udf1(col_data, d))
>
>   df.withColumn("column_new", udf1_fn(d)("old_column"))
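>
> And for option 1, this is the broadcast variant I mean (a minimal sketch;
> 'spark', 'd', and the column names are placeholders, and the UDF return
> type defaults to string):
>
>   from pyspark.sql.functions import udf
>
>   # Broadcast the dictionary once from the driver; each executor fetches
>   # it at most once, and every task on it reads the same copy via .value
>   bc = spark.sparkContext.broadcast(d)
>
>   @udf  # default return type is StringType
>   def lookup(col_data):
>       return bc.value.get(col_data)
>
>   df.withColumn("column_new", lookup("old_column"))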
>
> I have tested both ways, and both work.
>
> Now I am wondering what is fundamentally different between the two. I
> understand how broadcast works, but I am not sure how the data is passed
> across in the second way. Is the dictionary shipped to each executor every
> time a new task runs on that executor, or is it shipped only once? Also,
> how is the data passed to the Python worker processes? These are Python
> UDFs, so I think they are executed natively in Python (please correct me
> if I am wrong), meaning the data has to be serialized and passed to Python.
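>
> To illustrate my assumption about the second way (not verified against the
> Spark source): the lambda closes over the dictionary, so when PySpark
> pickles the UDF the whole dictionary travels inside the serialized closure.
> Something like:
>
>   import cloudpickle  # the standalone package; PySpark vendors its own copy
>
>   d = {"a": 1, "b": 2}
>   fn = lambda x: d.get(x)
>
>   # The captured dictionary is embedded in the pickled closure, so its
>   # size directly inflates the serialized UDF payload sent with tasks.
>   payload = cloudpickle.dumps(fn)
>   print(len(payload))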
>
> So, in summary, my question is: which of the two is the better/more
> efficient way to write this, and why?
>
> Thank you!
>
> Regards,
> Dhrub
>
