A while ago we changed it so that the task itself gets broadcast too, so I
think the two approaches are fairly similar.
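
For reference, here is a minimal sketch of the two variants being compared. It assumes a local SparkSession and a small made-up dictionary; the names lookup, broadcast_variant, closure_variant and run_demo are hypothetical and do not come from the original thread. PySpark is imported inside the functions only so that the plain-Python helper can be exercised without a Spark installation.

```python
def lookup(value, mapping):
    """Plain-Python lookup used by both UDF variants (hypothetical helper)."""
    return mapping.get(value, "unknown")


def broadcast_variant(spark, df, mapping):
    """Variant 1: ship the dictionary as an explicit broadcast variable."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    bc = spark.sparkContext.broadcast(mapping)
    # The UDF captures only the broadcast handle; bc.value is read in the worker.
    lookup_udf = udf(lambda v: lookup(v, bc.value), StringType())
    return df.withColumn("column_new", lookup_udf("old_column"))


def closure_variant(df, mapping):
    """Variant 2: let the UDF's closure capture the dictionary directly."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # The dictionary is referenced by the lambda and travels with it.
    lookup_udf = udf(lambda v: lookup(v, mapping), StringType())
    return df.withColumn("column_new", lookup_udf("old_column"))


def run_demo():
    """Build a tiny DataFrame and run both variants (requires PySpark)."""
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.master("local[1]")
             .appName("dict-udf-sketch").getOrCreate())
    df = spark.createDataFrame([("a",), ("b",), ("z",)], ["old_column"])
    mapping = {"a": "alpha", "b": "beta"}  # made-up sample data
    broadcast_variant(spark, df, mapping).show()
    closure_variant(df, mapping).show()
    spark.stop()
```

Calling run_demo() on a machine with PySpark installed should show the same column_new values for both variants; the sketch says nothing about their relative cost, which would need to be measured on a real cluster.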
On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com >
wrote:
>
> I was wondering if anyone could help with this question.
>
> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, < dhruba.w...@gmail.com >
> wrote:
>
>
>> Hi,
>>
>>
>> I have a question regarding passing a dictionary from the driver to the
>> executors in Spark on YARN. This dictionary is needed in a UDF. I am using
>> PySpark.
>>
>>
>> As I understand this can be passed in two ways:
>>
>>
>> 1. Broadcast the variable and then use it in the UDFs
>>
>>
>> 2. Pass the dictionary into the UDF itself, with something like this:
>>
>>
>> def udf1(col1, dict):
>>     ..
>> def udf1_fn(dict):
>>     return udf(lambda col_data: udf1(col_data, dict))
>>
>>
>> df.withColumn("column_new", udf1_fn(dict)("old_column"))
>>
>>
>> I have tested both approaches, and both work.
>>
>>
>> Now I am wondering what is fundamentally different between the two. I
>> understand how broadcast works, but I am not sure how the data is passed
>> across in the 2nd way. Is the dictionary passed to each executor every
>> time a new task runs on that executor, or is it passed only once? Also,
>> how is the data passed to the Python processes? They are Python UDFs, so
>> I think they are executed natively in Python (please correct me if I am
>> wrong), so the data will be serialised and passed to Python.
>>
>> So in summary, my question is: which is the better, more efficient way to
>> write the whole thing, and why?
>>
>>
>> Thank you!
>>
>>
>> Regards,
>> Dhrub
>>
>
>
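
On the question of how the dictionary travels in the second approach: the wrapper returns a lambda whose closure holds a reference to the dictionary, and as far as I know PySpark pickles the function together with its closure when it ships the UDF. The plain-Python sketch below needs no Spark. The body of udf1, the name make_fn and the sample data are made up, since the thread elides them. It shows only the capture step, not Spark's serialization itself.

```python
def udf1(col_data, mapping):
    # Stand-in for the real per-row logic, which the thread elides.
    return mapping.get(col_data)


def make_fn(mapping):
    # Same shape as the wrapper in the question, minus the Spark udf() call.
    return lambda col_data: udf1(col_data, mapping)


mapping = {"a": 1, "b": 2}  # made-up sample data
fn = make_fn(mapping)

# Pair each free variable of the lambda with the object bound to it.
captured = dict(zip(fn.__code__.co_freevars,
                    (cell.cell_contents for cell in fn.__closure__)))
print(captured)
```

The only free variable is the dictionary, and the closure refers to the very same object that was passed in, so anything that serializes the function with its closure has to serialize the dictionary as well.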