Re: Collections passed from driver to executors

2019-09-23 Thread Reynold Xin
A while ago we changed it so the task gets broadcasted too, so I think the two 
are fairly similar.
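For intuition on why the two end up similar: in the 2nd approach the dictionary travels inside the pickled closure of the UDF (PySpark serializes UDFs with cloudpickle), and that serialized payload is shipped with the task, which is itself broadcast. A minimal standalone sketch, no Spark required; the dictionary and names here are illustrative:

    import cloudpickle

    lookup = {"a": 1, "b": 2}  # illustrative dictionary

    def make_udf_fn(d):
        # d is captured in the closure of the returned function
        return lambda col_data: d.get(col_data)

    fn = make_udf_fn(lookup)
    payload = cloudpickle.dumps(fn)  # roughly what gets shipped with the task
    print(len(payload))              # grows with the size of the dictionary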

On Mon, Sep 23, 2019 at 8:17 PM, Dhrubajyoti Hati < dhruba.w...@gmail.com > 
wrote:

> 
> I was wondering if anyone could help with this question.
> 
> On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, < dhruba.w...@gmail.com >
> wrote:
> 
> 
>> Hi,
>> 
>> 
>> I have a question regarding passing a dictionary from the driver to the
>> executors in Spark on YARN. This dictionary is needed in a UDF. I am using
>> PySpark.
>> 
>> 
>> As I understand this can be passed in two ways:
>> 
>> 
>> 1. Broadcast the variable and then use it in the UDFs
>> 
>> 
>> 2. Pass the dictionary into the UDF itself, something like this:
>> 
>> 
>>   def udf1(col1, dict):
>>     ..
>>   def udf1_fn(dict):
>>     return udf(lambda col_data: udf1(col_data, dict))
>> 
>> 
>>   df.withColumn("column_new", udf1_fn(dict)("old_column"))
>> 
>> 
>> Well, I have tested both approaches and both work.
>> 
>> 
>> Now I am wondering what is fundamentally different between the two. I
>> understand how broadcast works, but I am not sure how the data is passed
>> across in the 2nd way. Is the dictionary passed to each executor every
>> time a new task runs on that executor, or is it passed only once? Also,
>> how is the data passed to the Python processes? They are Python UDFs, so I
>> think they are executed natively in Python (please correct me if I am
>> wrong). So the data will be serialised and passed to Python.
>> 
>> So in summary my question is: which is the better/more efficient way to
>> write the whole thing, and why?
>> 
>> 
>> Thank you!
>> 
>> 
>> Regards,
>> Dhrub
>> 
> 
>

Re: Collections passed from driver to executors

2019-09-23 Thread Dhrubajyoti Hati
I was wondering if anyone could help with this question.

On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, 
wrote:

> Hi,
>
> I have a question regarding passing a dictionary from the driver to the
> executors in Spark on YARN. This dictionary is needed in a UDF. I am using
> PySpark.
>
> As I understand this can be passed in two ways:
>
> 1. Broadcast the variable and then use it in the UDFs
>
> 2. Pass the dictionary into the UDF itself, something like this:
>
>   def udf1(col1, dict):
>     ..
>   def udf1_fn(dict):
>     return udf(lambda col_data: udf1(col_data, dict))
>
>   df.withColumn("column_new", udf1_fn(dict)("old_column"))
>
> Well, I have tested both approaches and both work.
>
> Now I am wondering what is fundamentally different between the two. I
> understand how broadcast works, but I am not sure how the data is passed
> across in the 2nd way. Is the dictionary passed to each executor every
> time a new task runs on that executor, or is it passed only once? Also,
> how is the data passed to the Python processes? They are Python UDFs, so I
> think they are executed natively in Python (please correct me if I am
> wrong). So the data will be serialised and passed to Python.
>
> So in summary my question is: which is the better/more efficient way to
> write the whole thing, and why?
>
> Thank you!
>
> Regards,
> Dhrub
>
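For completeness, option 1 above (broadcast) as a minimal sketch. This assumes an active SparkSession named spark and an existing DataFrame df; the dictionary, column names, and return type are illustrative:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    lookup = {"a": 1, "b": 2}                  # illustrative dictionary
    bc = spark.sparkContext.broadcast(lookup)  # shipped to each executor once

    @udf(returnType=IntegerType())
    def lookup_udf(col_data):
        # bc.value reads the broadcast dictionary on the executor
        return bc.value.get(col_data)

    df = df.withColumn("column_new", lookup_udf("old_column"))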