Hi there, I’m trying to get a feel for how user-defined functions in Spark SQL (written in Python and registered using the udf function from pyspark.sql.functions) run behind the scenes. Trying to grok the source, it seems that the native Python function is serialized for distribution across the cluster. In practice, the serializer also appears to pick up other variables and functions referenced from the enclosing namespace and include them in the function’s serialized closure.
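For concreteness, here is the kind of thing I mean, as a minimal sketch assuming a Spark 2.x-style SparkSession (the names offset and add_offset are just made up for illustration), where a variable defined outside the function gets captured along with it:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("udf-closure-demo").getOrCreate()

    # A variable defined outside the UDF body; as far as I can tell,
    # it gets captured and shipped along with the serialized function.
    offset = 10

    def add_offset(x):
        return x + offset  # references the enclosing namespace

    add_offset_udf = udf(add_offset, IntegerType())

    df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
    df.select(add_offset_udf(df["x"]).alias("x_plus_offset")).show()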
Following on from all this: when the query actually runs, does Spark spin up Python interpreter instances on each node to execute the function against the data, or can the serialized function somehow be run directly on the JVM? If spinning up Python processes is the execution model, how does the overhead of PySpark UDFs compare to UDFs registered in Scala?

Thanks,
Alek
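P.S. In case it helps frame the overhead question, this is roughly the micro-benchmark I had in mind. It is just a sketch (the sizing and names are made up), and it compares a Python UDF against an equivalent built-in JVM expression, which I realize is not exactly the same thing as a Scala-registered UDF:

    import time

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.functions import sum as sum_
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("udf-overhead-demo").getOrCreate()

    df = spark.range(10 * 1000 * 1000)  # one LongType column named "id"

    # Python UDF: rows get serialized out to worker Python processes
    plus_one_udf = udf(lambda x: x + 1, LongType())

    def timed(label, frame):
        start = time.time()
        frame.collect()  # force evaluation of the whole plan
        print(label, time.time() - start)

    # Goes through Python worker processes on each executor
    timed("python udf", df.select(sum_(plus_one_udf(col("id")))))

    # Equivalent built-in expression, stays entirely on the JVM
    timed("jvm expression", df.select(sum_(col("id") + 1)))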