Hi there, I’m trying to get a feel for how user-defined functions in Spark SQL (written in Python and registered using the udf function from pyspark.sql.functions) run behind the scenes. Trying to grok the source, it seems that the native Python function is serialized for distribution across the cluster. In practice, the serializer also appears to pick up other variables and functions referenced from the enclosing namespace and include them in the function’s serialized closure.
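For concreteness, here is the kind of thing I mean, as a minimal sketch assuming a Spark 2.x-style SparkSession (the names offset and add_offset are just made up for illustration), where a variable defined outside the function gets captured along with it:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("udf-closure-demo").getOrCreate()

    # A variable defined outside the UDF body; as far as I can tell,
    # it gets captured and shipped along with the serialized function.
    offset = 10

    def add_offset(x):
        return x + offset  # references the enclosing namespace

    add_offset_udf = udf(add_offset, IntegerType())

    df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
    df.select(add_offset_udf(df["x"]).alias("x_plus_offset")).show()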
Following on from all this: when the query actually runs, does Spark spin up Python interpreter instances on each node to execute the function against the data, or can the serialized function somehow be run directly on the JVM? If spinning up Python processes is the execution model, how does the overhead of PySpark UDFs compare to UDFs registered in Scala?

Thanks,
Alek
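P.S. In case it helps frame the overhead question, this is roughly the micro-benchmark I had in mind. It is just a sketch (the sizing and names are made up), and it compares a Python UDF against an equivalent built-in JVM expression, which I realize is not exactly the same thing as a Scala-registered UDF:

    import time

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.functions import sum as sum_
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("udf-overhead-demo").getOrCreate()

    df = spark.range(10 * 1000 * 1000)  # one LongType column named "id"

    # Python UDF: rows get serialized out to worker Python processes
    plus_one_udf = udf(lambda x: x + 1, LongType())

    def timed(label, frame):
        start = time.time()
        frame.collect()  # force evaluation of the whole plan
        print(label, time.time() - start)

    # Goes through Python worker processes on each executor
    timed("python udf", df.select(sum_(plus_one_udf(col("id")))))

    # Equivalent built-in expression, stays entirely on the JVM
    timed("jvm expression", df.select(sum_(col("id") + 1)))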