Currently, Python UDFs run in Python instances and are MUCH slower than
Scala ones (from 10 to 100x). There is a JIRA to improve the
performance:
After that, they will still be much slower than Scala ones (because Python
is slower and because of the overhead of calling into Python).
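
To give a rough feel for where the calling overhead comes from, here is a
toy model in plain Python (no Spark involved). It assumes that rows are
serialized in batches, handed to a Python worker, and the results serialized
back again. The names plus_one, run_in_process and run_via_worker are made up
for illustration, and pickle only stands in for whatever serialization Spark
actually uses, so treat it as a sketch rather than the real code path.

```python
import pickle


def plus_one(x):
    # Stand-in for a user-defined Python function.
    return x + 1


def run_in_process(func, rows):
    # Models a UDF that runs where the data already lives:
    # the function is simply called on each value.
    return [func(r) for r in rows]


def run_via_worker(func, rows, batch_size=1000):
    # Models a UDF that runs in a separate Python worker: every batch is
    # serialized, handed over, deserialized and processed, and the results
    # make the same trip back. (Assumption: a real worker is a separate
    # process; here both sides run in one process for simplicity.)
    out = []
    for start in range(0, len(rows), batch_size):
        payload = pickle.dumps(rows[start:start + batch_size])  # to the worker
        batch = pickle.loads(payload)
        results = [func(r) for r in batch]
        reply = pickle.dumps(results)  # back from the worker
        out.extend(pickle.loads(reply))
    return out


rows = list(range(10000))
direct = run_in_process(plus_one, rows)
via_worker = run_via_worker(plus_one, rows)
```

Both paths return the same values; the second one just does extra
serialization work on every batch. Wrapping the two calls in timeit shows the
difference on your own machine, though a single-process toy like this says
nothing reliable about the actual ratio inside Spark.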

On Mon, Jul 6, 2015 at 12:55 PM, Eskilson,Aleksander
<> wrote:
> Hi there,
> I’m trying to get a feel for how User Defined Functions from SparkSQL (as
> written in Python and registered using the udf function from
> pyspark.sql.functions) are run behind the scenes. Trying to grok the source
> it seems that the native Python function is serialized for distribution to
> the clusters. In practice, it seems to be able to check for other variables
> and functions defined elsewhere in the namespace and include those in the
> function’s serialization.
> Following all this though, when actually run, are Python interpreter
> instances on each node brought up to actually run the function against the
> RDDs, or can the serialized function somehow be run on just the JVM? If
> bringing up Python instances is the execution model, what is the overhead of
> PySpark UDFs like as compared to those registered in Scala?
> Thanks,
> Alek
