Interesting, thanks for the heads up.

On 7/6/15, 7:19 PM, "Davies Liu" <dav...@databricks.com> wrote:

>Currently, Python UDFs run in Python instances and are MUCH slower than
>Scala ones (from 10 to 100x). There is a JIRA to improve the
>performance: https://issues.apache.org/jira/browse/SPARK-8632. After
>that, they will still be much slower than Scala ones (because Python
>is slower, and because of the overhead of calling Python).
>
>On Mon, Jul 6, 2015 at 12:55 PM, Eskilson,Aleksander
><alek.eskil...@cerner.com> wrote:
>> Hi there,
>>
>> I’m trying to get a feel for how User Defined Functions from SparkSQL
>> (as written in Python and registered using the udf function from
>> pyspark.sql.functions) are run behind the scenes. Trying to grok the
>> source, it seems that the native Python function is serialized for
>> distribution to the clusters. In practice, it seems to be able to check
>> for other variables and functions defined elsewhere in the namespace
>> and include those in the function’s serialization.
>>
>> Following all this though, when actually run, are Python interpreter
>> instances on each node brought up to actually run the function against
>> the RDDs, or can the serialized function somehow be run on just the JVM?
>> If bringing up Python instances is the execution model, what is the
>> overhead of PySpark UDFs like as compared to those registered in Scala?
>>
>> Thanks,
>> Alek
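
For anyone reading this thread later, here is a minimal sketch of the kind of setup the quoted question describes: a plain Python function that refers to a variable defined elsewhere in the namespace, wrapped with udf from pyspark.sql.functions. The names SUFFIX, tag_name and run_with_spark are hypothetical, made up for illustration. The Spark part assumes a local Spark installation and uses SparkSession, which comes from Spark releases later than the one discussed in this thread, so treat it as an assumption rather than what was run here.

```python
# Module-level variable that the function below refers to. When the function
# is wrapped as a UDF, this value has to travel along with it.
SUFFIX = "_checked"  # hypothetical value, for illustration only


def tag_name(name):
    """Append SUFFIX to a name; None is passed through unchanged."""
    if name is None:
        return None
    return name + SUFFIX


def run_with_spark():
    """Wrap tag_name as a Python UDF and apply it to a tiny DataFrame.

    The pyspark imports live inside this function so the rest of the file
    can be run and tested without a Spark installation.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("udf-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
        # The wrapped function is evaluated by Python, not by the JVM.
        tag_udf = udf(tag_name, StringType())
        rows = df.select(tag_udf(df["name"]).alias("tagged")).collect()
        return [row["tagged"] for row in rows]
    finally:
        spark.stop()
```

Calling run_with_spark() on a machine with Spark installed applies tag_name to each row through the UDF. The function itself stays ordinary Python, so it can be checked on its own before it is handed to Spark.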