[ https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534923#comment-15534923 ]
Jacob Eisinger commented on SPARK-17728:
----------------------------------------

Thanks for the explanation, but I still think this is an issue.

If Spark assumed there were no side effects and optimized accordingly, there would be no issue: the UDF would be called once per row (1). However, Spark calls a costly function many times, leading to inefficiency.

In our production code, we have a function that takes in a long string and classifies it under a number of different dimensions. This is a very CPU-intensive operation and is a pure function. Obviously, if Spark's optimizer calls the function multiple times, this is _not_ optimal in this scenario.

I think it is intuitive to most that the following code would call the UDF once per row (1):
{code}
val exploded = as
  .withColumn("structured_information", fUdf('a))
  .withColumn("plus_one", 'structured_information("plusOne"))
  .withColumn("squared", 'structured_information("squared"))
{code}

However, Spark calls the UDF three times per row! Is this what you would expect? What am I missing?

(1) - "Once per row" - except when the row needs to be recomputed, such as when workers are lost.
(2) - I attempted to model the long operation via Thread.sleep(); as you mentioned, this does have a slight side effect. Maybe I should have summed the first billion counting numbers to illustrate the slowdown?

> UDFs are run too many times
> ---------------------------
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
> Reporter: Jacob Eisinger
> Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external services from UDFs. The response might not just be a field, but instead a structure of information.
> When attempting to break out this information, it is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]
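
The workaround named in the issue (cache the Dataset after UDF execution) could look roughly like the sketch below. This is not the code from the attachment: the `Info` case class, the `CacheAfterUdf` object, the `explodeWithCache` helper, the sample data, and the local SparkSession are all assumptions made for illustration, and `$"..."` is used in place of the symbol syntax. The idea is that once the UDF output is cached, the later field extractions read the stored struct, so the optimizer should no longer inline the UDF call into each new column. Partitions that are evicted or lost may still be recomputed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

// Hypothetical return type standing in for the "structure of information".
case class Info(plusOne: Int, squared: Int)

object CacheAfterUdf {

  // Builds the exploded DataFrame, caching right after the UDF column is added.
  def explodeWithCache(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical stand-in for the expensive pure function in the comment.
    val fUdf = udf((a: Int) => Info(a + 1, a * a))

    // Hypothetical sample data; the original uses a Dataset named `as`.
    val as = Seq(1, 2, 3).toDF("a")

    // Workaround from the issue: cache after the UDF has been applied.
    val structured = as
      .withColumn("structured_information", fUdf($"a"))
      .cache()

    // Break the struct out into separate columns from the cached data.
    structured
      .withColumn("plus_one", $"structured_information".getField("plusOne"))
      .withColumn("squared", $"structured_information".getField("squared"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-cache-sketch")
      .getOrCreate()

    explodeWithCache(spark).show()
    spark.stop()
  }
}
```

Caching trades memory (or disk) for CPU, so it is mainly worthwhile when the UDF is as expensive as the classifier described above.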