[jira] [Commented] (SPARK-17728) UDFs are run too many times

2022-06-22 Thread Josh Rosen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557756#comment-17557756
 ] 

Josh Rosen commented on SPARK-17728:


As of SPARK-36718 in Spark 3.3, I think the {{explode(array(udf()))}} trick 
should no longer be needed: Spark now avoids collapsing projections when doing 
so would duplicate expensive-to-evaluate expressions.

There may still be rare cases where the trick is needed (e.g. to work around 
SPARK-38485), but most cases should be addressed by Spark 3.3's improved 
CollapseProject optimizer rule.
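The pattern that should now work without the explode/array wrapping can be sketched as follows. This is a hedged sketch, not code from the ticket: {{fUdf}}, {{df}}, and the struct fields are assumed stand-ins for the reporter's setup; verify on your own build with {{explain(true)}}.

```scala
// Hedged sketch: fUdf is assumed to be a struct-returning Spark UDF and
// df a DataFrame with an integer column "a" (requires spark.implicits._
// for the $"..." syntax).
import org.apache.spark.sql.functions.udf

case class Structured(plusOne: Int, squared: Int)
val fUdf = udf((a: Int) => Structured(a + 1, a * a))

val result = df
  .withColumn("structured_information", fUdf($"a"))
  .withColumn("plus_one", $"structured_information"("plusOne"))
  .withColumn("squared", $"structured_information"("squared"))
// On Spark 3.3+, explain(true) should show a single UDF invocation
// in the collapsed Project.
```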

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external 
> services from UDFs. The response might not be just a field, but instead a 
> structure of information. When breaking out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe the run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]
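The caching workaround mentioned in the issue description can be sketched as follows; this is a hedged sketch with {{as}} and {{fUdf}} as hypothetical stand-ins for the reporter's data and struct-returning UDF.

```scala
// Hedged sketch of the caching workaround: materialize the UDF output
// once via cache(), then derive the dependent columns from the cached
// result instead of re-invoking the UDF per derived column.
val withStruct = as.withColumn("structured_information", fUdf($"a")).cache()
val result = withStruct
  .withColumn("plus_one", $"structured_information"("plusOne"))
  .withColumn("squared", $"structured_information"("squared"))
```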



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-06 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552708#comment-15552708
 ] 

Jacob Eisinger commented on SPARK-17728:


[~hvanhovell], thanks for looking into this!  Unless you can think of a better 
way to ensure the UDF doesn't get executed multiple times, we are going to go 
with your workaround of:

{code}
val exploded = as
  .withColumn("structured_information", explode(array(fUdf('a))))
  .withColumn("plus_one", 'structured_information("plusOne"))
  .withColumn("squared", 'structured_information("squared"))
{code}

{code}
exploded2.explain
== Physical Plan ==
*Project [a#10, structured_information#159, structured_information#159.plusOne 
AS plus_one#161, structured_information#159.squared AS squared#166]
+- Generate explode(array(if (isnull(a#10)) null else UDF(a#10))), true, false, 
[a#10, structured_information#159]
   +- *BatchedScan parquet [a#10] Format: ParquetFormat, InputPaths: 
file:/tmp/as.parquet, PushedFilters: [], ReadSchema: struct
{code}

I reckon it might impact GC a bit with the creation of the extra arrays --- 
but, that sure beats the cost of running those expensive UDFs!  Thanks again 
for the excellent explanations!




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-05 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549514#comment-15549514
 ] 

Herman van Hovell commented on SPARK-17728:
---

I think we eventually should add it; however, this is currently quite hard to 
implement properly. There is also the matter that the JIT is really good at 
inlining small functions and doing common subexpression elimination for us. In 
this particular case it makes more sense to me to add costs to UDFs or to add a 
mechanism that prevents project collapsing.




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-05 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15549478#comment-15549478
 ] 

Jacob Eisinger commented on SPARK-17728:


Thanks for the great explanation!  Are there plans for subexpression 
elimination in whole-stage code generation --- and do you think there should be?




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547358#comment-15547358
 ] 

Herman van Hovell commented on SPARK-17728:
---

The second one does not trigger the behavior because it is turned into a 
LocalRelation; these are evaluated during optimization, before we collapse 
projections.

The following in memory dataframe should trigger the same behavior:
{noformat}
spark.range(1, 10).withColumn("expensive_udf_result", 
fUdf($"id")).withColumn("b", $"expensive_udf_result" + 100)
{noformat}




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547352#comment-15547352
 ] 

Herman van Hovell commented on SPARK-17728:
---

There are different evaluation paths in Spark SQL:
- Interpreted. Expressions are evaluated using an eval(...) method. Plans are 
evaluated using iterators (volcano model). This is what I mean by the 
completely interpreted path.
- Expression-codegenerated. All expressions are evaluated using a 
code-generated function. Plans are evaluated using iterators.
- Whole-stage codegenerated. All expressions and most plans are evaluated using 
code generation.

I think you are using whole-stage code generation. This does not support common 
subexpression elimination.
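One way to check which evaluation path a query takes is to toggle whole-stage code generation and compare plans and timings. A hedged diagnostic sketch (assuming Spark 2.x, where the {{spark.sql.codegen.wholeStage}} conf controls this; not a production recommendation):

```scala
// Diagnostic sketch only: disabling whole-stage code generation falls
// back to the expression-codegen path, which does perform common
// subexpression elimination. Compare the UDF invocation count and plan.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
exploded.explain()  // operators lose the "*" whole-stage marker
spark.conf.set("spark.sql.codegen.wholeStage", "true")
```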






[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-03 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544309#comment-15544309
 ] 

Jacob Eisinger commented on SPARK-17728:


Also, it is interesting to note that this occurs for Parquet-backed data --- 
and not for Datasets generated in memory.

For example,
{code}
val as = spark.read.parquet("/tmp/as.parquet")
{code}
triggers the behavior, but
{code}
val as = (1 to 10).toDF("a")
{code}
does not.




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-03 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544303#comment-15544303
 ] 

Jacob Eisinger commented on SPARK-17728:


What do you mean by _completely interpreted path_?




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543969#comment-15543969
 ] 

Herman van Hovell commented on SPARK-17728:
---

First of all, we implement subexpression elimination (which is a form of 
memoization), and this should prevent multiple invocations from happening. I am 
quite curious why this is not triggering in your case. Are you on a completely 
interpreted path?

A cost function for a UDF is doable; we would have to do this for expression 
trees, though, and that is non-trivial to implement.






[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-10-03 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15542689#comment-15542689
 ] 

Jacob Eisinger commented on SPARK-17728:


Thanks for the explanation and the tricky code snippet!  I kind of figured it 
was optimizing incorrectly / over-optimizing.  It sounds like this is not a 
defect, because normally this optimization of collapsing projects is the 
desired route.  Correct?

Do you think it is worth filing a feature request to allow working with costly 
UDFs?  Possibly:
 * Memoize UDFs / other transforms on a per row basis.
 * Manually override costs for UDFs.




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535023#comment-15535023
 ] 

Herman van Hovell commented on SPARK-17728:
---

I think calling {{explain(true)}} on your plans helps to understand what is 
going on.

Spark executes the UDF three times per row because the optimizer collapses 
subsequent projects (a project normally being much more expensive than a UDF 
call). In your case the three projects get rewritten into one project, and the 
expressions are rewritten into the following form:
{noformat}
structured_information -> fUdf('a)
plus_one -> fUdf('a).get("plusOne")
squared -> fUdf('a).get("squared")
{noformat}

It is a bit tricky to get around this; the following might work:
{noformat}
val exploded = as
  .withColumn("structured_information", explode(array(fUdf('a))))
  .withColumn("plus_one", 'structured_information("plusOne"))
  .withColumn("squared", 'structured_information("squared"))
{noformat}




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534923#comment-15534923
 ] 

Jacob Eisinger commented on SPARK-17728:


Thanks for the explanation, but I still think this is an issue.

If Spark assumed there were no side effects and optimized accordingly, there 
would be no issue: the UDF would be called once per row (1).  However, Spark 
calls a costly function many times, leading to inefficiency.

In our production code, we have a function that takes in a long string and 
classifies it under a number of different dimensions.  This is a very 
CPU-intensive operation and is a pure function (2).  Obviously, if Spark's 
optimizer calls the function multiple times, this is _not_ optimal in this 
scenario.

I think it is intuitive to most that the following code would call the UDF once 
per row (1):
{code}
val exploded = as
  .withColumn("structured_information", fUdf('a))
  .withColumn("plus_one", 'structured_information("plusOne"))
  .withColumn("squared", 'structured_information("squared"))
{code}
However, Spark calls the UDF three times per row!  Is this what you would 
expect?  What am I missing?

(1) - "Once per row" - except when the row needs to be recomputed, such as when 
workers are lost. 
(2) - I attempted to model the long operation via Thread.sleep(); as you 
mentioned, this does have a slight side effect.  Maybe I should have summed the 
first billion counting numbers to illustrate the slowdown?




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534792#comment-15534792
 ] 

Herman van Hovell commented on SPARK-17728:
---

I am going to close this as not a problem, but feel free to follow up.




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534790#comment-15534790
 ] 

Herman van Hovell commented on SPARK-17728:
---

Spark assumes UDFs are pure functions; we do not guarantee that a function is 
only executed once. This is due to the way the optimizer works, and to the fact 
that stages are sometimes retried. We could add a flag to UDFs to prevent this 
from happening, but that would be a considerable engineering effort.

The example you give is not really a pure function, as its side effect makes 
the thread sleep (changes state). 

If you are connecting to an external service, then I would suggest using 
{{Dataset.mapPartitions(...)}} (similar to a generator). This will allow you to 
set up one connection per partition, and you can call a method as much or as 
little as you like.
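The mapPartitions suggestion can be sketched as follows. This is a hedged sketch: {{ServiceClient}} and {{classify}} are hypothetical stand-ins for an external-service client, and {{as}} is assumed to be a single-column integer Dataset; the point is that setup happens once per partition, not once per row.

```scala
// Hedged sketch: one connection per partition via Dataset.mapPartitions.
// ServiceClient/classify are hypothetical; requires spark.implicits._
// for the Int and case-class encoders.
import spark.implicits._

case class Scored(a: Int, label: String)

val scored = as.as[Int].mapPartitions { rows =>
  val client = new ServiceClient()            // set up once per partition
  rows.map(a => Scored(a, client.classify(a)))
}
```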





[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534583#comment-15534583
 ] 

Jacob Eisinger commented on SPARK-17728:


I am a little confused.
# Could you explain how a generator would apply here?
# You mentioned that UDFs should be pure functions.  Is Spark optimizing the 
function calls as if they are pure functions?

(Also, please check out my example --- the UDF there _should_ be a pure 
function.)




[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534059#comment-15534059
 ] 

Herman van Hovell commented on SPARK-17728:
---

You really should not try to use any external state in a UDF (it should be a 
pure function).

It might be an idea to use a generator in this case. These are guaranteed to 
execute only once per input tuple.

