[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs

2016-01-04 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081818#comment-15081818
 ] 

Antonio Piccolboni commented on SPARK-11438:


OK, thanks for the clarification.

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, all UDFs are deterministic. It will be great if we allow users to 
> define nondeterministic UDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-11-02 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986377#comment-14986377
 ] 

Antonio Piccolboni commented on SPARK-11438:


The problem with nondeterminism is that it combines poorly with a computing 
model where retries are allowed. In fact, it allows programs that should fail to 
return incorrect results. Imagine a UDF rnorm(mu, sigma) that returns samples 
from a normal distribution. Imagine that the larger program containing it fails 
whenever the sample returned falls in the top 10 percent of the distribution. If 
enough fault tolerance is built in, the program will terminate successfully, but 
rnorm will effectively sample from a new distribution: a normal truncated at the 
90th percentile and renormalized. If thinking about a continuous distribution 
hampers intuition, imagine a dice-simulating UDF and a program that returns the 
average of many throws. Imagine the program or the UDF itself fails whenever the 
sampled value is, or should be, 1. The returned average will be approximately 4 
instead of 3.5, because only the values 2 through 6 survive the retries. In 
light of this, I don't think this feature should be added.
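
The dice example above can be checked with a small simulation. This is only a sketch in plain Scala with no Spark involved: `throwDie`, `throwWithRetry`, and `mean` are hypothetical names introduced for illustration, and the "failure" is modeled simply as discarding a throw of 1 and trying again.

```scala
import scala.util.Random

object RetryBias {
  // Hypothetical stand-in for a nondeterministic UDF: one throw of a fair die.
  def throwDie(rng: Random): Int = rng.nextInt(6) + 1

  // Hypothetical task that "fails" whenever the die shows 1 and is retried
  // until it succeeds, mimicking a fault-tolerant scheduler.
  @annotation.tailrec
  def throwWithRetry(rng: Random): Int = {
    val v = throwDie(rng)
    if (v == 1) throwWithRetry(rng) else v
  }

  def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    val n = 100000
    val plain = mean(Seq.fill(n)(throwDie(rng)))
    val retried = mean(Seq.fill(n)(throwWithRetry(rng)))
    println(f"mean without failures: $plain%.3f")  // expected value is 3.5
    println(f"mean with retry on 1:  $retried%.3f") // expected value is (2+3+4+5+6)/5 = 4.0
  }
}
```

The retried samples are uniform on 2 through 6, which is where the figure of 4 comes from.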

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It will be great if we allow users to 
> define nondeterministic UDFs.






[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-11-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986390#comment-14986390
 ] 

Xiangrui Meng commented on SPARK-11438:
---

[~piccolbo] `nondeterministic` means that Catalyst won't call `eval` multiple 
times, for example,

{code}
df.select(randn(0L).as("r"))
  .select(col("r") + 1.0, col("r") + 2.0)
{code}

won't trigger the random number generator more than once on a single record. It 
doesn't mean that the result will differ across multiple job runs. You can 
interpret `nondeterministic` as `do not evaluate more than once if it appears 
multiple times in a single projection`. For UDFs, this flag is useful for 
avoiding repeated heavy computations.




[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-11-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986570#comment-14986570
 ] 

Yin Huai commented on SPARK-11438:
--

[~piccolbo] Thank you for your comments! As mentioned by [~mengxr], the 
semantics of nondeterministic here are closer to generating random values in a 
deterministic way. We will look into finding a better name.
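
One way to read "random values in a deterministic way" is seeded generation: the values look random, but the same seed reproduces them on every run. A minimal plain-Scala sketch, where `sample` is a hypothetical helper and not part of Spark:

```scala
import scala.util.Random

object SeededSample {
  // Hypothetical helper: draw n Gaussian samples from a generator with a fixed seed.
  def sample(seed: Long, n: Int): Vector[Double] = {
    val rng = new Random(seed)
    Vector.fill(n)(rng.nextGaussian())
  }

  def main(args: Array[String]): Unit = {
    // Same seed, same sequence, regardless of how many times this is run.
    println(sample(0L, 3) == sample(0L, 3))
    // A different seed gives a different sequence.
    println(sample(0L, 3) == sample(1L, 3))
  }
}
```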




[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs

2015-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984223#comment-14984223
 ] 

Apache Spark commented on SPARK-11438:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9393
