[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081818#comment-15081818 ]

Antonio Piccolboni commented on SPARK-11438:

OK, thanks for the clarification.

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
>
> Right now, all UDFs are deterministic. It will be great if we allow users to
> define nondeterministic UDFs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986377#comment-14986377 ]

Antonio Piccolboni commented on SPARK-11438:

The problem with nondeterminism is that it combines poorly with a computing model where retries are allowed. In fact, it allows programs that should fail to return incorrect results instead.

Imagine a UDF rnorm(mu, sigma) that returns samples from a normal distribution, and imagine that the larger program containing it fails whenever the sample returned is in the top 10 percent of the distribution. If enough fault tolerance is built in, the program will terminate successfully, but rnorm will effectively sample from a different distribution: a normal truncated at the 90th percentile and renormalized.

If thinking about a continuous distribution hampers intuition, imagine a dice-simulating UDF and a program that returns the average of many throws. Imagine that the program or the UDF itself fails whenever the sampled value is or should be 1. Because every failed throw is retried, only the values 2 through 6 survive, and the returned average will be approximately 4 instead of 3.5.

In light of this, I don't think this feature should be added.

> Allow users to define nondeterministic UDFs
> ---
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It will be great if we allow users to
> define nondeterministic UDFs.
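The dice argument in the comment above can be checked with a small simulation. This is only a sketch in plain Python, not Spark code: the function names are hypothetical, and the retry loop is an assumed stand-in for a fault-tolerant framework re-running a task that crashed on a particular value.

```python
import random


def roll_until_success(rng):
    """Roll a fair die, retrying whenever the roll is 1.

    The retry loop is a stand-in (an assumption for illustration) for a
    fault-tolerant framework re-running a task that crashed on that value.
    """
    while True:
        value = rng.randint(1, 6)  # uniform on 1..6, inclusive
        if value != 1:
            return value


def average_roll(n, seed, retry_on_one):
    """Average n throws, with or without the crash-and-retry behaviour."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    if retry_on_one:
        throws = [roll_until_success(rng) for _ in range(n)]
    else:
        throws = [rng.randint(1, 6) for _ in range(n)]
    return sum(throws) / n


if __name__ == "__main__":
    n = 200_000
    print("no failures:", average_roll(n, seed=0, retry_on_one=False))
    print("retry on 1: ", average_roll(n, seed=0, retry_on_one=True))
```

With enough throws the first average should land near 3.5 and the second near 4, the mean of the values 2 through 6, even though no individual run ever reports a failure.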
[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986390#comment-14986390 ]

Xiangrui Meng commented on SPARK-11438:

[~piccolbo] `nondeterministic` means that Catalyst won't call `eval` multiple times. For example,

{code}
import org.apache.spark.sql.functions.{col, randn}
// df is any existing DataFrame; "r" is referenced twice in the second projection
df.select(randn(0L).as("r"))
  .select(col("r") + 1.0, col("r") + 2.0)
{code}

won't trigger the random number generator more than once on a single record. It doesn't mean that the result would be different across multiple job runs. You can interpret `nondeterministic` as `do not evaluate more than once if it appears multiple times in a single projection`. For UDFs, this flag is useful for avoiding repeated heavy computations.
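To make the "evaluate once per record" reading concrete, here is a rough sketch in plain Python. It is not Catalyst's implementation and uses no Spark APIs; both function names are hypothetical. It contrasts inlining the random expression into each output column with evaluating it once and reusing the value, and it also shows that a fixed seed gives the same values on every run.

```python
import random


def project_inlined(rng):
    """Inline the expression for r into both output columns.

    The generator is consulted twice for one record, so the two columns
    are built from two different draws.
    """
    return (rng.gauss(0.0, 1.0) + 1.0, rng.gauss(0.0, 1.0) + 2.0)


def project_evaluate_once(rng):
    """Evaluate r once per record and reuse it in both output columns."""
    r = rng.gauss(0.0, 1.0)
    return (r + 1.0, r + 2.0)


if __name__ == "__main__":
    seed = 0
    a, b = project_inlined(random.Random(seed))
    print("inlined:       b - a =", b - a)
    a, b = project_evaluate_once(random.Random(seed))
    print("evaluate once: b - a =", b - a)
    # Same seed, same values: the flag is not about results changing
    # from one run to the next.
    first = project_evaluate_once(random.Random(seed))
    second = project_evaluate_once(random.Random(seed))
    print("reproducible across runs:", first == second)
```

Only the evaluate-once version keeps the two columns consistent with each other, differing by 1 up to floating-point rounding, which is the guarantee described in the comment above.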
[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986570#comment-14986570 ]

Yin Huai commented on SPARK-11438:

[~piccolbo] Thank you for your comments! As mentioned by [~mengxr], the semantics of `nondeterministic` are closer to generating random values in a deterministic way. We will look into finding a better name.
[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984223#comment-14984223 ]

Apache Spark commented on SPARK-11438:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9393