Hi All,
For the use case where the expensive UDF has constant inputs (literals) we
have proposed the following JIRA and PR which calculates the UDF only once
in the driver:
https://issues.apache.org/jira/browse/SPARK-27692
https://github.com/apache/spark/pull/24593
If considering revisiting the
I agree that 'non-deterministic' is the right term for what it currently
does: mark an expression as non-deterministic (returns different values
for the same input, e.g. rand()), and the optimizer does its best to not
break semantics when moving expressions around.
In case of expensive
We really need some documents to define what non-deterministic means.
AFAIK, non-deterministic expressions may produce a different result for the
same input row, if the already processed input rows are different.
The optimizer tries its best to not change the input sequence
of non-deterministic
That was very interesting, thanks Enrico.
Sean, IIRC it also prevents push down of the UDF in Catalyst in some cases.
Regards,
Ruben
> On 7 Nov 2019, at 11:09, Sean Owen wrote:
>
> Interesting, what does non-deterministic do except have this effect?
> aside from the naming, it could be a
Interesting, what does non-deterministic do except have this effect?
aside from the naming, it could be a fine use of this flag if that's
all it effectively does. I'm not sure I'd introduce another flag with
the same semantics just over naming. If anything 'expensive' also
isn't the right word,