Re: [DISCUSS] Expensive deterministic UDFs

2019-12-31 Thread Guy Khazma
Hi all. For the use case where the expensive UDF has constant inputs (literals), we have proposed the following JIRA and PR, which calculate the UDF only once in the driver: https://issues.apache.org/jira/browse/SPARK-27692 https://github.com/apache/spark/pull/24593 If considering revisiting the
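To illustrate the idea of evaluating a UDF with literal inputs only once, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how the linked PR is implemented; the names `expensive_udf`, `apply_per_row`, and `apply_folded` are hypothetical and exist only for this example.

```python
calls = 0  # counts how often the stand-in UDF is invoked


def expensive_udf(x):
    """Hypothetical stand-in for a costly deterministic function."""
    global calls
    calls += 1
    return x * x


def apply_per_row(rows, literal):
    # Naive plan: the UDF is re-evaluated for every row,
    # even though its input is a constant.
    return [(row, expensive_udf(literal)) for row in rows]


def apply_folded(rows, literal):
    # Folded plan: evaluate once up front (the "driver" step),
    # then reuse the value for every row.
    value = expensive_udf(literal)
    return [(row, value) for row in rows]


rows = list(range(5))

calls = 0
naive = apply_per_row(rows, 3)
naive_calls = calls

calls = 0
folded = apply_folded(rows, 3)
folded_calls = calls
```

Both plans produce the same rows; only the number of UDF invocations differs, and the rewrite is safe only because the function is deterministic.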

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-08 Thread Enrico Minack
I agree that 'non-deterministic' is the right term for what it currently does: mark an expression as non-deterministic (returns different values for the same input, e.g. rand()), and the optimizer does its best to not break semantics when moving expressions around. In case of expensive
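A toy model in plain Python may help show why an optimizer has to be careful when moving non-deterministic expressions around. This is a sketch under stated assumptions, not Catalyst code: `make_counter_expr`, `plan_materialized`, and `plan_inlined` are hypothetical names, and the counter stands in for something like rand().

```python
import itertools


def make_counter_expr():
    # Hypothetical non-deterministic expression:
    # every evaluation returns a new value, regardless of the input row.
    counter = itertools.count()
    return lambda row: next(counter)


def plan_materialized(rows, expr):
    # Evaluate the expression once per row, then reference the result twice.
    out = []
    for row in rows:
        value = expr(row)
        out.append((value, value))
    return out


def plan_inlined(rows, expr):
    # "Optimized" plan that inlines the expression at both use sites,
    # so it is evaluated twice per row.
    return [(expr(row), expr(row)) for row in rows]


rows = ["a", "b", "c"]

materialized = plan_materialized(rows, make_counter_expr())
inlined = plan_inlined(rows, make_counter_expr())

# For a deterministic expression (here: len) both plans agree.
deterministic_same = plan_materialized(rows, len) == plan_inlined(rows, len)
```

For the deterministic expression the two plans return identical rows and differ only in cost, while for the non-deterministic one the inlined plan returns different values in the two columns.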

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documents to define what non-deterministic means. AFAIK, non-deterministic expressions may produce a different result for the same input row, if the already processed input rows are different. The optimizer tries its best to not change the input sequence of non-deterministic

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Rubén Berenguel
That was very interesting, thanks Enrico. Sean, IIRC it also prevents push down of the UDF in Catalyst in some cases. Regards, Ruben
> On 7 Nov 2019, at 11:09, Sean Owen wrote:
>
> Interesting, what does non-deterministic do except have this effect?
> aside from the naming, it could be a

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Sean Owen
Interesting, what does non-deterministic do except have this effect? Aside from the naming, it could be a fine use of this flag if that's all it effectively does. I'm not sure I'd introduce another flag with the same semantics just over naming. If anything, 'expensive' also isn't the right word,