Yuming Wang created SPARK-29694: ----------------------------------- Summary: Execute UDF only once when there are multiple identical UDF usages Key: SPARK-29694 URL: https://issues.apache.org/jira/browse/SPARK-29694 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
Example: {code:sql} SELECT CASE WHEN udf1(col1, 'swd') = '2' THEN 'Facebook' WHEN udf1(col1, 'swd') = '3' THEN 'Twitter' WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest' WHEN col2 IN (28,29) THEN WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex' WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo' WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 'Google' WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE '%gmail%' THEN 'Google' WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo' WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing' WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook' WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest' WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE '%t.co' THEN 'Twitter' WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu' WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex' WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL' WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask' WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo' WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online' WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen' WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE '%comdesc.%' THEN 'com' WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal' WHEN udf1(col1, 'rd') IS NULL THEN 'None' ELSE 'Other' END AS source_domain, COUNT(*) AS cnt FROM tbl s GROUP BY 1 {code} We can rewrite it to: {code:sql} SELECT CASE WHEN udf1(col1, 'swd') = '2' THEN 'Facebook' WHEN udf1(col1, 'swd') = '3' THEN 'Twitter' WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest' WHEN col2 IN (28,29) THEN 'Google' WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex' WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo' WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google' WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google' WHEN col1 LIKE '%yahoo%' THEN 'Yahoo' WHEN col1 LIKE '%bing%' THEN 'Bing' WHEN col1 LIKE '%facebook%' THEN 'Facebook' WHEN col1 LIKE '%pinterest%' THEN 'Pinterest' WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter' WHEN col1 LIKE '%baidu%' THEN 'Baidu' WHEN col1 LIKE '%yandex%' THEN 'Yandex' WHEN col1 LIKE '%aol.%' THEN 'AOL' WHEN col1 LIKE '%ask.%' THEN 'Ask' WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo' WHEN col1 LIKE '%t-online.de' THEN 'T-Online' WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen' WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com' WHEN col1 LIKE '%paypal.%' THEN 'PayPal' WHEN col1 IS NULL THEN 'None' ELSE 'Other' END AS source_domain, COUNT(*) AS cnt FROM (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s GROUP BY 1 {code} It would be great if we could optimize it by the framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org