Yuming Wang created SPARK-29694:
-----------------------------------

             Summary: Execute UDF only once when there are multiple identical 
UDF usages
                 Key: SPARK-29694
                 URL: https://issues.apache.org/jira/browse/SPARK-29694
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang


Example:
{code:sql}
SELECT
        CASE
                WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
                WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
                WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
                WHEN col2 IN (28,29) THEN
                WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND 
udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
                WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND 
udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') 
THEN 'Yahoo'
                WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 
'Google'
                WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE 
'%gmail%' THEN 'Google'
                WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
                WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
                WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
                WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
                WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE 
'%t.co' THEN 'Twitter'
                WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
                WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
                WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
                WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
                WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
                WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
                WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 
'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
                WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE 
'%comdesc.%' THEN 'com'
                WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
                WHEN udf1(col1, 'rd') IS NULL THEN 'None'
                ELSE 'Other' END AS source_domain,
        COUNT(*) AS cnt
FROM
        tbl s
GROUP BY
        1
{code}

We can rewrite it to:

{code:sql}
SELECT
        CASE
                WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
                WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
                WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
                WHEN col2 IN (28,29) THEN 'Google'
                WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') 
IN ('71188223167180', '14361105000167180') THEN 'Yandex'
                WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') 
IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
                WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
                WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
                WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
                WHEN col1 LIKE '%bing%' THEN 'Bing'
                WHEN col1 LIKE '%facebook%' THEN 'Facebook'
                WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
                WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
                WHEN col1 LIKE '%baidu%' THEN 'Baidu'
                WHEN col1 LIKE '%yandex%' THEN 'Yandex'
                WHEN col1 LIKE '%aol.%' THEN 'AOL'
                WHEN col1 LIKE '%ask.%' THEN 'Ask'
                WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
                WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
                WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE 
'%kleinanzeigen%' THEN 'Kleinanzeigen'
                WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
                WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
                WHEN col1 IS NULL THEN 'None'
                ELSE 'Other' END AS source_domain,
        COUNT(*) AS cnt
FROM
        (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
GROUP BY
        1
{code}

It would be great if we could optimize it by the framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to