[jira] [Commented] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages

Xuedong Luan (Jira) Thu, 31 Oct 2019 20:26:35 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-29694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964571#comment-16964571
 ]


Xuedong Luan commented on SPARK-29694:
--------------------------------------

hi [~yumwang]

 I will work on this Jira

> Execute UDF only once when there are multiple identical UDF usages
> ------------------------------------------------------------------
>
>                 Key: SPARK-29694
>                 URL: https://issues.apache.org/jira/browse/SPARK-29694
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> Example:
> {code:sql}
> SELECT
>       CASE
>               WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
>               WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
>               WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
>               WHEN col2 IN (28,29) THEN
>               WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND 
> udf1(col1, 'rot') IN ('71188223167180', '14361105000167180') THEN 'Yandex'
>               WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL AND 
> udf1(col1, 'rot') IN ('4686145537108740', '7055082982390', '7113399718530') 
> THEN 'Yahoo'
>               WHEN col2 IN (10,16,18) AND udf1(col1, 'rd') IS NULL THEN 
> 'Google'
>               WHEN udf1(col1, 'rd') LIKE '%google%' OR udf1(col1, 'rd') LIKE 
> '%gmail%' THEN 'Google'
>               WHEN udf1(col1, 'rd') LIKE '%yahoo%' THEN 'Yahoo'
>               WHEN udf1(col1, 'rd') LIKE '%bing%' THEN 'Bing'
>               WHEN udf1(col1, 'rd') LIKE '%facebook%' THEN 'Facebook'
>               WHEN udf1(col1, 'rd') LIKE '%pinterest%' THEN 'Pinterest'
>               WHEN udf1(col1, 'rd') LIKE '%twitter%' OR udf1(col1, 'rd') LIKE 
> '%t.co' THEN 'Twitter'
>               WHEN udf1(col1, 'rd') LIKE '%baidu%' THEN 'Baidu'
>               WHEN udf1(col1, 'rd') LIKE '%yandex%' THEN 'Yandex'
>               WHEN udf1(col1, 'rd') LIKE '%aol.%' THEN 'AOL'
>               WHEN udf1(col1, 'rd') LIKE '%ask.%' THEN 'Ask'
>               WHEN udf1(col1, 'rd') LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
>               WHEN udf1(col1, 'rd') LIKE '%t-online.de' THEN 'T-Online'
>               WHEN udf1(col1, 'rd') LIKE '%com-kleinanzeigen.%' OR udf1(col1, 
> 'rd') LIKE '%kleinanzeigen%' THEN 'Kleinanzeigen'
>               WHEN udf1(col1, 'rd') LIKE '%com.%' OR udf1(col1, 'rd') LIKE 
> '%comdesc.%' THEN 'com'
>               WHEN udf1(col1, 'rd') LIKE '%paypal.%' THEN 'PayPal'
>               WHEN udf1(col1, 'rd') IS NULL THEN 'None'
>               ELSE 'Other' END AS source_domain,
>       COUNT(*) AS cnt
> FROM
>       tbl s
> GROUP BY
>       1
> {code}
> We can rewrite it to:
> {code:sql}
> SELECT
>       CASE
>               WHEN udf1(col1, 'swd') = '2' THEN 'Facebook'
>               WHEN udf1(col1, 'swd') = '3' THEN 'Twitter'
>               WHEN udf1(col1, 'swd') = '11' THEN 'Pinterest'
>               WHEN col2 IN (28,29) THEN 'Google'
>               WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') 
> IN ('71188223167180', '14361105000167180') THEN 'Yandex'
>               WHEN col2 IN (10,16,18) AND col1 IS NULL AND udf1(col1, 'rot') 
> IN ('4686145537108740', '7055082982390', '7113399718530') THEN 'Yahoo'
>               WHEN col2 IN (10,16,18) AND col1 IS NULL THEN 'Google'
>               WHEN col1 LIKE '%google%' OR col1 LIKE '%gmail%' THEN 'Google'
>               WHEN col1 LIKE '%yahoo%' THEN 'Yahoo'
>               WHEN col1 LIKE '%bing%' THEN 'Bing'
>               WHEN col1 LIKE '%facebook%' THEN 'Facebook'
>               WHEN col1 LIKE '%pinterest%' THEN 'Pinterest'
>               WHEN col1 LIKE '%twitter%' OR col1 LIKE '%t.co' THEN 'Twitter'
>               WHEN col1 LIKE '%baidu%' THEN 'Baidu'
>               WHEN col1 LIKE '%yandex%' THEN 'Yandex'
>               WHEN col1 LIKE '%aol.%' THEN 'AOL'
>               WHEN col1 LIKE '%ask.%' THEN 'Ask'
>               WHEN col1 LIKE '%duckduckgo.%' THEN 'DuckDuckGo'
>               WHEN col1 LIKE '%t-online.de' THEN 'T-Online'
>               WHEN col1 LIKE '%com-kleinanzeigen.%' OR col1 LIKE 
> '%kleinanzeigen%' THEN 'Kleinanzeigen'
>               WHEN col1 LIKE '%com.%' OR col1 LIKE '%comdesc.%' THEN 'com'
>               WHEN col1 LIKE '%paypal.%' THEN 'PayPal'
>               WHEN col1 IS NULL THEN 'None'
>               ELSE 'Other' END AS source_domain,
>       COUNT(*) AS cnt
> FROM
>       (SELECT *, udf1(col1, 'rd') as col1 FROM tbl) s
> GROUP BY
>       1
> {code}
> It would be great if we could optimize it by the framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29694) Execute UDF only once when there are multiple identical UDF usages

Reply via email to