[ 
https://issues.apache.org/jira/browse/SPARK-44979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Dharme updated SPARK-44979:
----------------------------------
    Shepherd: Deepak Goyal

> Cache results of simple udfs on executors if same arguments are passed.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-44979
>                 URL: https://issues.apache.org/jira/browse/SPARK-44979
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Dinesh Dharme
>            Priority: Minor
>
> Consider two dataframes:
> {{keyword_given = [
> ["green pstr",],
> ["greenpstr",],
> ["wlmrt", ],
> ["walmart",],
> ["walmart super",]
> ]}}
> {{variations = [
> ("type green pstr", "ABC", 100),
> ("type green pstr","PQR",200),
> ("type green pstr", "NZSD", 2999),
> ("wlmrt payment","walmart",200),
> ("wlmrt solutions", "walmart", 200),
> ("nppssdwlmrt", "walmart", 2000)
> ]}}
> {{Imagine I have a task to do fuzzy substring matching between keyword and 
> variation[0] using the built-in levenshtein function. It is possible to optimize 
> this further in the code itself by extracting the unique values, doing the 
> fuzzy matching on those, and joining the results back to the original tables.}}
> {{But as an optimization, Spark could cache the results of the UDF calls 
> computed so far and look them up on each executor separately whenever the 
> same arguments are passed again.}}
> Just a thought. Not sure if it makes any sense. This behaviour could be 
> behind a config.
>  
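
The manual optimization described in the issue (extract the uniques, match on them, join back to the original rows) can be sketched in plain Python rather than Spark. This is only an illustration under stated assumptions: `fuzzy_substring_distance` is a hypothetical helper, and "fuzzy substring matching" is assumed here to mean the smallest edit distance between the keyword and any same-length window of the variation text. The issue does not define it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_substring_distance(keyword: str, text: str) -> int:
    """Hypothetical helper: best edit distance over same-length windows of text."""
    width = len(keyword)
    if width >= len(text):
        return levenshtein(keyword, text)
    return min(
        levenshtein(keyword, text[i:i + width])
        for i in range(len(text) - width + 1)
    )


keyword_given = ["green pstr", "greenpstr", "wlmrt", "walmart", "walmart super"]
variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr", "PQR", 200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment", "walmart", 200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000),
]

# Step 1: extract the unique variation texts (variation[0]).
unique_texts = {row[0] for row in variations}

# Step 2: run the expensive match once per (keyword, unique text) pair.
distances = {
    (keyword, text): fuzzy_substring_distance(keyword, text)
    for keyword in keyword_given
    for text in unique_texts
}

# Step 3: join the precomputed distances back onto every original row.
joined = [
    (keyword, *row, distances[(keyword, row[0])])
    for keyword in keyword_given
    for row in variations
]
```

With the sample data, the match runs for 20 (keyword, unique text) pairs instead of 30 (keyword, row) pairs. In Spark the same shape would presumably be a distinct, a cross join with the built-in function, and a join back.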

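The caching idea itself can be approximated today with ordinary memoization inside the function. The sketch below is plain Python, not a Spark feature: `maybe_cached`, its `enabled` flag (standing in for the proposed config), and the call counter are all hypothetical names introduced for illustration. In PySpark, a cache like this would presumably live in each Python worker process rather than being shared across an executor.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def maybe_cached(func, enabled: bool, maxsize: int = 10_000):
    """Hypothetical switch: wrap func in a bounded LRU cache only if enabled."""
    return lru_cache(maxsize=maxsize)(func) if enabled else func


calls = {"count": 0}  # counts how often the underlying function really runs


def counted_levenshtein(a: str, b: str) -> int:
    calls["count"] += 1
    return levenshtein(a, b)


cached_udf = maybe_cached(counted_levenshtein, enabled=True)

variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr", "PQR", 200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment", "walmart", 200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000),
]

# Six rows go in, but repeated arguments are served from the cache.
keyword = "wlmrt"
results = [cached_udf(keyword, row[0]) for row in variations]
```

Here the six rows contain four distinct texts, so the underlying function runs four times and the remaining two calls are cache hits. The cache is only safe for deterministic functions of hashable arguments, and the bound on `maxsize` is an arbitrary choice to keep memory in check.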


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
