[ https://issues.apache.org/jira/browse/SPARK-44979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dinesh Dharme updated SPARK-44979:
----------------------------------
    Shepherd: Deepak Goyal

> Cache results of simple udfs on executors if same arguments are passed.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-44979
>                 URL: https://issues.apache.org/jira/browse/SPARK-44979
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Dinesh Dharme
>            Priority: Minor
>
> Consider two dataframes:
>
> {{keyword_given = [
>     ["green pstr"],
>     ["greenpstr"],
>     ["wlmrt"],
>     ["walmart"],
>     ["walmart super"]
> ]}}
>
> {{variations = [
>     ("type green pstr", "ABC", 100),
>     ("type green pstr", "PQR", 200),
>     ("type green pstr", "NZSD", 2999),
>     ("wlmrt payment", "walmart", 200),
>     ("wlmrt solutions", "walmart", 200),
>     ("nppssdwlmrt", "walmart", 2000)
> ]}}
>
> Imagine I have a task to do fuzzy substring matching between keyword and
> variation[0] using the built-in levenshtein function. It is possible to
> optimize this further in the code itself by extracting the unique values,
> doing the fuzzy matching on the uniques only, and joining the results back
> to the original tables.
>
> But it could also be possible, as an optimization, to cache the results of
> the UDF calls computed so far and do a lookup on each executor separately,
> so that a repeated argument (for example "type green pstr", which appears
> three times above) is not recomputed.
>
> Just a thought. Not sure if it makes any sense. This behaviour could be
> behind a config.
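
The caching idea in the description can be illustrated with a minimal sketch in plain Python. This is not Spark code and not a proposed implementation: the `levenshtein` helper, the `cached_distance` wrapper, and the cache size are all hypothetical names and values chosen for illustration, and plain edit distance stands in for the fuzzy substring matching mentioned in the issue. It only shows how memoizing a pure function avoids recomputation when the same arguments recur within one process.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical memoized wrapper; maxsize is an arbitrary assumption.
@lru_cache(maxsize=10_000)
def cached_distance(keyword: str, variation: str) -> int:
    return levenshtein(keyword, variation)


# Data taken from the issue description.
keywords = ["green pstr", "greenpstr", "wlmrt", "walmart", "walmart super"]
variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr", "PQR", 200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment", "walmart", 200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000),
]

# Compare every keyword with every variation text, as a cross join would.
for kw in keywords:
    for text, _, _ in variations:
        cached_distance(kw, text)

# hits counts the calls answered from the cache instead of being recomputed.
info = cached_distance.cache_info()
print(f"calls={info.hits + info.misses} computed={info.misses} cached={info.hits}")
```

If a function like this were wrapped in a PySpark Python UDF, a module-level cache of this kind would typically live in each Python worker process rather than being shared across the cluster, so its benefit depends on how often duplicate arguments land on the same worker. The deduplicate-then-join approach from the description avoids that dependence.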