[ https://issues.apache.org/jira/browse/SPARK-44979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dinesh Dharme updated SPARK-44979:
----------------------------------
    Description:
Consider two dataframes:

{{keyword_given = [
    ["green pstr",],
    ["greenpstr",],
    ["wlmrt", ],
    ["walmart",],
    ["walmart super",]
]}}

{{variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr","PQR",200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment","walmart",200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000)
]}}

Imagine I have a task to do fuzzy substring matching between keyword and variation[0] using the built-in levenshtein function. It is possible to optimize this further in the code itself: extract the unique values, do the fuzzy matching on the uniques only, and join the results back with the original tables.

But it could also be possible, as an optimization, to cache the results of the UDF calls computed so far and do a lookup on each executor separately.

Just a thought. Not sure if it makes any sense. This behaviour could be behind a config.

  was:
Consider two dataframes:

{{keyword_given = [
    ["green pstr",],
    ["greenpstr",],
    ["wlmrt", ],
    ["walmart",],
    ["walmart super",]
]}}

{{variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr","PQR",200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment","walmart",200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000)
]}}

Imagine I have a task to do fuzzy substring matching between keyword and variation[0] using the built-in levenshtein function. It is possible to optimize this further in the code itself: extract the unique values, do the fuzzy matching on the uniques only, and join the results back with the original table.

But it could also be possible, as an optimization, to cache the results of the UDF calls computed so far and do a lookup on each executor separately.

Just a thought. Not sure if it makes any sense. This behaviour could be behind a config.

> Cache results of simple udfs on executors if same arguments are passed.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-44979
>                 URL: https://issues.apache.org/jira/browse/SPARK-44979
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Dinesh Dharme
>            Priority: Minor
>
> Consider two dataframes:
>
> {{keyword_given = [
>     ["green pstr",],
>     ["greenpstr",],
>     ["wlmrt", ],
>     ["walmart",],
>     ["walmart super",]
> ]}}
>
> {{variations = [
>     ("type green pstr", "ABC", 100),
>     ("type green pstr","PQR",200),
>     ("type green pstr", "NZSD", 2999),
>     ("wlmrt payment","walmart",200),
>     ("wlmrt solutions", "walmart", 200),
>     ("nppssdwlmrt", "walmart", 2000)
> ]}}
>
> Imagine I have a task to do fuzzy substring matching between keyword and
> variation[0] using the built-in levenshtein function. It is possible to
> optimize this further in the code itself: extract the unique values, do the
> fuzzy matching on the uniques only, and join the results back with the
> original tables.
>
> But it could also be possible, as an optimization, to cache the results of
> the UDF calls computed so far and do a lookup on each executor separately.
>
> Just a thought. Not sure if it makes any sense. This behaviour could be
> behind a config.
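
To make the caching idea in the description concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use any Spark API: `edit_distance` is a hand-written stand-in for the built-in function, `match_all` is a hypothetical helper, and the `maxsize` bound is only an assumed stand-in for the config mentioned above. A per-process `functools.lru_cache` plays the role of the per-executor lookup.

```python
from functools import lru_cache

keyword_given = [["green pstr"], ["greenpstr"], ["wlmrt"], ["walmart"], ["walmart super"]]
variations = [
    ("type green pstr", "ABC", 100),
    ("type green pstr", "PQR", 200),
    ("type green pstr", "NZSD", 2999),
    ("wlmrt payment", "walmart", 200),
    ("wlmrt solutions", "walmart", 200),
    ("nppssdwlmrt", "walmart", 2000),
]


# Assumption: the bound below is illustrative; the issue only says the
# behaviour "could be behind a config" and names no setting.
@lru_cache(maxsize=10_000)
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def match_all(keywords, variation_rows):
    """Score every (keyword, variation row) pair; repeated arguments hit the cache."""
    return [
        (kw, row, edit_distance(kw, row[0]))
        for (kw,) in keywords
        for row in variation_rows
    ]


results = match_all(keyword_given, variations)
info = edit_distance.cache_info()
print(f"calls={len(results)} computed={info.misses} served_from_cache={info.hits}")
```

With the sample data, "type green pstr" appears in three rows, so each keyword triggers one real computation for that string and two cache lookups. The manual alternative from the description (deduplicate, match, join back) gets the same saving without a cache, at the cost of extra join logic.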