[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943176#comment-16943176 ]
Anton Baranau edited comment on SPARK-18748 at 10/2/19 9:32 PM: ---------------------------------------------------------------- I got the same problem having the code below with 2.4.4 {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 'scores.score2') {code} In my case data isn't huge so I can afford to cache it like below {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1', 'scores.score2') {code} was (Author: ahtokca): I got the same problem having the code below with 2.4.4 {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 'scores.score2') {code} In my case data isn't huge so I can afford to cahce it like below {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1', 'scores.score2') {code} > UDF multiple evaluations causes very poor performance > ----------------------------------------------------- > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.0, 2.4.0 > Reporter: Ohad Raviv > Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +-------+ > | c| > +-------+ > |nothing| > +-------+ > {quote} > You can see that for each reference of column "c" you will get the println. > that causes very poor performance for our real use case. > This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we loose some of the optimizations like push down > predicate features, etc. and its very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org