For vectorized UDFs, if the body is still a simple mathematical expression we could transpile it. Matching error messages exactly is, I think, out of scope; that's a good call-out.
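To make the "simple mathematical expression" case concrete, here is a rough sketch of the idea. It is not the proof-of-concept code from the PR: `transpile_lambda`, `_to_sql`, and the operator table are hypothetical names made up for this illustration, and it emits a SQL expression string instead of building real Catalyst expressions. It assumes the UDF is available as lambda source text and only handles `+`, `-`, `*`, `/`, unary minus, numeric literals, and the lambda's own parameters. Anything else returns None, meaning "keep running the original UDF".

```python
import ast

# Hypothetical operator table for this sketch (not taken from the PoC).
_BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


class Unsupported(Exception):
    """Raised when the expression uses something this sketch cannot translate."""


def _to_sql(node, params):
    """Recursively turn a small subset of Python expression AST into SQL text."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        return f"({left} {_BINOPS[type(node.op)]} {right})"
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return f"(-{_to_sql(node.operand, params)})"
    if isinstance(node, ast.Name) and node.id in params:
        # Assumption: each lambda parameter maps to a column of the same name.
        return node.id
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    raise Unsupported(ast.dump(node))


def transpile_lambda(source):
    """Return a SQL expression string for a simple lambda, or None to fall back."""
    tree = ast.parse(source.strip(), mode="eval")
    if not isinstance(tree.body, ast.Lambda):
        return None
    args = tree.body.args
    if (args.vararg or args.kwarg or args.kwonlyargs
            or args.defaults or args.posonlyargs):
        return None  # only plain positional parameters are handled here
    params = {a.arg for a in args.args}
    try:
        return _to_sql(tree.body.body, params)
    except Unsupported:
        return None
```

Under those assumptions, `transpile_lambda("lambda x, y: x * 2 + y")` gives `((x * 2) + y)`, which could then be handed to something like `pyspark.sql.functions.expr`, while `lambda x: math.sqrt(x)` returns None and falls back to the regular UDF path. The sketch makes no attempt to match overflow, NULL, or error-message behavior.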

On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:

> The idea sounds good but I'm also worried about the coverage. In the
> recent Spark releases, pandas/arrow UDFs get more support than the classic
> Python UDFs, but I don't think we can translate pandas/arrow UDFs as we
> don't have vectorized operators in Spark out of the box.
>
> It's also hard to simulate the behaviors exactly, such as overflow
> behavior, NULL behavior, error messages, etc. Is 100% identical behavior
> the goal of transpilation?
>
> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
> wrote:
>
>> Responses in line, thanks for the questions :)
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Thanks for the proposal. UDFs have been known to be noticeably slow,
>>> especially for languages where we run an external process and do
>>> intercommunication, so this is an interesting topic.
>>>
>>> The starting question for this proposal would be the coverage. The
>>> proposal says we create an AST and try to convert it to a catalyst plan.
>>> Since this does not sound like we are generating Java/bytecode, I assume
>>> this only leverages built-in operators/expressions.
>>>
>> Initially yes. Longer term I think it’s possible we explore transpiling
>> to other languages (especially accelerator languages as called out in the
>> docs), but that’s fuzzy.
>>
>>>
>>> That said, when we say "simple" UDF, what exactly is the scope of
>>> "simple" here?
>>> To me, it sounds like if the UDF can be translated to
>>> a catalyst plan (without UDF), the UDF has actually been something users
>>> could have written via the DataFrame API without UDF, unless we have
>>> non-user-facing expressions where users are needed. Same with Pandas on
>>> Spark for covering Pandas UDF. Do we see such a case, e.g. users fail to
>>> write logic based on built-in SQL expressions while they can, and end up
>>> choosing a UDF? I think this needs more clarification given that's
>>> really a user-facing contract and the factor for evaluating this project
>>> as a successful one.
>>>
>> Given the transpilation target is Catalyst, yes these would mostly be
>> things someone could express with SQL but expressed in another way.
>>
>> We do have some Catalyst expressions which aren’t directly SQL
>> expressions so not always, but generally.
>>
>> To be clear: I don’t think we should expect users, especially Pandas on
>> Spark users, to rewrite their data frame UDFs to SQL and that’s why this
>> project makes sense.
>>
>>>
>>> Once that is clarified, we may have follow-up questions/voices with the
>>> answer, something along the line:
>>>
>>> 1. It might be the case we may just want this proposal to be direct to
>>> the "future success", translating Python UDF to Java code (codegen) to
>>> cover arbitrary logic (unless it's not involving python library, which we
>>> had to find alternatives).
>>>
>> I think this can be a reasonable follow-on to this project if this project
>> is successful.
>>
>>>
>>> 2. We might want to make sure this proposal is addressing major use
>>> cases and not just niche cases. e.g. it might be the case the majority of
>>> Python UDF usage is to pull other Python dependencies, then we lose
>>> the majority of cases.
>>>
>> I think we don’t expect to cover the majority of UDFs.
>> Even while covering only the simple cases initially, it would have a real
>> performance improvement, especially for Pandas on Spark where people can’t
>> express many of these things easily.
>>
>>>
>>> Hope I understand the proposal well and ask valid questions.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> It's been a few years since we last looked at transpilation, and with
>>>> the growth of Pandas on Spark I think it's time we revisit it. I've got a
>>>> JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>, some rough
>>>> proof of concept code <https://github.com/apache/spark/pull/53547> (I
>>>> think doing the transpilation Python side instead of Scala side makes more
>>>> sense, but was interesting to play with), and of course everyone's
>>>> favourite, a design doc
>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>.
>>>> (I also have a collection of YouTube streams playing with the idea
>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to
>>>> follow along on that journey).
>>>>
>>>> Wishing everyone happy holidays :)
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
