The idea sounds good, but I'm also worried about the coverage. In recent Spark releases, pandas/Arrow UDFs get more support than the classic Python UDFs, but I don't think we can translate pandas/Arrow UDFs, as we don't have vectorized operators in Spark out of the box.
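To make the classic-UDF case concrete, here is a purely illustrative sketch (not the actual PoC from SPARK-54783 / PR 53547; all names like `transpile` and `BIN_OPS` are hypothetical) of what "translating a simple UDF" could mean: walk the function's AST and emit an equivalent SQL expression string that Catalyst could plan natively.

```python
import ast

# Hypothetical sketch: map a single-return Python function's AST onto a SQL
# expression string. Only arithmetic on arguments and literals is handled.
BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def transpile(src):
    """Translate a single-return Python function (given as source) to SQL."""
    fn = ast.parse(src).body[0]
    ret = fn.body[0]
    if not isinstance(ret, ast.Return):
        raise ValueError("only single-return functions are supported")
    return _emit(ret.value)

def _emit(node):
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return f"({_emit(node.left)} {BIN_OPS[type(node.op)]} {_emit(node.right)})"
    if isinstance(node, ast.Name):      # function argument -> column reference
        return node.id
    if isinstance(node, ast.Constant):  # Python literal -> SQL literal
        return repr(node.value)
    raise ValueError(f"unsupported construct: {type(node).__name__}")

print(transpile("def fahrenheit(c):\n    return c * 9 / 5 + 32\n"))
# prints (((c * 9) / 5) + 32)
```

A real implementation would also have to handle types, NULLs, and control flow, and fall back to the regular UDF path on any unsupported node; the point here is only the shape of the AST-to-expression mapping.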
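A separate hurdle is matching semantics exactly. One concrete gap: a classic Python UDF computes with arbitrary-precision ints, while (to my understanding) Catalyst LongType arithmetic with ANSI mode off follows Java's 64-bit wraparound. A Spark-free toy sketch of that divergence:

```python
# Toy illustration (assumption: non-ANSI Catalyst LongType `+` wraps at 64
# bits like Java, while a classic Python UDF never overflows).
MASK = (1 << 64) - 1

def py_udf_add(a, b):
    # What a classic Python UDF computes: arbitrary precision, no overflow.
    return a + b

def long_add(a, b):
    # What LongType addition does with ANSI mode off: two's-complement wrap.
    r = (a + b) & MASK
    return r - (1 << 64) if r >= (1 << 63) else r

big = 2**63 - 1  # Long.MaxValue
print(py_udf_add(big, 1))  # 9223372036854775808
print(long_add(big, 1))    # -9223372036854775808
```

With ANSI mode on, the same addition raises an overflow error instead, so a transpiler would need per-mode behavior just for this one operator.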
It's also hard to simulate the behaviors exactly, such as overflow behavior, NULL behavior, error messages, etc. Is 100% identical behavior the goal of transpilation?

On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]> wrote:

> Responses in line, thanks for the questions :)
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <[email protected]> wrote:
>
>> Thanks for the proposal. UDFs have been known to be noticeably slow,
>> especially for languages where we run an external process and do
>> inter-process communication, so this is an interesting topic.
>>
>> The starting question for this proposal would be the coverage. The
>> proposal says we create an AST and try to convert it to a Catalyst plan.
>> Since this does not sound like we are generating Java/bytecode, I assume
>> this only leverages built-in operators/expressions.
>>
> Initially yes. Longer term I think it's possible we explore transpiling to
> other languages (especially accelerator languages, as called out in the
> docs), but that's fuzzy.
>
>> That said, when we say "simple" UDF, what exactly is the scope of
>> "simple" here? To me, it sounds like if a UDF can be translated to a
>> Catalyst plan (without a UDF), the UDF was actually something users could
>> have written via the DataFrame API without a UDF, unless we rely on
>> non-user-facing expressions where needed. The same applies to Pandas on
>> Spark covering pandas UDFs. Do we actually see such cases, e.g. users
>> failing to write logic with built-in SQL expressions even though they
>> could, and ending up choosing a UDF?
>> I think this needs more clarification, given that's really a
>> user-facing contract and a factor in evaluating whether this project is
>> successful.
>>
> Given the transpilation target is Catalyst, yes, these would mostly be
> things someone could express with SQL, just expressed in another way.
>
> We do have some Catalyst expressions which aren't directly SQL
> expressions, so not always, but generally.
>
> To be clear: I don't think we should expect users, especially Pandas on
> Spark users, to rewrite their DataFrame UDFs as SQL, and that's why this
> project makes sense.
>
>> Once that is clarified, we may have follow-up questions/voices along
>> these lines:
>>
>> 1. It might be the case that we want this proposal to go directly to the
>> "future success": translating Python UDFs to Java code (codegen) to cover
>> arbitrary logic (unless it involves a Python library, for which we'd have
>> to find alternatives).
>>
> I think this can be a reasonable follow-on if this project is successful.
>
>> 2. We might want to make sure this proposal addresses major use cases and
>> not just niche cases. E.g. it might be that the majority of Python UDF
>> usage is to pull in other Python dependencies, in which case we lose the
>> majority of cases.
>>
> I don't think we expect to cover the majority of UDFs. Even covering only
> the simple cases initially would be a real performance improvement,
> especially for Pandas on Spark, where people can't easily express many of
> these things.
>
>> Hope I understand the proposal well and am asking valid questions.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>> wrote:
>>
>>> Hi Folks,
>>>
>>> It's been a few years since we last looked at transpilation, and with
>>> the growth of Pandas on Spark I think it's time we revisit it.
>>> I've got a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>,
>>> some rough proof-of-concept code <https://github.com/apache/spark/pull/53547>
>>> (I think doing the transpilation Python-side instead of Scala-side makes
>>> more sense, but it was interesting to play with), and of course everyone's
>>> favourite, a design doc
>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>.
>>> (I also have a collection of YouTube streams playing with the idea
>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to follow
>>> along on that journey.)
>>>
>>> Wishing everyone happy holidays :)
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
