Oooh, what about another way: what if we expose, in either the logs or the query plan, whether a UDF has been transpiled? That way a user investigating a regression can see what happened.
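A rough sketch of what that could look like, in stdlib-only Python. This is not Spark code and not what the proof of concept does: `try_transpile`, `TranspileResult`, `plan_line`, the logger name, and the plan-line wording are all hypothetical names made up for illustration, and the supported subset (a single `return` over `+`, `-`, `*`, parameters, and numeric literals) is an assumption. NULL and overflow semantics are ignored entirely.

```python
import ast
import logging
import textwrap
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("udf_transpile_sketch")  # hypothetical logger name

# Binary operators this toy transpiler knows how to express.
_BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


class Unsupported(Exception):
    """Raised internally when a UDF body falls outside the supported subset."""


@dataclass
class TranspileResult:
    name: str
    transpiled: bool
    sql: Optional[str]     # SQL-like expression text when transpiled
    reason: Optional[str]  # why we fell back when not transpiled


def _to_sql(node: ast.AST, params: set) -> str:
    """Translate one expression node, or raise Unsupported."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        return f"({left} {_BINOPS[type(node.op)]} {right})"
    if isinstance(node, ast.Name) and node.id in params:
        return node.id
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    raise Unsupported(type(node).__name__)


def try_transpile(source: str) -> TranspileResult:
    """Best effort: unsupported bodies are reported as a fallback, not an error."""
    func = ast.parse(textwrap.dedent(source)).body[0]
    name = getattr(func, "name", "<unknown>")
    try:
        if not isinstance(func, ast.FunctionDef):
            raise Unsupported("not a function definition")
        body = func.body
        if len(body) != 1 or not isinstance(body[0], ast.Return) or body[0].value is None:
            raise Unsupported("body is not a single return expression")
        params = {a.arg for a in func.args.args}
        result = TranspileResult(name, True, _to_sql(body[0].value, params), None)
    except Unsupported as exc:
        result = TranspileResult(name, False, None, str(exc))
    # The "logs" half of the idea: always record the decision.
    log.info("UDF %s transpiled=%s reason=%s", result.name, result.transpiled, result.reason)
    return result


def plan_line(result: TranspileResult) -> str:
    """The "query plan" half: a made-up plan line showing the decision."""
    if result.transpiled:
        return f"TranspiledUDF {result.name} -> {result.sql}"
    return f"PythonUDF {result.name} (not transpiled: {result.reason})"


if __name__ == "__main__":
    print(plan_line(try_transpile("def add_one(x):\n    return x + 1\n")))
    print(plan_line(try_transpile("def shout(s):\n    return s.upper()\n")))
```

The point of the sketch is only that the fallback stays silent at execution time (like an optimizer rule that does not apply) while the decision and the reason are still visible to someone reading the plan or the logs.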
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Mon, Dec 29, 2025 at 11:45 AM Holden Karau <[email protected]> wrote:

> So most of our optimizer rules fall back gracefully when they can’t be
> applied; for example, filter pushdown doesn’t raise an error if it can’t
> push a filter through. I’m thinking of this more like an optimizer rule,
> personally.
>
> That’s why I don’t think we should try to expose transpilation at the user
> level like that, especially given we want to accelerate pandas on Spark,
> where we don’t really control the API fully. Do you have an idea of what
> you’d want that to look like, though?
>
> On Mon, Dec 29, 2025 at 9:55 AM serge rielau.com <[email protected]> wrote:
>
>> How about a compromise? If the user signals, via a syntax clause, that
>> they expect transpilation, we raise an error when it can’t be applied.
>> If the user says nothing, then it’s best effort.
>> That’s also an easy way for a user to verify whether their code applies.
>>
>> On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>,
>> wrote:
>>
>> I don’t think raising an error makes sense; we only expect to cover some
>> simple UDFs, and when they’re not supported we execute them as normal.
>>
>> On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]>
>> wrote:
>>
>>> One important aspect of coverage is to draw a clear line on what is, and
>>> what is not, covered.
>>> I may go as far as to propose using explicit syntax to denote the intent
>>> to transpile. Then, if Spark cannot do it, we can raise an error at DDL
>>> time, and the user is not at a loss as to why their function is slower
>>> than expected, or why a small bugfix in its body suddenly regresses
>>> performance.
>>>
>>> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]>
>>> wrote:
>>>
>>> So for vectorized UDFs, if it's still a simple mathematical expression we
>>> could transpile it. Error message equality I think is out of scope; that's
>>> a good callout.
>>>
>>> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> The idea sounds good, but I'm also worried about the coverage. In the
>>>> recent Spark releases, pandas/arrow UDFs get more support than the classic
>>>> Python UDFs, but I don't think we can translate pandas/arrow UDFs, as we
>>>> don't have vectorized operators in Spark out of the box.
>>>>
>>>> It's also hard to simulate the behaviors exactly, such as overflow
>>>> behavior, NULL behavior, error messages, etc. Is 100% identical behavior
>>>> the goal of transpilation?
>>>>
>>>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> Responses inline, thanks for the questions :)
>>>>>
>>>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks for the proposal. UDFs are known to be noticeably slow,
>>>>>> especially for languages where we run an external process and
>>>>>> communicate with it, so this is an interesting topic.
>>>>>>
>>>>>> The starting question for this proposal would be the coverage. The
>>>>>> proposal says we create an AST and try to convert it to a Catalyst plan.
>>>>>> Since this does not sound like we are generating Java/bytecode, I
>>>>>> assume this only leverages built-in operators/expressions.
>>>>>>
>>>>> Initially yes. Longer term I think it’s possible we explore
>>>>> transpiling to other languages (especially accelerator languages, as
>>>>> called out in the docs), but that’s fuzzy.
>>>>>
>>>>>>
>>>>>> That said, when we say "simple" UDF, what exactly is the scope of
>>>>>> "simple" here? To me, it sounds like if the UDF can be translated to
>>>>>> a Catalyst plan (without a UDF), the UDF is something users could have
>>>>>> written via the DataFrame API without a UDF, unless there are
>>>>>> non-user-facing expressions that users need. The same goes for Pandas
>>>>>> on Spark covering Pandas UDFs. Do we see such cases, e.g. users failing
>>>>>> to write logic with built-in SQL expressions even though they could,
>>>>>> and ending up choosing a UDF? I think this needs more clarification,
>>>>>> given that it's really a user-facing contract and a factor in
>>>>>> evaluating whether this project is successful.
>>>>>>
>>>>> Given the transpilation target is Catalyst, yes, these would mostly be
>>>>> things someone could express with SQL but expressed in another way.
>>>>>
>>>>> We do have some Catalyst expressions which aren’t directly SQL
>>>>> expressions, so not always, but generally.
>>>>>
>>>>> To be clear: I don’t think we should expect users, especially Pandas
>>>>> on Spark users, to rewrite their DataFrame UDFs to SQL, and that’s why
>>>>> this project makes sense.
>>>>>
>>>>>>
>>>>>> Once that is clarified, we may have follow-up questions/opinions based
>>>>>> on the answer, something along these lines:
>>>>>>
>>>>>> 1. It might be that we just want this proposal to go directly to the
>>>>>> "future success": translating Python UDFs to Java code (codegen) to
>>>>>> cover arbitrary logic (unless it involves Python libraries, for which
>>>>>> we would have to find alternatives).
>>>>>>
>>>>> I think this can be a reasonable follow-on to this project if this
>>>>> project is successful.
>>>>>
>>>>>>
>>>>>> 2. We might want to make sure this proposal is addressing major use
>>>>>> cases and not just niche cases. For example, if the majority of
>>>>>> Python UDF usage is to pull in other Python dependencies, then we lose
>>>>>> the majority of cases.
>>>>>>
>>>>> I don’t think we expect to cover the majority of UDFs. Even covering
>>>>> only the simple cases initially would yield a real performance
>>>>> improvement, especially for Pandas on Spark, where people can’t express
>>>>> many of these things easily.
>>>>>
>>>>>>
>>>>>> I hope I understood the proposal well and asked valid questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>>
>>>>>>> It's been a few years since we last looked at transpilation, and
>>>>>>> with the growth of Pandas on Spark I think it's time we revisit it. I've
>>>>>>> got a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>,
>>>>>>> some rough proof-of-concept code
>>>>>>> <https://github.com/apache/spark/pull/53547> (I think doing the
>>>>>>> transpilation Python side instead of Scala side makes more sense, but
>>>>>>> it was interesting to play with), and of course everyone's favourite,
>>>>>>> a design doc
>>>>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>.
>>>>>>> (I also have a collection of YouTube streams playing with the idea
>>>>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to
>>>>>>> follow along on that journey.)
>>>>>>>
>>>>>>> Wishing everyone happy holidays :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Holden :)
