I don’t think raising an error makes sense; we only expect to cover some simple UDFs, and when a UDF isn’t supported we just execute it as normal.
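To make that concrete, here is roughly the shape I have in mind for the fallback path. This is purely an illustrative sketch, not the proposed API: maybe_transpile, the helper names, and the tiny AST subset it handles are all made up for this email.

import ast
import inspect
import operator

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# The handful of Python operators this sketch knows how to map onto Columns.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def _to_column(node, arg_to_col):
    # Translate a tiny subset of Python AST nodes into built-in Column expressions.
    if isinstance(node, ast.Name) and node.id in arg_to_col:
        return F.col(arg_to_col[node.id])
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return F.lit(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](
            _to_column(node.left, arg_to_col), _to_column(node.right, arg_to_col))
    raise ValueError("not something this sketch can transpile")

def maybe_transpile(py_func, col_names, return_type=LongType()):
    # If py_func is a single "return <arithmetic over its args>" function, build
    # the equivalent built-in expression; anything else falls back to a plain
    # Python UDF over the same columns, i.e. it executes as normal.
    try:
        fn = ast.parse(inspect.getsource(py_func)).body[0]
        (ret,) = fn.body
        if not isinstance(ret, ast.Return):
            raise ValueError("expected a single return statement")
        arg_to_col = dict(zip((a.arg for a in fn.args.args), col_names))
        return _to_column(ret.value, arg_to_col)
    except Exception:
        return F.udf(py_func, return_type)(*[F.col(c) for c in col_names])

def add_one(x):
    return x + 1

# df.withColumn("y", maybe_transpile(add_one, ["x"]))  # becomes col("x") + 1
# A body with loops or library calls takes the except branch and runs as a
# normal Python UDF, so nothing changes for the cases we don't cover.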
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]> wrote:

> One important aspect of coverage is to draw a clear line on what is, and what is not, covered.
> I may go as far as to propose using explicit syntax to denote the intent to transpile. Then, if Spark cannot do it, we can raise an error at DDL time and the user is not at a loss as to why their function is slower than expected.
> Or why a small bugfix in its body suddenly regresses performance.
>
> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]> wrote:
>
> So for vectorized UDFs, if it’s still a simple mathematical expression, we could transpile it. Error message equality I think is out of scope, that’s a good call out.
>
> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
>
>> The idea sounds good but I’m also worried about the coverage. In the recent Spark releases, pandas/arrow UDFs get more support than the classic Python UDFs, but I don’t think we can translate pandas/arrow UDFs as we don’t have vectorized operators in Spark out of the box.
>>
>> It’s also hard to simulate the behaviors exactly, such as overflow behavior, NULL behavior, error messages, etc. Is 100% identical behavior the goal of transpilation?
>>
>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]> wrote:
>>
>>> Responses inline, thanks for the questions :)
>>>
>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Thanks for the proposal. UDFs have been known to be noticeably slow, especially for languages where we run an external process and communicate with it, so this is an interesting topic.
>>>>
>>>> The starting question for this proposal would be the coverage. The proposal says we create an AST and try to convert it to a Catalyst plan. Since this does not sound like we are generating Java code/bytecode, I assume this only leverages built-in operators/expressions.
>>>>
>>> Initially yes. Longer term I think it’s possible we explore transpiling to other languages (especially accelerator languages, as called out in the docs), but that’s fuzzy.
>>>
>>>> That said, when we say "simple" UDF, what exactly is the scope of "simple" here? It sounds to me like if the UDF can be translated to a Catalyst plan (without a UDF), the UDF was actually something users could have written via the DataFrame API without a UDF, unless non-user-facing expressions are needed. The same goes for Pandas on Spark covering pandas UDFs. Do we see such a case, e.g. users fail to write logic based on built-in SQL expressions even though they could, and end up choosing a UDF? I think this needs more clarification, given that it’s really a user-facing contract and a key factor in evaluating whether this project is successful.
>>>>
>>> Given the transpilation target is Catalyst, yes, these would mostly be things someone could express with SQL, just expressed in another way.
>>>
>>> We do have some Catalyst expressions which aren’t directly SQL expressions, so not always, but generally.
>>>
>>> To be clear: I don’t think we should expect users, especially Pandas on Spark users, to rewrite their DataFrame UDFs to SQL, and that’s why this project makes sense.
>>>
>>>> Once that is clarified, we may have follow-up questions/voices, something along the lines of:
>>>>
>>>> 1. It might be the case that we just want this proposal to go straight to the "future success": translating Python UDFs to Java code (codegen) to cover arbitrary logic (unless it involves a Python library, for which we would have to find alternatives).
>>>>
>>> I think this can be a reasonable follow-on if this project is successful.
>>>
>>>> 2. We might want to make sure this proposal addresses major use cases and not just niche cases. E.g. it might be the case that the majority of Python UDF usage is to pull in other Python dependencies, in which case we lose the majority of cases.
>>>>
>>> I think we don’t expect to cover the majority of UDFs. Even while covering only the simple cases initially, it would give a real performance improvement, especially for Pandas on Spark where people can’t express many of these things easily.
>>>
>>>> Hope I understand the proposal well and am asking valid questions.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> It’s been a few years since we last looked at transpilation, and with the growth of Pandas on Spark I think it’s time we revisit it. I’ve got a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>, some rough proof-of-concept code <https://github.com/apache/spark/pull/53547> (I think doing the transpilation on the Python side instead of the Scala side makes more sense, but it was interesting to play with), and of course everyone’s favourite, a design doc <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>. (I also have a collection of YouTube streams playing with the idea <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to follow along on that journey.)
>>>>>
>>>>> Wishing everyone happy holidays :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
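P.S. For anyone skimming the thread, the kind of vectorized UDF I mean by "still a simple mathematical expression" is roughly the before/after below. The names are made up and this only illustrates the intent, not the mechanism.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# What users write today: pure element-wise arithmetic in a pandas UDF.
@pandas_udf(DoubleType())
def fahrenheit(celsius: pd.Series) -> pd.Series:
    return celsius * 9.0 / 5.0 + 32.0

# What transpilation would effectively turn it into: the same arithmetic as
# built-in Column expressions, with no Arrow round-trip to a Python worker.
def fahrenheit_native(celsius_col):
    return F.col(celsius_col) * 9.0 / 5.0 + 32.0

# df.withColumn("f", fahrenheit("c"))         # today: evaluated in a Python worker
# df.withColumn("f", fahrenheit_native("c"))  # goal: an equivalent Catalyst plan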
