Re: [DISCUSS] SPIP: Improving Spark SQL UDFs with Transpilation

Holden Karau Sun, 28 Dec 2025 23:29:08 -0800

So for vectorized UDF if it's still a simple mathematical expression we
could transpile it. Error message equality I think is out of scope, that's
a good call out.


On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:

> The idea sounds good but I'm also worried about the coverage. In the
> recent Spark releases, pandas/arrow UDFs get more support than the classic
> Python UDFs, but I don't think we can translate pandas/arrow UDFs as we
> don't have vectorized operators in Spark out of the box.
>
> It's also hard to simulate the behaviors exactly, such as overflow
> behavior, NULL behavior, error message, etc. Is 100% same behavior the goal
> of transpilation?
>
> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
> wrote:
>
>> Responses in line, thanks for the questions :)
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Thanks for the proposal. UDF has been known to be noticeably slow,
>>> especially for the language where we run the external process and do
>>> intercommunication, so this is an interesting topic.
>>>
>>> The starting question for this proposal would be the coverage. The
>>> proposal says we create an AST and try to convert it to a catalyst plan.
>>> Since this does not sound like we are generating Java/bytecode so I assume
>>> this only leverages built-in operators/expressions.
>>>
>> Initially yes. Longer term I think it’s possible we explore transpiling
>> to other languages (especially accelerator languages as called out in the
>> docs), but that’s fuzzy.
>>
>>>
>>> That said, when we say "simple" UDF, what is exactly the scope of
>>> "simple" here? For me, it sounds to me like if the UDF can be translated to
>>> a catalyst plan (without UDF), the UDF has actually been something users
>>> could have written via the DataFrame API without UDF, unless we have
>>> non-user-facing expressions where users are needed. Same with Pandas on
>>> Spark for covering Pandas UDF. Do we see such a case e.g. users fail to
>>> write logic based on built-in SQL expressions while they can, and end up
>>> with choosing UDF? I think this needs more clarification given that's
>>> really a user facing contract and the factor of evaluating this project as
>>> a successful one.
>>>
>> Given the transpiration target is Catalyst, yes these would mostly be
>> things someone could express with SQL but expressed in another way.
>>
>> We do have some Catalyst expressions which aren’t directly SQL
>> expressions so not always, but generally.
>>
>> To be clear: I don’t think we should expect users, especially Pandas on
>> Spark users, to rewrite their data frame UDFS to SQL and that’s why this
>> project makes sense.
>>
>>>
>>> Once that is clarified, we may have follow-up questions/voices with the
>>> answer, something along the line:
>>>
>>> 1. It might be the case we may just want this proposal to be direct to
>>> the "future success", translating Python UDF to Java code (codegen) to
>>> cover arbitrary logic (unless it's not involving python library, which we
>>> had to find alternatives).
>>>
>> I think this can be a reasonable follow on this project if this project
>> is successful.
>>
>>>
>>> 2. We might want to make sure this proposal is addressing major use
>>> cases and not just niche cases. e.g. it might be the case the majority of
>>> Python UDF usage is to pull other Python dependencies, then we lose
>>> the majority of cases.
>>>
>> I think we don’t expect to cover the majority of UDFS. Even while
>> covering only the simple cases initially it would have a real performance
>> improvement, especially for Pandas on Spark where people can’t express many
>> of these things easily.
>>
>>>
>>> Hope I understand the proposal well and ask valid questions.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> It's been a few years since we last looked at transpilation, and with
>>>> the growth of Pandas on Spark I think it's time we revisit it. I've got a 
>>>> JIRA
>>>> filed <https://issues.apache.org/jira/browse/SPARK-54783> some rough
>>>> proof of concept code <https://github.com/apache/spark/pull/53547> (I
>>>> think doing the transpilation Python side instead of Scala side makes more
>>>> sense, but was interesting to play with), and  of course everyones 
>>>> favourite
>>>> a design doc.
>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>
>>>>  (I
>>>> also have a collection of YouTube streams playing with the idea
>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to
>>>> follow along on that journey).
>>>>
>>>> Wishing everyone a happy holidays :)
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>

-- 
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

Re: [DISCUSS] SPIP: Improving Spark SQL UDFs with Transpilation

Reply via email to