I don’t think raising an error makes sense; we only expect to cover some simple UDFs, and when a UDF isn’t supported we just execute it as normal.
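To make that concrete, here is roughly the shape I have in mind for the fallback path. This is purely an illustrative sketch, not the proposed API: maybe_transpile, the helper names, and the tiny AST subset it handles are all made up for this email.

import ast
import inspect
import operator

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# The handful of Python operators this sketch knows how to map onto Columns.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def _to_column(node, arg_to_col):
    # Translate a tiny subset of Python AST nodes into built-in Column expressions.
    if isinstance(node, ast.Name) and node.id in arg_to_col:
        return F.col(arg_to_col[node.id])
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return F.lit(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](
            _to_column(node.left, arg_to_col), _to_column(node.right, arg_to_col))
    raise ValueError("not something this sketch can transpile")

def maybe_transpile(py_func, col_names, return_type=LongType()):
    # If py_func is a single "return <arithmetic over its args>" function, build
    # the equivalent built-in expression; anything else falls back to a plain
    # Python UDF over the same columns, i.e. it executes as normal.
    try:
        fn = ast.parse(inspect.getsource(py_func)).body[0]
        (ret,) = fn.body
        if not isinstance(ret, ast.Return):
            raise ValueError("expected a single return statement")
        arg_to_col = dict(zip((a.arg for a in fn.args.args), col_names))
        return _to_column(ret.value, arg_to_col)
    except Exception:
        return F.udf(py_func, return_type)(*[F.col(c) for c in col_names])

def add_one(x):
    return x + 1

# df.withColumn("y", maybe_transpile(add_one, ["x"]))  # becomes col("x") + 1
# A body with loops or library calls takes the except branch and runs as a
# normal Python UDF, so nothing changes for the cases we don't cover.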
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]> wrote:

> One important aspect of coverage is to draw a clear line on what is, and what is not, covered.
> I may go as far as to propose using explicit syntax to denote the intent to transpile. Then, if Spark cannot do it, we can raise an error at DDL time and the user is not at a loss as to why their function is slower than expected.
> Or why a small bugfix in its body suddenly regresses performance.
>
> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]> wrote:
>
> So for vectorized UDFs, if it’s still a simple mathematical expression, we could transpile it. Error message equality I think is out of scope, that’s a good call out.
>
> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
>
>> The idea sounds good but I’m also worried about the coverage. In the recent Spark releases, pandas/arrow UDFs get more support than the classic Python UDFs, but I don’t think we can translate pandas/arrow UDFs as we don’t have vectorized operators in Spark out of the box.
>>
>> It’s also hard to simulate the behaviors exactly, such as overflow behavior, NULL behavior, error messages, etc. Is 100% identical behavior the goal of transpilation?
>>
>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]> wrote:
>>
>>> Responses inline, thanks for the questions :)
>>>
>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Thanks for the proposal. UDFs have been known to be noticeably slow, especially for languages where we run an external process and communicate with it, so this is an interesting topic.
>>>>
>>>> The starting question for this proposal would be the coverage. The proposal says we create an AST and try to convert it to a Catalyst plan. Since this does not sound like we are generating Java code/bytecode, I assume this only leverages built-in operators/expressions.
>>>>
>>> Initially yes. Longer term I think it’s possible we explore transpiling to other languages (especially accelerator languages, as called out in the docs), but that’s fuzzy.
>>>
>>>> That said, when we say "simple" UDF, what exactly is the scope of "simple" here? It sounds to me like if the UDF can be translated to a Catalyst plan (without a UDF), the UDF was actually something users could have written via the DataFrame API without a UDF, unless non-user-facing expressions are needed. The same goes for Pandas on Spark covering pandas UDFs. Do we see such a case, e.g. users fail to write logic based on built-in SQL expressions even though they could, and end up choosing a UDF? I think this needs more clarification, given that it’s really a user-facing contract and a key factor in evaluating whether this project is successful.
>>>>
>>> Given the transpilation target is Catalyst, yes, these would mostly be things someone could express with SQL, just expressed in another way.
>>>
>>> We do have some Catalyst expressions which aren’t directly SQL expressions, so not always, but generally.
>>>
>>> To be clear: I don’t think we should expect users, especially Pandas on Spark users, to rewrite their DataFrame UDFs to SQL, and that’s why this project makes sense.
>>>
>>>> Once that is clarified, we may have follow-up questions/voices, something along the lines of:
>>>>
>>>> 1. It might be the case that we just want this proposal to go straight to the "future success": translating Python UDFs to Java code (codegen) to cover arbitrary logic (unless it involves a Python library, for which we would have to find alternatives).
>>>>
>>> I think this can be a reasonable follow-on if this project is successful.
>>>
>>>> 2. We might want to make sure this proposal addresses major use cases and not just niche cases. E.g. it might be the case that the majority of Python UDF usage is to pull in other Python dependencies, in which case we lose the majority of cases.
>>>>
>>> I think we don’t expect to cover the majority of UDFs. Even while covering only the simple cases initially, it would give a real performance improvement, especially for Pandas on Spark where people can’t express many of these things easily.
>>>
>>>> Hope I understand the proposal well and am asking valid questions.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]> wrote:
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> It’s been a few years since we last looked at transpilation, and with the growth of Pandas on Spark I think it’s time we revisit it. I’ve got a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>, some rough proof-of-concept code <https://github.com/apache/spark/pull/53547> (I think doing the transpilation on the Python side instead of the Scala side makes more sense, but it was interesting to play with), and of course everyone’s favourite, a design doc <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>. (I also have a collection of YouTube streams playing with the idea <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to follow along on that journey.)
>>>>>
>>>>> Wishing everyone happy holidays :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
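P.S. For anyone skimming the thread, the kind of vectorized UDF I mean by "still a simple mathematical expression" is roughly the before/after below. The names are made up and this only illustrates the intent, not the mechanism.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# What users write today: pure element-wise arithmetic in a pandas UDF.
@pandas_udf(DoubleType())
def fahrenheit(celsius: pd.Series) -> pd.Series:
    return celsius * 9.0 / 5.0 + 32.0

# What transpilation would effectively turn it into: the same arithmetic as
# built-in Column expressions, with no Arrow round-trip to a Python worker.
def fahrenheit_native(celsius_col):
    return F.col(celsius_col) * 9.0 / 5.0 + 32.0

# df.withColumn("f", fahrenheit("c"))         # today: evaluated in a Python worker
# df.withColumn("f", fahrenheit_native("c"))  # goal: an equivalent Catalyst plan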
