Re: [DISCUSS] SPIP: Improving Spark SQL UDFs with Transpilation

Holden Karau Mon, 29 Dec 2025 11:47:39 -0800

Oooh, what about another way: what if we expose, in either the logs or the
query plan, whether a UDF has been transpiled? That way a user investigating
a regression can see what happened.
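
A minimal sketch of what that reporting could look like, assuming a toy transpiler that only handles a single `return` of `+`, `-`, `*` over parameters and numeric literals. Every name here (`try_transpile`, `TranspileResult`, `describe`) is hypothetical and is not taken from the SPIP or the proof-of-concept PR. It emits SQL text rather than real Catalyst expressions, and it ignores the overflow and NULL semantics raised elsewhere in this thread.

```python
import ast
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout; this is not the API from the SPIP or the PoC PR.
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


class Unsupported(Exception):
    """Raised internally when the UDF body uses something we cannot translate."""


@dataclass
class TranspileResult:
    transpiled: bool
    sql_expr: Optional[str]  # SQL text when transpiled, else None
    reason: Optional[str]    # why we fell back, else None


def _to_sql(node: ast.AST, params: set) -> str:
    """Translate a tiny subset of Python expressions into SQL text."""
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _to_sql(node.left, params)
        right = _to_sql(node.right, params)
        return f"({left} {_BIN_OPS[type(node.op)]} {right})"
    if isinstance(node, ast.Name) and node.id in params:
        return node.id
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    raise Unsupported(f"unsupported construct {type(node).__name__}")


def try_transpile(source: str) -> TranspileResult:
    """Best effort: unsupported constructs produce a fallback result, not an error."""
    try:
        module = ast.parse(source)
        if len(module.body) != 1 or not isinstance(module.body[0], ast.FunctionDef):
            raise Unsupported("expected exactly one function definition")
        func = module.body[0]
        if len(func.body) != 1 or not isinstance(func.body[0], ast.Return):
            raise Unsupported("body is not a single return statement")
        if func.body[0].value is None:
            raise Unsupported("return has no value")
        params = {a.arg for a in func.args.args}
        return TranspileResult(True, _to_sql(func.body[0].value, params), None)
    except Unsupported as exc:
        return TranspileResult(False, None, str(exc))


def describe(name: str, result: TranspileResult) -> str:
    """Render one line of the kind that could be surfaced in a plan or a log."""
    if result.transpiled:
        return f"{name}: transpiled to {result.sql_expr}"
    return f"{name}: not transpiled ({result.reason})"


ok = try_transpile("def f(x):\n    return x * 2 + 1")
fallback = try_transpile("def g(x):\n    return len(x)")
print(describe("f", ok))
print(describe("g", fallback))
```

The point of the sketch is that the fallback path carries a reason with it, so a user chasing a regression after a small change to a UDF body could see which construct stopped the transpilation without any new syntax.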


Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Mon, Dec 29, 2025 at 11:45 AM Holden Karau <[email protected]>
wrote:

> So most of our optimizer rules fall back gracefully when they can’t be
> applied; for example, filter pushdown doesn’t raise an error if it can’t
> push a filter through. I’m thinking of this more like an optimizer rule
> personally.
>
> That’s why I don’t think we should try to expose transpilation to the user
> level like that, especially given we want to accelerate pandas on spark
> where we don’t really control the API fully. Do you have an idea of what
> you’d want that to look like though?
>
> On Mon, Dec 29, 2025 at 9:55 AM serge rielau.com <[email protected]> wrote:
>
>> How about a compromise? If the user declares, via a syntax clause, that
>> they expect transpilation and Spark cannot do it, we raise an error.
>> If the user says nothing, then it’s best effort.
>> That’s also an easy way for a user to verify whether transpilation
>> applies to their code.
>> On Dec 29, 2025 at 9:04 AM -0800, Holden Karau <[email protected]>,
>> wrote:
>>
>> I don’t think raising an error makes sense; we only expect to cover some
>> simple UDFs, and when one is not supported we execute it as normal.
>>
>> On Mon, Dec 29, 2025 at 8:33 AM serge rielau.com <[email protected]>
>> wrote:
>>
>>> One important aspect of coverage is to draw a clear line on what is, and
>>> what is not, covered.
>>> I may go as far as to propose explicit syntax to denote the intent
>>> to transpile. Then, if Spark cannot do it, we can raise an error at DDL
>>> time and the user is not at a loss as to why their function is slower
>>> than expected, or why a small bugfix in its body suddenly regresses
>>> performance.
>>>
>>>
>>> On Dec 28, 2025, at 11:28 PM, Holden Karau <[email protected]>
>>> wrote:
>>>
>>> So for a vectorized UDF, if it's still a simple mathematical expression we
>>> could transpile it. Error message equality I think is out of scope; that's
>>> a good call out.
>>>
>>> On Sun, Dec 21, 2025 at 6:42 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> The idea sounds good but I'm also worried about the coverage. In the
>>>> recent Spark releases, pandas/arrow UDFs get more support than the classic
>>>> Python UDFs, but I don't think we can translate pandas/arrow UDFs as we
>>>> don't have vectorized operators in Spark out of the box.
>>>>
>>>> It's also hard to simulate the behaviors exactly, such as overflow
>>>> behavior, NULL behavior, error message, etc. Is 100% same behavior the goal
>>>> of transpilation?
>>>>
>>>> On Sat, Dec 20, 2025 at 5:14 PM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> Responses in line, thanks for the questions :)
>>>>>
>>>>> On Fri, Dec 19, 2025 at 10:35 PM Jungtaek Lim <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks for the proposal. UDF has been known to be noticeably slow,
>>>>>> especially for the language where we run the external process and do
>>>>>> intercommunication, so this is an interesting topic.
>>>>>>
>>>>>> The starting question for this proposal would be the coverage. The
>>>>>> proposal says we create an AST and try to convert it to a catalyst plan.
>>>>>> Since this does not sound like we are generating Java/bytecode, I
>>>>>> assume this only leverages built-in operators/expressions.
>>>>>>
>>>>> Initially yes. Longer term I think it’s possible we explore
>>>>> transpiling to other languages (especially accelerator languages as called
>>>>> out in the docs), but that’s fuzzy.
>>>>>
>>>>>>
>>>>>> That said, when we say "simple" UDF, what exactly is the scope of
>>>>>> "simple" here? To me it sounds like if the UDF can be translated to
>>>>>> a catalyst plan (without UDF), the UDF is something users could have
>>>>>> written via the DataFrame API without a UDF, unless there are
>>>>>> non-user-facing expressions that users need. Same with Pandas on
>>>>>> Spark for covering Pandas UDFs. Do we see such cases, e.g. users who
>>>>>> fail to write logic with built-in SQL expressions even though they
>>>>>> can, and end up choosing a UDF? I think this needs more clarification
>>>>>> given that it is really a user-facing contract and the factor for
>>>>>> evaluating whether this project is a success.
>>>>>>
>>>>> Given the transpilation target is Catalyst, yes these would mostly be
>>>>> things someone could express with SQL but expressed in another way.
>>>>>
>>>>> We do have some Catalyst expressions which aren’t directly SQL
>>>>> expressions so not always, but generally.
>>>>>
>>>>> To be clear: I don’t think we should expect users, especially Pandas
>>>>> on Spark users, to rewrite their data frame UDFs to SQL, and that’s why
>>>>> this project makes sense.
>>>>>
>>>>>>
>>>>>> Once that is clarified, we may have follow-up questions/voices on
>>>>>> the answer, something along these lines:
>>>>>>
>>>>>> 1. It might be the case that we want this proposal to aim directly
>>>>>> at the "future success": translating Python UDFs to Java code (codegen)
>>>>>> to cover arbitrary logic (as long as it does not involve Python
>>>>>> libraries, for which we would have to find alternatives).
>>>>>>
>>>>> I think this can be a reasonable follow-on to this project if this
>>>>> project is successful.
>>>>>
>>>>>>
>>>>>> 2. We might want to make sure this proposal is addressing major use
>>>>>> cases and not just niche cases. e.g. it might be the case the majority of
>>>>>> Python UDF usage is to pull other Python dependencies, then we lose
>>>>>> the majority of cases.
>>>>>>
>>>>> I don’t think we expect to cover the majority of UDFs. Even while
>>>>> covering only the simple cases initially, it would bring a real
>>>>> performance improvement, especially for Pandas on Spark where people
>>>>> can’t express many of these things easily.
>>>>>
>>>>>>
>>>>>> Hope I understand the proposal well and ask valid questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>> On Sat, Dec 20, 2025 at 5:42 AM Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Folks,
>>>>>>>
>>>>>>> It's been a few years since we last looked at transpilation, and
>>>>>>> with the growth of Pandas on Spark I think it's time we revisit it. I've
>>>>>>> got a JIRA filed <https://issues.apache.org/jira/browse/SPARK-54783>,
>>>>>>> some rough proof of concept code
>>>>>>> <https://github.com/apache/spark/pull/53547> (I think doing the
>>>>>>> transpilation Python side instead of Scala side makes more sense, but it
>>>>>>> was interesting to play with), and of course everyone's favourite, a
>>>>>>> design doc
>>>>>>> <https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?usp=sharing>.
>>>>>>> (I also have a collection of YouTube streams playing with the idea
>>>>>>> <https://www.youtube.com/@HoldenKarau/streams> if anyone wants to
>>>>>>> follow along on that journey.)
>>>>>>>
>>>>>>> Wishing everyone happy holidays :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Holden :)
>>>>>>>
>>>>>>>
>>>>>>
>>>
